Human Motion Analysis, Lecture 6: Bayesian Filtering
Raquel Urtasun
TTI Chicago
March 29, 2010
Materials used for this lecture
This lecture is based on Zhe Chen's paper "Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond".

I would like to thank David Fleet for his slides on the subject.

To know more about sampling, look at David MacKay's book "Information Theory, Inference, and Learning Algorithms", Cambridge University Press (2003).
Contents of today's lecture
We will look into
Stochastic Filtering Theory: Kalman filtering (1940's, by Wiener and Kolmogorov)

Bayesian Theory and Bayesian Filtering (Bayes 1763, rediscovered by Laplace)

Monte Carlo methods and Monte Carlo Filtering (Buffon 1777; modern versions in the 1940's in physics and 1950's in statistics)
Monte Carlo approaches
Monte Carlo techniques are stochastic sampling approaches aiming to tackle complex systems that are analytically intractable.

Sequential Monte Carlo allows on-line estimation by combining Monte Carlo sampling methods with Bayesian inference.

Particle filter: sequential Monte Carlo used for parameter estimation and state estimation.

A particle filter uses a number of independent random variables called particles, sampled directly from the state space, to represent the posterior probability, and updates the posterior by incorporating the new observations; the particle system is properly located, weighted, and propagated recursively according to the Bayesian rule.

Particle filters are not the only way to tackle Bayesian filtering, e.g., differential geometry, variational methods, conjugate methods.
A few words on Particle Filters
Kalman filtering is a special case of Bayesian filtering with linear, quadratic, and Gaussian assumptions (LQG).

We will look into the more general case of non-linear, non-Gaussian and non-stationary distributions.

Generally, for non-linear filtering, no exact solution can be computed; hence we rely on numerical approximation methods.

We will focus on sequential Monte Carlo (i.e., the particle filter).
Notation
$y$: the observations
$x$: the state
$N$: number of samples
$y_{n:0}$: observations up to time $n$
$x_{n:0}$: state up to time $n$
$x_n^{(i)}$: $i$-th sample at time $n$
Concept of sampling
The true distribution $P(x)$ can be approximated by an empirical distribution

$$\hat{P}(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x^{(i)})$$

where $\int_X dP(x) = \int_X p(x)\,dx = 1$.

Figure: Sample approximation to the density of a probability distribution (Chen 03).
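The following minimal sketch (assuming NumPy; the target $\mathcal{N}(1, 4)$ and test function $f(x) = x^2$ are illustrative choices, not from the lecture) shows how expectations under $P$ become sample averages under the empirical distribution $\hat{P}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
samples = rng.normal(loc=1.0, scale=2.0, size=N)   # x^(i) ~ P = N(1, 4)

# E_P[f(x)] for f(x) = x^2, approximated by (1/N) sum_i f(x^(i))
f = lambda x: x ** 2
estimate = f(samples).mean()
exact = 1.0 ** 2 + 2.0 ** 2    # E[x^2] = mu^2 + sigma^2 = 5
print(estimate, exact)         # the estimate approaches 5 as N grows
```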
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information about a quantity of interest at time $t$ by using data measured up to and including $t$.

Definition
Prediction derives information about what the quantity of interest will be at time $t + \tau$ in the future ($\tau > 0$) by using data measured up to and including time $t$.

Definition
Smoothing derives information about what the quantity of interest was at time $t' < t$ by using data measured up to and including time $t$ (i.e., in the interval $[0, t]$).
Stochastic filtering problem
The generic stochastic filtering problem:

$$x_t = f(t, x_t, u_t, w_t) \quad \text{(state equation)}$$
$$y_t = g(t, x_t, u_t, v_t) \quad \text{(measurement equation)}$$

where $u_t$ is the system input vector, $x_t$ the state vector, $y_t$ the observations, $w_t$ and $v_t$ are the process noise and the measurement noise, and $f$ and $g$ are functions which are potentially time varying.

Figure: A graphical model of the state space model (Chen 03).
Simplified model discrete case
The generic stochastic filtering problem:

$$x_t = f(t, x_t, u_t, w_t) \quad \text{(state equation)}$$
$$y_t = g(t, x_t, u_t, v_t) \quad \text{(measurement equation)}$$
In practice we are interested in the discrete simplified case
$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$
Note: careful, today there is a change of notation with respect to the Kalman filter slides: $z$ is now $x$, and $x$ is now $y$.
These equations are characterized by the state transition probability $p(x_{n+1}|x_n)$ and the likelihood $p(y_n|x_n)$.
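To make the discrete model concrete, here is a small simulation sketch of a standard benchmark nonlinear state-space model (the specific $f$, $g$, and noise levels below are a common textbook example, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):                      # state transition x_{n+1} = f(x_n, w_n)
    return 0.5 * x + 25 * x / (1 + x ** 2) + w

def g(x, v):                      # measurement y_n = g(x_n, v_n)
    return x ** 2 / 20 + v

T = 50
x = np.zeros(T)
y = np.zeros(T)
for n in range(1, T):
    x[n] = f(x[n - 1], rng.normal(0, np.sqrt(10)))  # process noise w_n
    y[n] = g(x[n], rng.normal(0, 1))                # measurement noise v_n
```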
Stochastic filtering is an inverse problem
Given $y_{n:0}$, provided $f$ and $g$ are known, one needs to find the best estimate $\hat{x}_n$.

This is an inverse problem: find the inputs sequentially given a mapping function which yields the output data.

This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique.

Definition
A problem is well-posed if it satisfies existence, uniqueness and stability.
Intractable Bayesian problems
Normalization: given the prior $p(x)$ and the likelihood $p(y|x)$, the posterior $p(x|y)$ is obtained by dividing by the normalization factor $p(y)$:

$$p(x|y) = \frac{p(y|x)p(x)}{\int_X p(y|x)p(x)\,dx}$$

Marginalization: given the joint posterior, the marginal posterior is

$$p(x|y) = \int_Z p(x, z|y)\,dz$$

Expectation:

$$E_{p(x|y)}[f(x)] = \int_X f(x)p(x|y)\,dx$$
Recursive Bayesian estimation I
Let $p(x_n|y_{n:0})$ be the conditional pdf of $x_n$:

$$p(x_n|y_{n:0}) = \frac{p(y_{n:0}|x_n)p(x_n)}{p(y_{n:0})}$$
$$= \frac{p(y_n, y_{n-1:0}|x_n)p(x_n)}{p(y_n, y_{n-1:0})}$$
$$= \frac{p(y_n|y_{n-1:0}, x_n)p(y_{n-1:0}|x_n)p(x_n)}{p(y_n|y_{n-1:0})p(y_{n-1:0})}$$
$$= \frac{p(y_n|y_{n-1:0}, x_n)p(x_n|y_{n-1:0})p(y_{n-1:0})p(x_n)}{p(y_n|y_{n-1:0})p(y_{n-1:0})p(x_n)}$$
$$= \frac{p(y_n|x_n)p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}$$
Recursive Bayesian estimation II
The posterior density is described by three terms:

$$p(x_n|y_{n:0}) = \frac{p(y_n|x_n)p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}$$

Prior: defines the knowledge of the model,

$$p(x_n|y_{n-1:0}) = \int p(x_n|x_{n-1})p(x_{n-1}|y_{n-1:0})\,dx_{n-1}$$

Likelihood: $p(y_n|x_n)$ determines the measurement noise model.

Evidence: which involves

$$p(y_n|y_{n-1:0}) = \int p(y_n|x_n)p(x_n|y_{n-1:0})\,dx_n$$

We need to define a criterion for optimal filtering.
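Before moving to optimality criteria, here is a minimal grid-based sketch of this predict/update recursion, for an assumed 1-D linear-Gaussian toy model ($x_n = 0.9\,x_{n-1} + w$, $y_n = x_n + v$; the model, grid, and noise levels are illustrative, and SciPy is assumed available):

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-10, 10, 401)          # discretized state space
dx = grid[1] - grid[0]
belief = norm.pdf(grid, 0, 2)             # p(x_0)
belief /= belief.sum() * dx

# trans[i, j] = p(x_n = grid[i] | x_{n-1} = grid[j])
trans = norm.pdf(grid[:, None], 0.9 * grid[None, :], 1.0)

for y in [1.2, 0.7, 1.9]:                 # incoming observations y_n
    prior = trans @ belief * dx           # prediction integral (the prior term)
    lik = norm.pdf(y, grid, 0.5)          # likelihood p(y_n | x_n)
    evidence = (lik * prior).sum() * dx   # p(y_n | y_{n-1:0})
    belief = lik * prior / evidence       # posterior p(x_n | y_{n:0})
```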
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of prediction or filtering error,

$$E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n$$

which aims to find the conditional mean

$$\hat{x}_n = E[x_n|y_{n:0}] = \int x_n p(x_n|y_{n:0})\,dx_n$$

Maximum a posteriori (MAP): aims to find the mode of the posterior probability $p(x_n|y_{n:0})$.

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior $p(x_n|y_{n:0})$.
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03).
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

$$E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\,dx\,dy$$

Minimum conditional KL divergence:

$$KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)p(x)}\,dx\,dy$$

where the KL is a measure of divergence between distributions, with $KL(p\|\hat{p}) \ge 0$; the KL is 0 only when the distributions are the same.
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.

Minimum free energy: its negative is a lower bound of the log-likelihood, and the aim is to minimize it:

$$\mathcal{F}(Q, P) \equiv E_{Q(x)}[-\log P(x|y)]$$
$$= E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]$$
$$= KL(Q\|P) + H(Q)$$

This minimization can be done using the (EM) algorithm:

$$Q(x_{n+1}) \leftarrow \arg\min_Q \mathcal{F}(Q, P)$$
$$\hat{x}_{n+1} \leftarrow \arg\min_x \mathcal{F}(Q, P)$$
Which criteria to choose
All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood, but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence, the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are very related.
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

$$E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n$$

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited
In practice we are interested in the discrete simplified case:

$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$

When the dynamic system is linear Gaussian, this reduces to

$$x_{n+1} = F_{n+1,n} x_n + w_n$$
$$y_n = G_n x_n + v_n$$

with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
Kalman filter Forward equations I
We start by defining the messages

$$\alpha(z_n) = \mathcal{N}(z_n \mid \mu_n, V_n)$$

Using the HMM recursion formulas for continuous variables, we have

$$c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1}) p(z_n|z_{n-1})\,dz_{n-1}$$

Substituting the conditionals, we have

$$c_n \mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|C z_n, \Sigma) \int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\, \mathcal{N}(z_n|A z_{n-1}, \Gamma)\,dz_{n-1}$$
$$= \mathcal{N}(x_n|C z_n, \Sigma)\, \mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})$$

Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined

$$P_{n-1} = A V_{n-1} A^T + \Gamma$$
Kalman filter Forward equations II
Given the values of $\mu_{n-1}$, $V_{n-1}$ and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$, having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:

$$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$$
$$V_n = (I - K_n C) P_{n-1}$$
$$c_n = \mathcal{N}(x_n|CA\mu_{n-1},\, CP_{n-1}C^T + \Sigma)$$

where the Kalman gain matrix is defined as

$$K_n = P_{n-1}C^T (CP_{n-1}C^T + \Sigma)^{-1}$$

The initial conditions are given by

$$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \quad V_1 = (I - K_1 C)V_0$$
$$c_1 = \mathcal{N}(x_1|C\mu_0,\, CV_0C^T + \Sigma), \quad K_1 = V_0C^T (CV_0C^T + \Sigma)^{-1}$$

Interpretation: making predictions and doing corrections with $K_n$.

The likelihood can be computed as $p(X) = \prod_{n=1}^{N} c_n$.
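A sketch of one forward (predict/correct) step in this notation (dynamics $z_n = A z_{n-1} + w$, $w \sim \mathcal{N}(0, \Gamma)$; observation $x_n = C z_n + v$, $v \sim \mathcal{N}(0, \Sigma)$; NumPy assumed):

```python
import numpy as np

def kalman_step(mu_prev, V_prev, x_n, A, C, Gamma, Sigma):
    P = A @ V_prev @ A.T + Gamma                        # P_{n-1}
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)    # Kalman gain K_n
    mu = A @ mu_prev + K @ (x_n - C @ A @ mu_prev)      # prediction + correction
    V = (np.eye(len(mu_prev)) - K @ C) @ P              # corrected covariance
    return mu, V
```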
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding $p(x_n|y_{n:0})$.

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximation (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.
Extended Kalman filter (EKF)
Recall the equations of motion:

$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$

These equations are linearized in the EKF:

$$F_{n+1,n} = \left.\frac{d f(x)}{dx}\right|_{x=\hat{x}_n} \qquad G_{n+1,n} = \left.\frac{d g(x)}{dx}\right|_{x=\hat{x}_{n|n-1}}$$

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior $p(x_n|y_{n:0})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
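A sketch of the linearization step: the Jacobian of $f$ evaluated at the current estimate plays the role of $F_{n+1,n}$. The finite-difference Jacobian and the toy dynamics below are assumptions for illustration (analytic Jacobians are the usual choice in practice):

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    # forward finite-difference Jacobian of func at x
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(func(xp)) - fx) / eps
    return J

f = lambda x: np.array([x[0] + 0.1 * np.sin(x[1]), 0.9 * x[1]])  # toy dynamics
F = jacobian(f, np.array([0.5, 1.0]))  # plays the role of F_{n+1,n}
```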
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Monte Carlo sampling
It's a brute force technique which, provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, gives

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\big(x^{(i)}\big) = \hat{f}_N$$

for which $E[\hat{f}_N] = E[f]$ and $\text{Var}[\hat{f}_N] = \frac{1}{N}\text{Var}[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ almost surely.

The convergence rate is assessed by the Central Limit Theorem:

$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
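A quick empirical check of the $O(N^{-1/2})$ rate (a sketch; the target distribution and the function $f$ are arbitrary choices): the spread of the Monte Carlo estimate shrinks by about a factor of 10 for every 100-fold increase in $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) ** 2

for N in [100, 10_000, 1_000_000]:
    # 50 independent Monte Carlo estimates of E[f(x)], x ~ N(0, 1)
    estimates = [f(rng.normal(size=N)).mean() for _ in range(50)]
    print(N, np.std(estimates))   # std dev scales like 1/sqrt(N)
```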
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\big(x^{(i)}\big) = \hat{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\,dP(x)$?
Important properties of an estimator
Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to the gross measurement errors and the uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx$$
Figure: Importance sampling (Chen 03).
Monte Carlo importance sampling uses $N$ independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W(x^{(i)})\, f(x^{(i)})$$

where $W(x^{(i)}) = \frac{p(x^{(i)})}{q(x^{(i)})}$ are called the importance weights.
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

$$\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})\, f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$$

The variance of the estimate is given by

$$\text{Var}[\hat{f}] = \frac{1}{N}\text{Var}[f(x)W(x)] = \frac{1}{N}\text{Var}\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \frac{(f(x)p(x))^2}{q(x)}\,dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)|p(x)$, so as to further reduce the true variance.

The estimator is biased but consistent.
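A self-normalized importance sampling sketch (the target, proposal, and test function are assumed toy choices: $p = \mathcal{N}(0,1)$ known only up to a constant, $q = \mathcal{N}(0, 2^2)$):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(0, 2, size=N)                 # x^(i) ~ q

log_p = -0.5 * x ** 2                        # unnormalized log p(x)
log_q = -0.5 * (x / 2) ** 2 - np.log(2.0)    # log q(x), shared constants dropped
W = np.exp(log_p - log_q)                    # unnormalized importance weights
W_tilde = W / W.sum()                        # normalized weights

f_hat = np.sum(W_tilde * x ** 2)             # estimates E_p[x^2] = 1
print(f_hat)
```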
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when encountering difficulties in sampling from the true probability distribution directly.

The proposal distribution $q(x)$ should have a heavy tail, so as to be insensitive to outliers.

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$; the sampling proceeds as:

for n = 1 to N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for
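A runnable sketch of this loop with an assumed pair of densities: target $p = \mathcal{N}(0,1)$, proposal $q = \text{Laplace}(0,1)$, for which $C = \sup_x p(x)/q(x) = \sqrt{2e/\pi} \approx 1.32$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # target density
q = lambda x: 0.5 * np.exp(-np.abs(x))                     # proposal density
C = np.sqrt(2 * np.e / np.pi)                              # p(x) <= C q(x)

def rejection_sample():
    while True:
        x = rng.laplace(0, 1)                   # x ~ q
        if rng.uniform() <= p(x) / (C * q(x)):  # accept with prob p/(Cq)
            return x                            # accepted draws are exact samples

samples = np.array([rejection_sample() for _ in range(1000)])
```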
Figure: Importance (left) and rejection (right) sampling (Chen 03).
The acceptance probability for a random variable is inversely proportional to the constant $C$.

The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because of the low rejection rate;
if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior $p(x)$ is used as $q(x)$, and the likelihood $p(y|x) \le C$ where $C$ is known, then

$$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$$

and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio $p(x)/(Cq(x))$ is close to zero.
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

then the importance sampling can be performed recursively.
Sequential Importance Sampling II
According to the telescope law of probability, we have

$$p(x_{n:0}) = p(x_0)p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights:

Sample N random samples {x^(i)}, i = 1, ..., N, from q(x)
for i = 1, ..., N do
    W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, ..., N do
    Normalize weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
end for
Resample with replacement N times from the discrete set {x^(i)}, i = 1, ..., N, where the
probability of resampling from each x^(i) is proportional to W̃(x^(i))
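A sketch of the resampling-with-replacement step (multinomial resampling; NumPy assumed; the particle and weight values are illustrative). Particles are redrawn in proportion to their normalized weights and the weights are then reset to $1/N$:

```python
import numpy as np

def resample(particles, weights, rng):
    # multinomial resampling: duplicate heavy particles, drop light ones
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(0)
particles = np.array([-1.0, 0.0, 1.0, 2.0])
weights = np.array([0.1, 0.2, 0.6, 0.1])      # normalized importance weights
particles, weights = resample(particles, weights, rng)
```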
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded as necessary:
deterministic resampling is taken every k time steps (usually k = 1);
dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^(i), W^(i)} are replaced by new samples with equal importance weights (i.e., W^(i) = 1/N).

Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample from x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
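A sketch for a bivariate Gaussian with correlation $\rho$ (an assumed toy target, chosen because both conditionals are known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                        # correlation of the target N(0, [[1, rho], [rho, 1]])
M = 5000
x = np.zeros((M, 2))
for n in range(1, M):
    # alternate exact conditional draws
    x1 = rng.normal(rho * x[n - 1, 1], np.sqrt(1 - rho ** 2))  # p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))           # p(x2 | x1)
    x[n] = (x1, x2)

print(np.corrcoef(x[1000:].T))   # empirical correlation approaches rho
```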
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$$\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\, |-\nabla\nabla \log f(\hat{x})|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

This works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$$

where $m_k$ denotes the $k$-th order moment and $x^{(i)}$ are the samples from the true distribution.

The computation cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(\hat{x}_j, \Sigma_j)$$

where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next:

Particle filters
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.

By this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0}|x_n)p(x_n)}{q(x_n|y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}q(x_n|y_{n:0})\,dx_n$$
$$= \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})}q(x_n|y_{n:0})\,dx_n$$
$$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)p(x_n)\,dx_n}$$
$$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)q(x_n|y_{n:0})\,dx_n}$$
$$= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing the i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}$$
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N$$

In practice this cannot be computed, so we approximate it as

$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{n:0}^{(i)}))^2}$$

When $N_{\text{eff}}$ is below a threshold $P$, then resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
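A sketch of the $N_{\text{eff}}$ check used to trigger resampling (the threshold $N/2$ and the weight values are assumed for illustration):

```python
import numpy as np

def effective_sample_size(normalized_weights):
    return 1.0 / np.sum(normalized_weights ** 2)

W = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # skewed normalized weights
N = len(W)
if effective_sample_size(W) < N / 2:           # threshold P = N/2 (assumed)
    print("resample")                          # N_eff ~ 1.9 here, so we resample
```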
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^(i) = W_{n-1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n-1}^(i)) / q(x_n^(i)|x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}^(i)))^2
    if N_eff < P then
        Generate new x_n^(j) by resampling with replacement N times from {x_{n:0}^(i)},
        with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W̃_n^(i)
        Reset the weights W_n^(i) = 1/N
    end if
end for
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high important weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x̂_{n:0}^(i) = {x_{n-1:0}^(i), x̂_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x̂_n^(i))
    end for
    for i = 1, ..., N do
        Normalize weights W̃(x^(i)) = W(x^(i)) / Σ_{i=1}^N W(x^(i))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to W̃_n^(i)
end for
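A compact runnable sketch of this bootstrap filter on the assumed toy model from earlier (the transition prior is the proposal, so the weight update is just the likelihood; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 50

def f(x, w): return 0.5 * x + 25 * x / (1 + x ** 2) + w   # assumed dynamics
def g(x, v): return x ** 2 / 20 + v                       # assumed measurement

# simulate ground truth and observations
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = f(x_true[n - 1], rng.normal(0, np.sqrt(10)))
    y[n] = g(x_true[n], rng.normal(0, 1))

particles = rng.normal(0, 2, size=N)       # x_0^(i) ~ p(x_0)
estimates = np.zeros(T)
for n in range(1, T):
    # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = f(particles, rng.normal(0, np.sqrt(10), size=N))
    # weight update: W_n^(i) = p(y_n | x_n^(i)), Gaussian measurement noise
    W = np.exp(-0.5 * (y[n] - particles ** 2 / 20) ** 2)
    W /= W.sum()
    estimates[n] = np.sum(W * particles)   # MMSE estimate before resampling
    # resample with replacement according to the normalized weights
    particles = particles[rng.choice(N, size=N, replace=True, p=W)]
```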
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: The CONDENSATION algorithm (conditional density propagation)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Stochastic filtering problem
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying
Figure A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
Stochastic filtering problem
The generic stochastic filtering problem:
x_t = f(t, x_t, u_t, w_t)  (state equation)
y_t = g(t, x_t, u_t, v_t)  (measurement equation)
where u_t is the system input vector, x_t the state vector, y_t the observations, w_t and v_t the process noise and the measurement noise, and f and g are functions which are potentially time varying.
Figure: A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem:
x_t = f(t, x_t, u_t, w_t)  (state equation)
y_t = g(t, x_t, u_t, v_t)  (measurement equation)
In practice we are interested in the discrete, simplified case:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
Figure: Careful, today there is a change of notation: z is now x, and x is now y.
These equations are characterized by the state transition probability p(x_{n+1}|x_n) and the likelihood p(y_n|x_n).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given y_{0:n}, and provided f and g are known, one needs to find the best estimate x̂_n.
This is an inverse problem: find the inputs sequentially, with a mapping function which yields the output data.
This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique.
Definition
A problem is well-posed if it satisfies existence, uniqueness and stability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization: given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by dividing by the normalization factor p(y):
p(x|y) = \frac{p(y|x)\, p(x)}{\int_X p(y|x)\, p(x)\, dx}
Marginalization: given the joint posterior, the marginal posterior is
p(x|y) = \int_Z p(x, z|y)\, dz
Expectation:
E_{p(x|y)}[f(x)] = \int_X f(x)\, p(x|y)\, dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(x_n|y_{0:n}) be the conditional pdf of x_n. Then
p(x_n|y_{0:n}) = \frac{p(y_{0:n}|x_n)\, p(x_n)}{p(y_{0:n})}
= \frac{p(y_n, y_{0:n-1}|x_n)\, p(x_n)}{p(y_n, y_{0:n-1})}
= \frac{p(y_n|y_{0:n-1}, x_n)\, p(y_{0:n-1}|x_n)\, p(x_n)}{p(y_n|y_{0:n-1})\, p(y_{0:n-1})}
= \frac{p(y_n|y_{0:n-1}, x_n)\, p(x_n|y_{0:n-1})\, p(y_{0:n-1})\, p(x_n)}{p(y_n|y_{0:n-1})\, p(y_{0:n-1})\, p(x_n)}
= \frac{p(y_n|x_n)\, p(x_n|y_{0:n-1})}{p(y_n|y_{0:n-1})},
where the last step uses the conditional independence of y_n and y_{0:n-1} given x_n.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior density is described with three terms:
p(x_n|y_{0:n}) = \frac{p(y_n|x_n)\, p(x_n|y_{0:n-1})}{p(y_n|y_{0:n-1})}
Prior: defines the knowledge of the model,
p(x_n|y_{0:n-1}) = \int p(x_n|x_{n-1})\, p(x_{n-1}|y_{0:n-1})\, dx_{n-1}
Likelihood: p(y_n|x_n) determines the measurement noise model.
Evidence: which involves the integral
p(y_n|y_{0:n-1}) = \int p(y_n|x_n)\, p(x_n|y_{0:n-1})\, dx_n
We need to define a criterion for optimal filtering.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
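To make these terms concrete, here is a minimal sketch (Python; not from the lecture, with made-up random-walk dynamics, noise levels and observations) that runs the prediction/update recursion on a discretized 1-D state space, where the prior, evidence and posterior integrals become sums:

import numpy as np

grid = np.linspace(-5.0, 5.0, 201)            # discretized state space
dx = grid[1] - grid[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# p(x_n | x_{n-1}): assumed random-walk transition; trans[i, j] = p(x_i | x_j)
trans = gauss(grid[:, None], grid[None, :], 0.3)

posterior = gauss(grid, 0.0, 1.0)             # p(x_0)
for y in [0.4, 0.9, 1.3]:                     # made-up observations
    prior = trans @ posterior * dx            # prediction integral
    lik = gauss(y, grid, 0.5)                 # likelihood p(y_n | x_n)
    evidence = np.sum(lik * prior) * dx       # p(y_n | y_{0:n-1})
    posterior = lik * prior / evidence        # Bayes update
print("posterior mean:", np.sum(grid * posterior) * dx)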
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.
Minimum mean-squared error (MMSE): defined in terms of a prediction or filtering error,
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{0:n})\, dx_n,
which aims to find the conditional mean
\hat{x}_n = E[x_n|y_{0:n}] = \int x_n\, p(x_n|y_{0:n})\, dx_n.
Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{0:n}).
Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.
Minimax: aims to find the median of the posterior p(x_n|y_{0:n}).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.
Minimum conditional inaccuracy: defined as
E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\, dx\, dy
Minimum conditional KL divergence:
KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)\, p(x)}\, dx\, dy,
where the KL divergence is a non-negative measure of discrepancy between distributions, KL(p\|\hat{p}) \ge 0, and it is 0 only when the distributions are the same.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.
Minimum free energy: a variational criterion; the negative free energy lower-bounds the log-likelihood, so one aims to minimize
F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
= E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
= KL(Q\|P) + H(Q),
with H(Q) = -E_{Q(x)}[\log Q(x)] the entropy of Q.
This minimization can be done with the EM algorithm, alternating
Q^{(k+1)} \leftarrow \arg\min_Q F(Q, P)
\hat{x}^{(k+1)} \leftarrow \arg\min_x F(Q, P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criterion to choose?
All these criteria are valid for state and parameter estimation.
MMSE requires the computation of the prior, likelihood and evidence.
MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally cheaper.
The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.
A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.
Hence the width of the mode is more important than its height in the high-dimensional case.
The last three criteria are all ML-oriented; they are closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{0:n})\, dx_n
Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).
In general we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete, simplified case:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
When the dynamic system is linear Gaussian, this reduces to
x_{n+1} = F_{n+1,n}\, x_n + w_n
y_n = G_n\, x_n + v_n,
with F_{n+1,n} the transition matrix and G_n the measurement matrix.
This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.
In the general case this is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
\alpha(z_n) = N(z_n|\mu_n, V_n).
Using the HMM recursion formulas for continuous variables, we have
c_n\, \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\, p(z_n|z_{n-1})\, dz_{n-1}.
Substituting the conditionals, we have
c_n\, N(z_n|\mu_n, V_n) = N(x_n|C z_n, \Sigma) \int N(z_{n-1}|\mu_{n-1}, V_{n-1})\, N(z_n|A z_{n-1}, \Gamma)\, dz_{n-1}
= N(x_n|C z_n, \Sigma)\, N(z_n|A \mu_{n-1}, P_{n-1}).
Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined
P_{n-1} = A V_{n-1} A^T + \Gamma.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, with mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:
\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C)\, P_{n-1}
c_n = N(x_n | C A \mu_{n-1},\; C P_{n-1} C^T + \Sigma),
where the Kalman gain matrix is defined as
K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}.
The initial conditions are given by
\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0),  V_1 = (I - K_1 C)\, V_0,
c_1 = N(x_1 | C \mu_0,\; C V_0 C^T + \Sigma),  K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}.
The interpretation is making a prediction and then doing corrections with K_n.
The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
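A minimal sketch of the forward recursions above, in today's notation (z latent, x observed); the 1-D matrices A, C, Γ, Σ and the observations below are made-up values, not from the lecture:

import numpy as np

A, C = np.array([[1.0]]), np.array([[1.0]])          # dynamics / emission
Gamma, Sigma = np.array([[0.1]]), np.array([[0.5]])  # process / obs. noise
mu, V = np.array([0.0]), np.array([[1.0]])           # p(z_0) = N(mu_0, V_0)

for x in [0.3, 0.7, 1.2, 0.9]:                       # observations x_1..x_4
    # predict: p(z_n | x_{1:n-1}) = N(A mu, P), P = A V A^T + Gamma
    mu_pred = A @ mu
    P = A @ V @ A.T + Gamma
    # correct with the Kalman gain K_n = P C^T (C P C^T + Sigma)^{-1}
    S = C @ P @ C.T + Sigma
    K = P @ C.T @ np.linalg.inv(S)
    mu = mu_pred + K @ (np.array([x]) - C @ mu_pred)
    V = (np.eye(1) - K @ C) @ P
print(mu, V)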
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
Nonlinear filtering consists in finding p(x_n|y_{0:n}).
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite-sum approximations (e.g. the Gaussian sum filter), linearization techniques (i.e. the EKF) or numerical approximations (e.g. the particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
These equations are linearized in the EKF:
F_{n+1,n} = \frac{\partial f(x)}{\partial x}\Big|_{x = \hat{x}_n},  G_n = \frac{\partial g(x)}{\partial x}\Big|_{x = \hat{x}_{n|n-1}}.
Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).
A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
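For illustration, a minimal EKF sketch under an assumed scalar model (the functions f, g and noise variances below are made up); it linearizes f and g at the current estimate and then applies the usual predict/correct step:

import numpy as np

f = lambda x: 0.9 * x + 0.1 * np.sin(x)   # made-up nonlinear dynamics
g = lambda x: x ** 2 / 2.0                # made-up nonlinear measurement
df = lambda x: 0.9 + 0.1 * np.cos(x)      # Jacobian of f
dg = lambda x: x                          # Jacobian of g
Q, R = 0.01, 0.1                          # process / measurement noise var.

def ekf_step(x_est, P, y):
    F = df(x_est)                         # linearize dynamics at x_est
    x_pred = f(x_est)
    P_pred = F * P * F + Q
    G = dg(x_pred)                        # linearize measurement at x_pred
    K = P_pred * G / (G * P_pred * G + R)
    return x_pred + K * (y - g(x_pred)), (1 - K * G) * P_pred

x_est, P = 1.0, 1.0
for y in [0.55, 0.48, 0.60]:              # made-up observations
    x_est, P = ekf_step(x_est, P, y)
print(x_est, P)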
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),
\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N,
for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.
The convergence rate is assessed by the Central Limit Theorem:
\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim N(0, \sigma^2),
where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
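A quick sketch of the point: estimating E[f(x)] = E[x²] = 1 for x ~ N(0, 1) by sample averages, where the error decays like O(N^{-1/2}):

import numpy as np

rng = np.random.default_rng(0)
for N in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)
    est = np.mean(x ** 2)              # (1/N) sum_i f(x^(i))
    print(N, est, abs(est - 1.0))      # error shrinks roughly like 1/sqrt(N)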
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N.
There are two fundamental problems:
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\, dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.
Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:
\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx.
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)}),
where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
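A minimal sketch of importance sampling under assumed densities: target p = N(0, 1), a wider proposal q = N(0, 2), and f(x) = x² with true value 1:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(0.0, 2.0, N)                      # samples from q = N(0, 2)
# log p(x) - log q(x); the sqrt(2*pi) terms cancel, leaving log 2
log_w = -0.5 * x ** 2 + 0.5 * (x / 2.0) ** 2 + np.log(2.0)
W = np.exp(log_w)                                # importance weights p/q
print(np.mean(W * x ** 2))                       # plain IS estimate of E_p[x^2]
print(np.sum(W / W.sum() * x ** 2))              # self-normalized estimate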
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights are normalized, use
\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)}),  with  \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}.
The variance of the estimate is given by
Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \frac{(f(x)\, p(x))^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}.
The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|\,p(x), so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure:
for n = 1, ..., N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for
Figure: Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), and the likelihood p(y|x) ≤ C where C is known, then
p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x),
and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
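A minimal sketch under assumed densities: the target is Beta(2, 2) with p(x) = 6x(1 - x) ≤ 1.5 on [0, 1], the proposal is U(0, 1), so C = 1.5 satisfies p(x) ≤ C q(x):

import numpy as np

rng = np.random.default_rng(0)
p = lambda x: 6.0 * x * (1.0 - x)     # target density on [0, 1]
C = 1.5                               # known bound with p(x) <= C * q(x)

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                 # sample from proposal q = U(0, 1)
    u = rng.uniform()                 # acceptance test
    if u <= p(x) / C:                 # accept with probability p(x)/(C q(x))
        samples.append(x)
print(np.mean(samples))               # Beta(2, 2) mean is 0.5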
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially: this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form,
q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t}),
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the "telescope" law of probability, we have
p(x_{0:n}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \dots, x_{n-1})
q(x_{0:n}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \dots, x_{n-1})
The weights can be recursively calculated as
W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\, \frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.
We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step is aimed to eliminate the samples with small importance weights and duplicate the samples with big weights.
Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
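One common way to implement the resampling step is multinomial resampling; the sketch below is my own (systematic and stratified variants also exist) and draws N new equally weighted particles with replacement:

import numpy as np

def resample(particles, weights, rng):
    """Draw N new equally-weighted particles according to `weights`."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)        # sample with replacement
    return particles[idx], np.full(N, 1.0 / N)    # reset weights to 1/N

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])         # skewed weights
print(resample(x, w, rng))                        # heavy particle duplicated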
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded necessary:
deterministic resampling is taken at every k-th time step (usually k = 1);
dynamic resampling is taken only when the variance of the importance weights is over a threshold.
The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e. W^{(i)} = 1/N).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
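A minimal sketch for a case where the conditionals are exact: Gibbs sampling from a bivariate Gaussian with correlation ρ (a made-up example), whose one-dimensional conditionals are themselves Gaussian:

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5000
x1, x2 = 0.0, 0.0
chain = np.empty((M, 2))
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))  # draw x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))  # draw x2 | x1
    chain[n] = (x1, x2)
print(np.corrcoef(chain[1000:].T))   # sample correlation approaches rho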
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation approximates the integral of a function \int f(x)\, dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:
\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, |-\nabla\nabla \log f(\hat{x})|^{-1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
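A minimal 1-D sketch with a made-up integrand: f(x) = x³ e^{-x}, whose exact integral over [0, ∞) is Γ(4) = 6; the Laplace formula above is evaluated at the mode x̂ = 3 with a numerical second derivative:

import numpy as np

log_f = lambda x: 3.0 * np.log(x) - x
x_hat = 3.0                             # mode: d/dx log f = 3/x - 1 = 0
h = 1e-4                                # numerical second derivative
hess = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2
Z = np.sqrt(2 * np.pi) * np.exp(log_f(x_hat)) * (-hess) ** -0.5
print(Z)                                # ~5.8, close to Gamma(4) = 6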
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k\, f(x_k),
where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions of high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k,
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in a high-dimensional space.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^m c_j\, N(x; \hat{x}_j, \Sigma_j),
where the weighting coefficients c_j > 0 and \sum_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.
A mixture of Gaussians admits a tractable solution by calculating individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n}),
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{0:n}).
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{0:n})\, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from the so-called proposal distribution q(x_n|y_{0:n}). Let's define
W_n(x_n) = \frac{p(y_{0:n}|x_n)\, p(x_n)}{q(x_n|y_{0:n})}.
We can then write
E[f(x_n)] = \int f(x_n)\, \frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\, q(x_n|y_{0:n})\, dx_n
= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{0:n})}\, q(x_n|y_{0:n})\, dx_n
= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int p(y_{0:n}|x_n)\, p(x_n)\, dx_n}
= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}
= \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}.
By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x),
where the normalized weights are defined as
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t|x_{0:t-1}, y_{0:t}).
As before, the posterior can be written as
p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}.
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})\, p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\, q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance Sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).
Let's assume that q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}.
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance Sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N.
In practice this cannot be computed, so we approximate it by
N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}.
When N_{eff} is below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
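The approximation of N_eff is one line of code; the sketch below (with made-up weight vectors) shows that even weights give N_eff = N, while a skewed weight vector collapses it towards 1:

import numpy as np

def effective_sample_size(w_norm):
    # N_eff ~= 1 / sum_i w_i^2 for normalized weights w_i
    return 1.0 / np.sum(w_norm ** 2)

print(effective_sample_size(np.full(100, 0.01)))    # even weights -> 100
w = np.zeros(100); w[0] = 0.99; w[1:] = 0.01 / 99   # skewed weights
print(effective_sample_size(w))                     # ~1: weight degeneracy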
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Compute N_eff = 1 / \sum_{i=1}^N (\tilde{W}(x_{0:n}))^2
    if N_eff < P then
        Generate new x_n^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    end for
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
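Putting the pieces together, a minimal bootstrap/SIR sketch for an assumed scalar model (all values below are made up): the transition prior is the proposal, so the weight update is just the likelihood, and resampling is performed at every step as in the algorithm above:

import numpy as np

rng = np.random.default_rng(0)
N, T, q_std, r_std = 500, 50, 0.3, 0.5   # particles, steps, noise levels

# simulate a trajectory and observations from x_{n+1} = 0.9 x_n + w_n,
# y_n = x_n + v_n (the made-up model)
x_true, ys = 0.0, []
for _ in range(T):
    x_true = 0.9 * x_true + q_std * rng.standard_normal()
    ys.append(x_true + r_std * rng.standard_normal())

particles = rng.standard_normal(N)        # samples from p(x_0)
for y in ys:
    # importance sampling from the transition prior p(x_n | x_{n-1})
    particles = 0.9 * particles + q_std * rng.standard_normal(N)
    # weight update: W ∝ p(y_n | x_n), then normalize
    w = np.exp(-0.5 * ((y - particles) / r_std) ** 2)
    w /= w.sum()
    # resampling: draw N new equally-weighted particles
    particles = particles[rng.choice(N, size=N, p=w)]
print("estimate:", particles.mean(), "truth:", x_true)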
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)},
where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Notation
y mdash the observationsx mdash the state
N mdash number of samplesyn0 mdash observations up to time n
xn0 mdash state up to time n
x(i)n mdash i-th sample at time n
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 6 69
Concept of sampling
The true distribution P(x) can be approximated by an empirical distribution
P(x) =1
N
Nsumi=1
δ(xminus x(i))
whereint
XdP(x) =
intX
p(x)dx = 1
Figure Sample approximation to the density of prob distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Stochastic filtering problem
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying
Figure A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem

Given y_{0:n}, and provided f and g are known, one needs to find the best estimate \hat{x}_n.

This is an inverse problem: find the inputs sequentially, given a mapping function which yields the output data.

This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique.

Definition
A problem is well-posed if it satisfies existence, uniqueness and stability.
Intractable Bayesian problems

Normalization: given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by dividing by the normalization factor p(y):

    p(x|y) = p(y|x) p(x) / \int_X p(y|x) p(x) dx

Marginalization: given the joint posterior, the marginal posterior is

    p(x|y) = \int_Z p(x, z|y) dz

Expectation:

    E_{p(x|y)}[f(x)] = \int_X f(x) p(x|y) dx
Recursive Bayesian estimation I

Let p(x_n|y_{0:n}) be the conditional pdf of x_n:

    p(x_n|y_{0:n}) = p(y_{0:n}|x_n) p(x_n) / p(y_{0:n})
                   = p(y_n, y_{0:n-1}|x_n) p(x_n) / p(y_n, y_{0:n-1})
                   = p(y_n|y_{0:n-1}, x_n) p(y_{0:n-1}|x_n) p(x_n) / [ p(y_n|y_{0:n-1}) p(y_{0:n-1}) ]
                   = p(y_n|y_{0:n-1}, x_n) p(x_n|y_{0:n-1}) p(y_{0:n-1}) p(x_n) / [ p(y_n|y_{0:n-1}) p(y_{0:n-1}) p(x_n) ]
                   = p(y_n|x_n) p(x_n|y_{0:n-1}) / p(y_n|y_{0:n-1})

where the last step uses that, given the state x_n, the current observation y_n is conditionally independent of the past observations.
Recursive Bayesian estimation II

The posterior density is described by three terms:

    p(x_n|y_{0:n}) = p(y_n|x_n) p(x_n|y_{0:n-1}) / p(y_n|y_{0:n-1})

Prior: defines the knowledge of the model,

    p(x_n|y_{0:n-1}) = \int p(x_n|x_{n-1}) p(x_{n-1}|y_{0:n-1}) dx_{n-1}

Likelihood: p(y_n|x_n) determines the measurement noise model.

Evidence: which involves the integral

    p(y_n|y_{0:n-1}) = \int p(y_n|x_n) p(x_n|y_{0:n-1}) dx_n

We need to define a criterion for optimal filtering.
Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error,

    E[||x_n - \hat{x}_n||_2^2 | y_{0:n}] = \int ||x_n - \hat{x}_n||_2^2 p(x_n|y_{0:n}) dx_n

which aims to find the conditional mean

    \hat{x}_n = E[x_n|y_{0:n}] = \int x_n p(x_n|y_{0:n}) dx_n

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{0:n}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: aims to find the median of the posterior p(x_n|y_{0:n}).
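To make the three criteria concrete, here is a small Python sketch that computes the MMSE, MAP and minimax point estimates from a set of posterior samples; the histogram-based mode and the Gamma example are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def point_estimates(samples, bins=50):
    """Given samples x^(i) ~ p(x_n | y_{0:n}), compute the three point
    estimates discussed above (1-D case, histogram mode for MAP)."""
    mmse = samples.mean()                      # conditional mean
    counts, edges = np.histogram(samples, bins=bins)
    k = counts.argmax()
    map_est = 0.5 * (edges[k] + edges[k + 1])  # mode of a histogram fit
    median = np.median(samples)                # minimax estimate
    return mmse, map_est, median

# Example on a skewed distribution, where the three estimates differ:
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=100_000)
print(point_estimates(x))  # mean = 2, mode = 1, median approx 1.68
```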
Criteria for optimal filtering II

MMSE finds the mean.
MAP finds the mode.
Minimax finds the median.

Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution (Chen 03)
Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

    E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log ( 1 / \hat{p}(x|y) ) dx dy

Minimum conditional KL divergence:

    KL(p||\hat{p}) = \int p(x, y) \log ( p(x, y) / ( \hat{p}(x|y) p(x) ) ) dx dy

where the KL is a measure of divergence between distributions, with KL(p||\hat{p}) >= 0. The KL is 0 only when the distributions are the same.
Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: minimizing the free energy maximizes a lower bound on the log-likelihood:

    F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
            = E_{Q(x)}[\log ( Q(x) / P(x|y) )] - E_{Q(x)}[\log Q(x)]
            = KL(Q||P) + H(Q)

where H(Q) = -E_{Q(x)}[\log Q(x)] is the entropy of Q. This minimization can be done with the expectation-maximization (EM) algorithm:

    Q(x_{n+1}) <- argmin_Q F(Q, P)
    \hat{x}_{n+1} <- argmin_x F(Q, P)
Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not of the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML-oriented; they are closely related.
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

    E[||x_n - \hat{x}_n||_2^2 | y_{0:n}] = \int ||x_n - \hat{x}_n||_2^2 p(x_n|y_{0:n}) dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited

In practice we are interested in the discrete, simplified case:

    x_{n+1} = f(x_n, w_n)
    y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

    x_{n+1} = F_{n+1,n} x_n + w_n
    y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case this is not tractable, and we will rely on approximations.
Kalman filter: Forward equations I

We start by defining the messages

    \alpha(z_n) = N(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

    c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1}) p(z_n|z_{n-1}) dz_{n-1}

Substituting the conditionals, we have

    c_n N(z_n | \mu_n, V_n) = N(x_n | C z_n, \Sigma) \int N(z_{n-1} | \mu_{n-1}, V_{n-1}) N(z_n | A z_{n-1}, \Gamma) dz_{n-1}
                            = N(x_n | C z_n, \Sigma) N(z_n | A \mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

    P_{n-1} = A V_{n-1} A^T + \Gamma
Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

    \mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
    V_n = (I - K_n C) P_{n-1}
    c_n = N(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

    K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

    \mu_1 = \mu_0 + K_1 (x_1 - C \mu_0),   V_1 = (I - K_1 C) V_0
    c_1 = N(x_1 | C \mu_0, C V_0 C^T + \Sigma),   K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

Interpretation: making predictions and correcting them with the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^{N} c_n.
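As a concrete illustration, here is a compact NumPy/SciPy sketch of this forward recursion, in the slide's notation (A, Γ for the transition model, C, Σ for the measurement model), treating the first observation with the prior (μ_0, V_0) as in the initial conditions above. The function name and interface are implementation choices, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward Kalman recursion following the equations above.
    X: (T, d_obs) observations x_1..x_T; returns filtered means/covs
    and the log-likelihood log p(X) = sum_n log c_n."""
    mus, Vs, loglik = [], [], 0.0
    mu, V = mu0, V0
    for n, x in enumerate(X):
        if n == 0:
            pred_mu, P = mu0, V0            # initial step uses the prior
        else:
            pred_mu, P = A @ mu, A @ V @ A.T + Gamma
        S = C @ P @ C.T + Sigma             # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)      # Kalman gain K_n
        mu = pred_mu + K @ (x - C @ pred_mu)    # correction step
        V = (np.eye(len(mu)) - K @ C) @ P
        loglik += multivariate_normal.logpdf(x, C @ pred_mu, S)  # log c_n
        mus.append(mu)
        Vs.append(V)
    return np.array(mus), np.array(Vs), loglik
```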
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{0:n}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximation (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.
Extended Kalman filter (EKF)

Recall the equations of motion:

    x_{n+1} = f(x_n, w_n)
    y_n = g(x_n, v_n)

These equations are linearized in the EKF:

    F_{n+1,n} = df(x)/dx |_{x = \hat{x}_n}
    G_{n+1,n} = dg(x)/dx |_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
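A sketch of one EKF predict/correct step, using finite-difference Jacobians in place of analytic derivatives; the additive-noise form x_{n+1} = f(x_n) + w_n, y_n = g(x_n) + v_n is an assumption made to keep the sketch short.

```python
import numpy as np

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of h at x (used to linearize f and g)."""
    x = np.asarray(x, float)
    J = np.empty((len(h(x)), len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (h(x + dx) - h(x - dx)) / (2 * eps)
    return J

def ekf_step(mu, V, y, f, g, Q, R):
    """One EKF predict/correct step for x_{n+1} = f(x_n) + w_n,
    y_n = g(x_n) + v_n (additive noise assumed)."""
    F = jacobian(f, mu)
    mu_pred = f(mu)
    P = F @ V @ F.T + Q                 # predicted covariance
    G = jacobian(g, mu_pred)
    S = G @ P @ G.T + R                 # innovation covariance
    K = P @ G.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))
    V_new = (np.eye(len(mu)) - K @ G) @ P
    return mu_new, V_new
```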
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Monte Carlo sampling

It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),

    \int_X f(x) dP(x) \approx (1/N) \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = (1/N) Var[f] = \sigma^2 / N.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

    \sqrt{N} ( \hat{f}_N - E[f] ) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

    \int_X f(x) dP(x) \approx (1/N) \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x) dP(x)?
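As a quick sanity check of the O(N^{-1/2}) rate, a minimal Python sketch; the choice of f and of a standard normal target is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(x)] for x ~ N(0,1), f(x) = x^2 (true value 1),
# illustrating the O(N^{-1/2}) error rate quoted above.
for N in (100, 10_000, 1_000_000):
    x = rng.standard_normal(N)
    est = np.mean(x**2)
    print(N, est, abs(est - 1.0))  # error shrinks roughly like 1/sqrt(N)
```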
Important properties of an estimator

Consistency: an estimator is consistent if it converges in probability to the true value as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

    \int f(x) p(x) dx = \int f(x) ( p(x) / q(x) ) q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

    \hat{f} = (1/N) \sum_{i=1}^{N} W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

    \hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)}) f(x^{(i)})   with   \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})

The variance of the estimate is given by

    Var[\hat{f}] = (1/N) Var[f(x) W(x)] = (1/N) Var[f(x) p(x)/q(x)]
                 = (1/N) \int ( f(x) p(x) )^2 / q(x) dx - ( E[f(x)] )^2 / N

The variance can be reduced when q(x) is chosen to:
- match the shape of p(x), so as to approximate the true variance;
- match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
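A minimal self-normalized importance sampling sketch in Python; the unnormalized Gaussian target, the Student-t proposal (heavy tails, as recommended below) and the test function are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Estimate E_p[f(x)] for p = N(0,1), known here only through an
# unnormalized density, using a heavy-tailed Student-t proposal q.
def p_unnorm(x):
    return np.exp(-0.5 * x**2)        # N(0,1) up to its normalizing constant

f = lambda x: x**2                     # true value: E_p[f] = 1

N = 100_000
x = rng.standard_t(df=3, size=N)               # samples from q
W = p_unnorm(x) / stats.t.pdf(x, df=3)         # unnormalized weights W(x^(i))
W_tilde = W / W.sum()                          # normalized weights
print(np.sum(W_tilde * f(x)))                  # approx 1.0
```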
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven, and thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

Importance samplers can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < infinity such that p(x) < C q(x) for every x in X. The sampling procedure:

    for n = 1 to N do
        Sample x ~ q(x) and u ~ U(0, 1)
        if u > p(x) / (C q(x)) then
            Reject x and repeat the sampling
        end if
    end for

Figure: Importance (left) and rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:
- if C is too small, the samples are not reliable because of the low rejection rate;
- if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) <= C with C known, then

    p(x|y) = p(y|x) p(x) / p(y) <= C q(x) / p(y) \equiv C' q(x)

and the acceptance rate for a sample x is p(x|y) / (C' q(x)) = p(y|x) / C.
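A minimal Python sketch of the procedure above; the Beta(2,2) target, the uniform proposal and the envelope constant C are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rejection sampling: draw from a Beta(2,2) target using a uniform proposal
# q = U(0,1) and envelope constant C = 1.5 (the maximum of the Beta(2,2)
# density, so p(x) <= C q(x) holds on [0,1]).
def p(x):
    return 6.0 * x * (1.0 - x)    # Beta(2,2) density

C = 1.5
samples = []
while len(samples) < 10_000:
    x = rng.uniform()             # x ~ q
    u = rng.uniform()             # u ~ U(0,1)
    if u <= p(x) / C:             # accept with probability p(x)/(C q(x))
        samples.append(x)
print(np.mean(samples))           # approx 0.5, the Beta(2,2) mean
```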
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form,

    q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

    p(x_{0:n}) = p(x_0) p(x_1|x_0) ... p(x_n|x_0, ..., x_{n-1})
    q(x_{0:n}) = q_0(x_0) q_1(x_1|x_0) ... q_n(x_n|x_0, ..., x_{n-1})

The weights can be recursively calculated as

    W_n(x_{0:n}) = p(x_{0:n}) / q(x_{0:n}) = W_{n-1}(x_{0:n-1}) p(x_n|x_{0:n-1}) / q_n(x_n|x_{0:n-1})
Remarks on Sequential Importance Sampling

The advantage of SIS is that it does not rely on the underlying Markov chain: many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero, as the sketch below illustrates.

We will now see that, in order to cope with this situation, a resampling step is used after weight normalization.
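The weight-degeneracy effect is easy to reproduce numerically. Below is a small Python sketch with a toy Gaussian random-walk target and a deliberately mismatched, wider Gaussian proposal; all model choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_trans(x_new, x_old):
    """log p(x_t | x_{t-1}): Gaussian random walk with unit variance."""
    return -0.5 * (x_new - x_old)**2 - 0.5 * np.log(2 * np.pi)

def log_q_trans(x_new, x_old, s=2.0):
    """log q(x_t | x_{t-1}): wider Gaussian proposal (std s)."""
    return -0.5 * ((x_new - x_old) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

N, T = 1000, 20
x = np.zeros(N)                 # x_0^(i) = 0; we assume p(x_0) = q(x_0)
logW = np.zeros(N)
for t in range(T):
    x_new = x + 2.0 * rng.standard_normal(N)               # sample from q_t
    logW += log_p_trans(x_new, x) - log_q_trans(x_new, x)  # recursive update
    x = x_new
W = np.exp(logW - logW.max())   # stabilized, unnormalized weights
print((W / W.sum()).max())      # weight skew grows with T (degeneracy)
```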
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and to duplicate the samples with large weights.

    Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
    for i = 1, ..., N do
        W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
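The resampling step itself is a one-liner; a minimal Python sketch (multinomial resampling; the toy particles and weights are made up for illustration):

```python
import numpy as np

def resample_multinomial(particles, weights, rng):
    """Resample with replacement, probability proportional to weight.
    Returns equally weighted particles (the SIR step above)."""
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)  # multinomial draw
    return particles[idx], np.full(N, 1.0 / N)

# Tiny usage example with made-up weights:
rng = np.random.default_rng(0)
parts = np.array([0.0, 1.0, 2.0, 3.0])
w = np.array([0.7, 0.1, 0.1, 0.1])
new_parts, new_w = resample_multinomial(parts, w, rng)
print(new_parts)   # mostly copies of 0.0, the heavy-weight particle
```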
Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:
- Deterministic resampling is taken at every k-th time step (usually k = 1).
- Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:
- if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
- when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

    Draw a sample x_0 ~ p(x_0)
    for n = 1 to M do
        for i = 1 to N_x do
            Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
        end for
    end for
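A minimal Python sketch of a Gibbs sampler for a case where both full conditionals are known in closed form; the bivariate Gaussian target is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampler for a bivariate Gaussian with correlation rho, using the
# closed-form full conditionals:
#   x1 | x2 ~ N(rho * x2, 1 - rho^2),   x2 | x1 ~ N(rho * x1, 1 - rho^2)
rho, M = 0.9, 20_000
x1, x2 = 0.0, 0.0
samples = np.empty((M, 2))
for n in range(M):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[n] = (x1, x2)
print(np.corrcoef(samples[M // 2:].T)[0, 1])   # approx rho, after burn-in
```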
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variances) among the subregions dividing the whole space.

Stratified sampling works very well, and is efficient, in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

    \int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
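A minimal 1-D Python sketch of the Laplace approximation above, with a finite-difference Hessian; the Gamma-like integrand (whose true integral is 2) is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Approximate Z = \int f(x) dx by (2*pi)^{1/2} f(x_hat) |(log f)''(x_hat)|^{-1/2},
# for f(x) = x^2 exp(-x) on (0, inf), whose true integral is Gamma(3) = 2.
f = lambda x: x**2 * np.exp(-x)
res = minimize_scalar(lambda x: -np.log(f(x)), bounds=(0.1, 10), method="bounded")
x_hat = res.x                                  # mode of f (here x_hat = 2)

eps = 1e-4                                     # second derivative of log f
d2 = (np.log(f(x_hat + eps)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - eps))) / eps**2
Z = np.sqrt(2 * np.pi) * f(x_hat) * abs(d2)**-0.5
print(Z)   # approx 1.92, close to the true value 2
```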
Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

    \int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly a priori, to give greater resolution to the regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Moment Approximation

Moment approximation targets approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

    m_k = E[x^k] = \int_X x^k p(x) dx \approx (1/N) \sum_{i=1}^{N} |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

    p(x) = \sum_{j=1}^{m} c_j N(x | \hat{x}_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution, by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next? Particle filters.
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

    p(x_n|y_{0:n}) \approx (1/N) \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function:

    E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{0:n}) dx_n
             = (1/N) \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
             = (1/N) \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{0:n}). Let's define

    W_n(x_n) = p(y_{0:n}|x_n) p(x_n) / q(x_n|y_{0:n})

We can then write

    E[f(x_n)] = \int f(x_n) ( p(x_n|y_{0:n}) / q(x_n|y_{0:n}) ) q(x_n|y_{0:n}) dx_n
              = \int f(x_n) ( W_n(x_n) / p(y_{0:n}) ) q(x_n|y_{0:n}) dx_n
              = \int f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / \int p(y_{0:n}|x_n) p(x_n) dx_n
              = \int f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / \int W_n(x_n) q(x_n|y_{0:n}) dx_n
              = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]
Sequential Monte Carlo estimation III

We have written

    E[f(x_n)] = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}), we can approximate

    E[f(x_n)] \approx [ (1/N) \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) \sum_{i=1}^{N} W_n(x_n^{(i)}) ]
             = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

    \tilde{W}(x_n^{(i)}) = W_n(x_n^{(i)}) / \sum_{j=1}^{N} W_n(x_n^{(j)})
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

    q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

    p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1}) p(y_n|x_n) p(x_n|x_{n-1}) / p(y_n|y_{0:n-1})

We can then derive a recursive rule to update the weights:

    W_n^{(i)} = p(x_{0:n}^{(i)}|y_{0:n}) / q(x_{0:n}^{(i)}|y_{0:n})
              \propto p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{0:n-1}^{(i)}|y_{0:n-1}) / [ q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) q(x_{0:n-1}^{(i)}|y_{0:n-1}) ]
              = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).

Let's assume that q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n); then we can write

    W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

    N_eff = N / E_{q(.|y_{0:n})}[ (W(x_{0:n}))^2 ] <= N

In practice this cannot be computed, so we approximate it as

    N_eff \approx 1 / \sum_{i=1}^{N} ( \tilde{W}(x_{0:n}^{(i)}) )^2

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.
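The approximation above is one line of code; a minimal sketch (the example weight vectors are made up for illustration):

```python
import numpy as np

def effective_sample_size(w_tilde):
    """N_eff approximation above; w_tilde are the normalized weights."""
    return 1.0 / np.sum(w_tilde**2)

# Uniform weights give N_eff = N; one dominant weight gives N_eff close to 1.
print(effective_sample_size(np.full(100, 0.01)))                  # 100.0
print(effective_sample_size(np.array([0.97, 0.01, 0.01, 0.01])))  # approx 1.06
```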
SIS particle filter with resampling

    for n = 0, ..., T do
        for i = 1, ..., N do
            Draw samples x_n^{(i)} ~ q(x_n|x_{0:n-1}^{(i)}, y_{0:n})
            Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
        end for
        for i = 1, ..., N do
            Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)
        end for
        for i = 1, ..., N do
            Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
        end for
        Compute N_eff = 1 / \sum_{i=1}^{N} ( \tilde{W}(x_{0:n}) )^2
        if N_eff < P then
            Generate new x_n^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_{0:n}^{(i)}
            Reset the weights W_n^{(i)} = 1/N
        end if
    end for
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem: it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using the transition prior as proposal distribution

    for i = 1, ..., N do
        Sample x_0^{(i)} ~ p(x_0)
        Compute W_0^{(i)} = 1/N
    end for
    for n = 0, ..., T do
        for i = 1, ..., N do
            Importance sampling: \hat{x}_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
            Set \hat{x}_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}}
        end for
        for i = 1, ..., N do
            Weight update: W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})
        end for
        for i = 1, ..., N do
            Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
        end for
        Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
    end for
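Putting the pieces together, here is a minimal Python sketch of the bootstrap/SIR filter above, applied to the toy nonlinear model introduced earlier; resampling at every step, the noise variances, and the number of particles are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(y, N=500, q=10.0, r=1.0):
    """Bootstrap/SIR filter for the toy model used earlier (transition prior
    as proposal, multinomial resampling at every step)."""
    T = len(y)
    x = rng.normal(0.0, 1.0, size=N)         # x_0^(i) ~ p(x_0)
    means = np.empty(T)
    for n in range(T):
        # Importance sampling from the transition prior p(x_n | x_{n-1})
        x = 0.5 * x + 25.0 * x / (1.0 + x**2) + np.sqrt(q) * rng.standard_normal(N)
        # Weight update: W_n^(i) = p(y_n | x_n^(i)), Gaussian measurement noise
        logW = -0.5 * (y[n] - x**2 / 20.0)**2 / r
        W = np.exp(logW - logW.max())
        W /= W.sum()                          # normalize the weights
        means[n] = np.sum(W * x)              # MMSE estimate before resampling
        x = x[rng.choice(N, size=N, p=W)]     # resampling step
    return means
```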
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

    W_n = W_{n-1}^{\alpha} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular: CONDENSATION

Figure: CONDENSATION
Concept of sampling
The true distribution P(x) can be approximated by an empirical distribution
P(x) =1
N
Nsumi=1
δ(xminus x(i))
whereint
XdP(x) =
intX
p(x)dx = 1
Figure Sample approximation to the density of prob distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 7 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Stochastic filtering problem
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying
Figure A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior density is described by three terms:

p(x_n|y_{0:n}) = p(y_n|x_n) p(x_n|y_{0:n-1}) / p(y_n|y_{0:n-1})

Prior: defines the knowledge of the model,

p(x_n|y_{0:n-1}) = ∫ p(x_n|x_{n-1}) p(x_{n-1}|y_{0:n-1}) dx_{n-1}

Likelihood: p(y_n|x_n) determines the measurement noise model.

Evidence: involves the integral

p(y_n|y_{0:n-1}) = ∫ p(y_n|x_n) p(x_n|y_{0:n-1}) dx_n

We need to define a criterion for optimal filtering.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
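These three terms can be made concrete on a discretized state space, where the prediction and evidence integrals become sums. Below is a minimal sketch (not from the lecture) of one recursive Bayesian update over a 1-D grid; the Gaussian random-walk transition kernel and the Gaussian likelihood are illustrative assumptions.

    import numpy as np

    # One recursive Bayesian update on a 1-D grid (illustrative model).
    grid = np.linspace(-5.0, 5.0, 201)            # discretized state space
    dx = grid[1] - grid[0]

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Assumed transition kernel p(x_n | x_{n-1}): a Gaussian random walk.
    K = gauss(grid[:, None], grid[None, :], 0.5)  # K[i, j] = p(grid[i] | grid[j])

    posterior = gauss(grid, 0.0, 1.0)             # p(x_{n-1} | y_{0:n-1})
    y_n = 1.3                                     # new observation

    prior = K @ posterior * dx                    # prediction (prior) integral
    lik = gauss(y_n, grid, 0.8)                   # likelihood p(y_n | x_n), assumed Gaussian
    evidence = np.sum(lik * prior) * dx           # evidence p(y_n | y_{0:n-1})
    posterior = lik * prior / evidence            # Bayes rule: the new filtered density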
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error,

E[||x_n - x̂_n||_2^2 | y_{0:n}] = ∫ ||x_n - x̂_n||_2^2 p(x_n|y_{0:n}) dx_n

which aims at finding the conditional mean

x̂_n = E[x_n | y_{0:n}] = ∫ x_n p(x_n|y_{0:n}) dx_n

Maximum a posteriori (MAP): aims at finding the mode of the posterior probability p(x_n|y_{0:n}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{0:n}).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[-log p̂(x|y)] = ∫ p(x, y) log [1 / p̂(x|y)] dx dy

Minimum conditional KL divergence:

KL(p||p̂) = ∫ p(x, y) log [p(x, y) / (p̂(x|y) p(x))] dx dy

where the KL divergence is a non-negative measure of discrepancy between distributions, KL(p||p̂) ≥ 0, and is 0 only when the distributions are the same.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.

Minimum free energy: the negative free energy lower-bounds the log-likelihood, so the free energy is minimized:

F(Q, P) ≡ E_{Q(x)}[-log P(x|y)]
        = E_{Q(x)}[log (Q(x) / P(x|y))] - E_{Q(x)}[log Q(x)]
        = KL(Q||P) + H(Q)

This minimization can be done using the EM algorithm:

Q(x_{n+1}) ← argmin_Q F(Q, P)
x̂_{n+1} ← argmin_x F(Q, P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML-oriented; they are closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[||x_n - x̂_n||_2^2 | y_{0:n}] = ∫ ||x_n - x̂_n||_2^2 p(x_n|y_{0:n}) dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete, simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter: Forward equations I

We start by defining the messages

α(z_n) = N(z_n | μ_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n α(z_n) = p(x_n|z_n) ∫ α(z_{n-1}) p(z_n|z_{n-1}) dz_{n-1}

Substituting the conditionals, we have

c_n N(z_n | μ_n, V_n) = N(x_n | C z_n, Σ) ∫ N(z_{n-1} | μ_{n-1}, V_{n-1}) N(z_n | A z_{n-1}, Γ) dz_{n-1}
                     = N(x_n | C z_n, Σ) N(z_n | A μ_{n-1}, P_{n-1})

Here we assume that μ_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter: Forward equations II

Given the values of μ_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean μ_n and covariance V_n, as well as the normalization coefficient c_n:

μ_n = A μ_{n-1} + K_n (x_n - C A μ_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = N(x_n | C A μ_{n-1}, C P_{n-1} C^T + Σ)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + Σ)^{-1}

The initial conditions are given by

μ_1 = μ_0 + K_1 (x_1 - C μ_0),   V_1 = (I - K_1 C) V_0
c_1 = N(x_1 | C μ_0, C V_0 C^T + Σ),   K_1 = V_0 C^T (C V_0 C^T + Σ)^{-1}

The interpretation is making a prediction and then correcting it with the gain K_n.

The likelihood can be computed as p(X) = ∏_{n=1}^N c_n.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
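A minimal sketch of these forward equations, keeping the slide's notation (z latent, x observed, dynamics A, emission C, noise covariances Γ and Σ); the constant-velocity model and the noise values are illustrative assumptions, not part of the lecture.

    import numpy as np

    # Kalman forward recursion sketch (z latent, x observed).
    A = np.array([[1.0, 1.0], [0.0, 1.0]])   # assumed constant-velocity dynamics
    C = np.array([[1.0, 0.0]])               # we observe the position only
    Gamma = 0.01 * np.eye(2)                 # process noise covariance
    Sigma = 0.5 * np.eye(1)                  # measurement noise covariance

    def kalman_step(mu, V, x_obs):
        P = A @ V @ A.T + Gamma                            # P_{n-1} = A V_{n-1} A^T + Gamma
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)   # Kalman gain K_n
        mu_pred = A @ mu                                   # prediction ...
        mu_new = mu_pred + K @ (x_obs - C @ mu_pred)       # ... and correction with K_n
        V_new = (np.eye(2) - K @ C) @ P                    # V_n = (I - K_n C) P_{n-1}
        return mu_new, V_new

    mu, V = np.zeros(2), np.eye(2)                         # mu_0, V_0
    for x_obs in [np.array([0.9]), np.array([2.1]), np.array([2.8])]:
        mu, V = kalman_step(mu, V, x_obs)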
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n|y_{0:n}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = df(x)/dx |_{x = x̂_n}
G_n = dg(x)/dx |_{x = x̂_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
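A sketch of one EKF step on an assumed scalar model, with the Jacobians of f and g computed analytically and plugged into the standard Kalman predict/correct equations; the model and the noise variances are illustrative, not from the lecture.

    import numpy as np

    # Assumed scalar model: x_{n+1} = 0.5 x + 25 x / (1 + x^2) + w,  y = x^2 / 20 + v
    f  = lambda x: 0.5 * x + 25 * x / (1 + x ** 2)
    g  = lambda x: x ** 2 / 20.0
    df = lambda x: 0.5 + 25 * (1 - x ** 2) / (1 + x ** 2) ** 2  # Jacobian of f
    dg = lambda x: x / 10.0                                     # Jacobian of g
    Q, R = 1.0, 1.0                                             # noise variances

    def ekf_step(x_hat, P, y):
        F = df(x_hat)                          # linearize f at the current estimate
        x_pred, P_pred = f(x_hat), F * P * F + Q
        G = dg(x_pred)                         # linearize g at the prediction
        K = P_pred * G / (G * P_pred * G + R)  # Kalman gain
        return x_pred + K * (y - g(x_pred)), (1 - K * G) * P_pred

    x_hat, P = 0.1, 5.0
    x_hat, P = ekf_step(x_hat, P, y=4.0)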
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),

∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^{(i)}) = f̄_N

for which E[f̄_N] = E[f] and Var[f̄_N] = (1/N) Var[f] = σ²/N.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), f̄_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

√N (f̄_N - E[f]) ∼ N(0, σ²)

where σ² is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
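A minimal illustration of these properties, assuming f(x) = x² and P = N(0, 1), so that the target value E[f] = 1 is known exactly and the O(N^{-1/2}) decay of the error can be observed.

    import numpy as np

    # Plain Monte Carlo estimate of E[f(x)] for x ~ N(0, 1), f(x) = x^2.
    rng = np.random.default_rng(0)
    for N in [10, 1000, 100000]:
        x = rng.standard_normal(N)      # i.i.d. draws from P(x)
        f_bar = np.mean(x ** 2)         # f_N = (1/N) sum_i f(x^(i))
        print(N, f_bar)                 # approaches E[x^2] = 1 at rate O(N^(-1/2))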
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

∫_X f(x) dP(x) ≈ (1/N) Σ_{i=1}^N f(x^{(i)}) = f̄_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = ∫ f(x) dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and the uncertainties of the model.
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

∫ f(x) p(x) dx = ∫ f(x) [p(x)/q(x)] q(x) dx

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

f̄ = (1/N) Σ_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized, we use

f̄ = Σ_{i=1}^N W̃(x^{(i)}) f(x^{(i)})   with   W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^N W(x^{(j)})

The variance of the estimate is given by

Var[f̄] = (1/N) Var[f(x) W(x)] = (1/N) Var[f(x) p(x)/q(x)]
       = (1/N) [ ∫ (f(x) p(x))² / q(x) dx - (E[f(x)])² ]

The variance can be reduced when q(x) is chosen to:

match the shape of p(x), so as to approximate the true variance;

match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
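A self-normalized importance sampling sketch under illustrative assumptions: the target p is a standard Gaussian (pretending it is hard to sample) and the proposal q is a wider Gaussian with heavier tails than p, in line with the remarks that follow.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    # Target p = N(0, 1); proposal q = N(0, 2^2), heavier-tailed than p.
    x = rng.normal(0.0, 2.0, size=N)                      # draws from q
    log_p = -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
    log_q = -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)
    W = np.exp(log_p - log_q)                             # importance weights p/q
    W_tilde = W / W.sum()                                 # normalized weights
    f_hat = np.sum(W_tilde * x ** 2)                      # ~ E_p[x^2] = 1 (biased but consistent)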
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling proceeds as:

for n = 1 to N do
    Sample u ∼ U(0, 1)
    Sample x ∼ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) ≤ C where C is known, then

p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C′ q(x)

and the acceptance rate for sample x is p(x|y) / (C′ q(x)) = p(y|x) / C.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
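A rejection sampling sketch under illustrative assumptions: target p = N(0, 1) and proposal q = Laplace(0, 1), for which the bound p(x) ≤ C q(x) holds with C = √(2e/π) ≈ 1.32, so roughly 1/C ≈ 76% of the proposals are accepted.

    import numpy as np

    rng = np.random.default_rng(0)
    C = np.sqrt(2 * np.e / np.pi)        # p(x) <= C q(x) for all x with this C

    def p(x):                            # target density N(0, 1)
        return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

    def q(x):                            # proposal density Laplace(0, 1)
        return 0.5 * np.exp(-np.abs(x))

    samples = []
    while len(samples) < 1000:
        x = rng.laplace(0.0, 1.0)        # sample x ~ q
        u = rng.uniform(0.0, 1.0)        # sample u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):       # accept; otherwise repeat sampling
            samples.append(x)
    # The accepted draws are exact samples from p.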
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{0:n}|y_{0:n}) = q(x_0) ∏_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{0:n}) = p(x_0) p(x_1|x_0) ··· p(x_n | x_0, ..., x_{n-1})
q(x_{0:n}) = q_0(x_0) q_1(x_1|x_0) ··· q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{0:n}) = p(x_{0:n}) / q(x_{0:n}) = W_{n-1}(x_{0:n-1}) · p(x_n|x_{0:n-1}) / q_n(x_n|x_{0:n-1})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to W̃(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
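A minimal sketch of the resampling step alone, implemented as multinomial resampling (one common choice among the many resampling schemes mentioned in the remarks below); the particle values and the skewed weights are illustrative.

    import numpy as np

    # Multinomial resampling: draw N indices with replacement, with probability
    # proportional to the normalized importance weights.
    def resample(particles, weights, rng):
        N = len(particles)
        idx = rng.choice(N, size=N, replace=True, p=weights)
        return particles[idx], np.full(N, 1.0 / N)       # weights reset to 1/N

    rng = np.random.default_rng(0)
    particles = rng.normal(size=5)
    weights = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # skewed weights
    particles, weights = resample(particles, weights, rng)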
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ∼ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ∼ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
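A minimal Gibbs sampler sketch for an assumed bivariate Gaussian with correlation ρ, mirroring the two-dimensional illustration above; both full conditionals are 1-D Gaussians and hence easy to sample.

    import numpy as np

    rng = np.random.default_rng(0)
    rho, M = 0.8, 5000
    x1, x2 = 0.0, 0.0                    # initial state
    samples = np.empty((M, 2))
    for n in range(M):
        # Alternating conditional sampling: x1 | x2, then x2 | x1.
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
        samples[n] = (x1, x2)
    # The empirical correlation of the samples approaches rho as M grows.
    print(np.corrcoef(samples.T)[0, 1])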
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computing the volume:

∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) | -∇∇ log f(x̂) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
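A 1-D sketch of the Laplace formula above on an assumed integrand f(x) = x^4 e^{-x} over x > 0 (whose exact integral is Γ(5) = 24), with the maximum and the Hessian of log f computed analytically.

    import numpy as np

    f = lambda x: x ** 4 * np.exp(-x)    # illustrative unnormalized density, x > 0
    x_hat = 4.0                          # maximum of f: d/dx log f = 4/x - 1 = 0
    hess = -4.0 / x_hat ** 2             # d^2/dx^2 log f(x) = -4 / x^2 at x_hat
    approx = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(hess) ** (-0.5)
    print(approx)                        # ~ 23.5, close to the exact value 24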
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
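A quadrature sketch using a Gauss-Hermite rule, a standard choice when the weighting function p(x) is Gaussian (the slide does not prescribe a specific rule): E[f(x)] for x ∼ N(0, 1) is approximated by a weighted sum over m quadrature points.

    import numpy as np

    m = 10
    # Probabilists' Gauss-Hermite rule: weight function exp(-x^2 / 2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(m)
    c = weights / np.sqrt(2 * np.pi)     # normalize so the weights match N(0, 1)
    f = np.cos                           # a smooth integrand
    approx = np.sum(c * f(nodes))
    print(approx)                        # ~ exp(-1/2) = 0.6065..., very accurate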
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = Σ_{j=1}^m c_j N(x | μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
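A minimal sketch of evaluating such a Gaussian sum; the two components, their weights and the tractable mean are illustrative assumptions.

    import numpy as np

    def gmm_pdf(x, c, mu, sigma):
        # p(x) = sum_j c_j N(x | mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1
        comp = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return comp @ c

    c = np.array([0.3, 0.7])
    mu = np.array([-2.0, 1.5])
    sigma = np.array([0.5, 1.0])
    x = np.linspace(-5.0, 5.0, 11)
    print(gmm_pdf(x, c, mu, sigma))      # bimodal density values on the grid
    print(np.sum(c * mu))                # mixture moments are tractable: the mean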
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{0:n}) ≈ (1/N) Σ_{i=1}^N δ(x_n - x_n^{(i)}) ≡ p̂(x_n|y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n|y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n|y_{0:n}) dx_n
         = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n - x_n^{(i)}) dx_n
         = (1/N) Σ_{i=1}^N f(x_n^{(i)}) ≡ f̂_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution q(x_n|y_{0:n}). Let's define

W_n(x_n) = p(y_{0:n}|x_n) p(x_n) / q(x_n|y_{0:n})

We can then write

E[f(x_n)] = ∫ f(x_n) [p(x_n|y_{0:n}) / q(x_n|y_{0:n})] q(x_n|y_{0:n}) dx_n
         = ∫ f(x_n) [W_n(x_n) / p(y_{0:n})] q(x_n|y_{0:n}) dx_n
         = ∫ f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / ∫ p(y_{0:n}|x_n) p(x_n) dx_n
         = ∫ f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / ∫ W_n(x_n) q(x_n|y_{0:n}) dx_n
         = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}), we can approximate

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)}) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^{(i)}) ]
         = Σ_{i=1}^N W̃(x_n^{(i)}) f(x_n^{(i)}) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^{(i)}) = W_n(x_n^{(i)}) / Σ_{j=1}^N W_n(x_n^{(j)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{0:n}|y_{0:n}) = q(x_0) ∏_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1}) p(y_n|x_n) p(x_n|x_{n-1}) / p(y_n|y_{0:n-1})

We can then create a recursive rule to update the weights:

W_n^{(i)} = p(x_{0:n}^{(i)}|y_{0:n}) / q(x_{0:n}^{(i)}|y_{0:n})
         ∝ p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{0:n-1}^{(i)}|y_{0:n-1}) / [ q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) q(x_{0:n-1}^{(i)}|y_{0:n-1}) ]
         = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).

Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_eff = N / E_{q(·|y_{0:n})}[(W(x_{0:n}))²] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{0:n}^{(i)}))²

When N_eff is below a threshold P, then resampling is performed.

N_eff can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ∼ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^N W(x^{(j)})
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{0:n}^{(i)}))²
    if N_eff < P then
        Generate new x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = W̃_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
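A compact sketch of this algorithm under illustrative assumptions: a 1-D linear-Gaussian model x_n = 0.9 x_{n-1} + w_n, y_n = x_n + v_n, with the transition prior used as the proposal q (so the weight update reduces to W_n = W_{n-1} p(y_n|x_n)) and resampling triggered by N_eff.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, P_thresh = 500, 50, 250            # particles, time steps, threshold P

    def lik(y, x, sigma=0.5):                # p(y_n | x_n): measurement noise model
        return np.exp(-0.5 * ((y - x) / sigma) ** 2)

    # Simulate data from the assumed model.
    x_true, y = np.zeros(T), np.zeros(T)
    for n in range(1, T):
        x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, 1.0)
        y[n] = x_true[n] + rng.normal(0.0, 0.5)

    x = rng.normal(0.0, 1.0, N)              # particles drawn from p(x_0)
    W = np.full(N, 1.0 / N)
    est = np.zeros(T)
    for n in range(1, T):
        x = 0.9 * x + rng.normal(0.0, 1.0, N)   # draw x_n^(i) from q = p(x_n | x_{n-1}^(i))
        W = W * lik(y[n], x)                 # sequential weight update
        W = W / W.sum()                      # normalize the weights
        est[n] = np.sum(W * x)               # posterior-mean estimate, before resampling
        N_eff = 1.0 / np.sum(W ** 2)         # effective sample size
        if N_eff < P_thresh:                 # resample only when degeneracy sets in
            x = x[rng.choice(N, size=N, p=W)]
            W = np.full(N, 1.0 / N)          # reset the weights to 1/N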
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ∼ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̃_n^{(i)} ∼ p(x_n | x_{n-1}^{(i)})
        Set x̃_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x̃_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x̃_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{j=1}^N W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x̃_n^{(i)}} according to W̃_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
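Reusing the model and the helper names (lik, y, rng, N, T) from the SIS sketch above, the bootstrap/SIR variant changes only the loop body: resampling is performed at every step, so the weights need not be carried over.

    # Bootstrap/SIR loop body (assumes the definitions of the SIS sketch above).
    x = rng.normal(0.0, 1.0, N)                  # particles from p(x_0)
    for n in range(1, T):
        x = 0.9 * x + rng.normal(0.0, 1.0, N)    # importance sampling from the prior
        W = lik(y[n], x)                         # weight update: W_n^(i) = p(y_n | x_n^(i))
        W = W / W.sum()
        x = x[rng.choice(N, size=N, p=W)]        # resampling at every step
        # after resampling, all particles carry equal weight 1/N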
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in the SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^α p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Stochastic filtering problem
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying
Figure A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of a prediction or filtering error,
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n,
which aims at finding the conditional mean
\hat{x}_n = E[x_n|y_{n:0}] = \int x_n\, p(x_n|y_{n:0})\,dx_n.

Maximum a posteriori (MAP): aims at finding the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: aims at finding the median of the posterior p(x_n|y_{n:0}).

Criteria for optimal filtering II

MMSE finds the mean.
MAP finds the mode.
Minimax finds the median.

Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution. (Chen 03)

Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as
E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x,y)\,\log\frac{1}{\hat{p}(x|y)}\,dx\,dy

Minimum conditional KL divergence:
KL(p\|\hat{p}) = \int p(x,y)\,\log\frac{p(x,y)}{\hat{p}(x|y)\,p(x)}\,dx\,dy,
where the KL divergence is a non-negative measure of discrepancy between distributions, KL(p\|\hat{p}) \ge 0, and is 0 only when the two distributions are the same.

Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: minimizing the free energy is equivalent to maximizing a (variational) lower bound on the log-likelihood. It is aimed at minimizing
F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
       = E_{Q(x)}\Big[\log\frac{Q(x)}{P(x|y)}\Big] - E_{Q(x)}[\log Q(x)]
       = KL(Q\|P) + H(Q)

This minimization can be carried out with an EM-style algorithm, alternating
Q(x_{n+1}) \leftarrow \arg\min_Q F(Q, P), \qquad \hat{x}_{n+1} \leftarrow \arg\min_x F(Q, P)

Which criterion to choose?

All these criteria are valid for state and parameter estimation.
MMSE requires the computation of the prior, the likelihood and the evidence.
MAP requires the computation of the prior and the likelihood but not the denominator (an integration), and is thereby computationally cheaper.
The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.
A narrow spike with very small width (support) can have a very high density, but the actual probability that the estimated state belongs to it is small.
Hence the width of the mode is more important than its height in the high-dimensional case.
The last three criteria are all ML oriented, and they are closely related.

Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).
In general we can only seek a suboptimal or locally optimal solution.

Kalman filter revisited

In practice we are interested in the discrete simplified case
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to
x_{n+1} = F_{n+1,n}\, x_n + w_n
y_n = G_n\, x_n + v_n
with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.
In the general case it is not tractable, and we will rely on approximations.

Kalman filter: Forward equations I

We start by defining the messages
\alpha(z_n) = N(z_n|\mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have
c_n\,\alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\,p(z_n|z_{n-1})\,dz_{n-1}

Substituting the conditionals, we have
c_n\,N(z_n|\mu_n, V_n) = N(x_n|C z_n, \Sigma) \int N(z_{n-1}|\mu_{n-1}, V_{n-1})\,N(z_n|A z_{n-1}, \Gamma)\,dz_{n-1}
                      = N(x_n|C z_n, \Sigma)\, N(z_n|A\mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined
P_{n-1} = A V_{n-1} A^T + \Gamma

Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:
\mu_n = A\mu_{n-1} + K_n(x_n - C A\mu_{n-1})
V_n = (I - K_n C)\,P_{n-1}
c_n = N(x_n|C A\mu_{n-1},\; C P_{n-1} C^T + \Sigma)
where the Kalman gain matrix is defined as
K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by
\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \quad V_1 = (I - K_1 C)\,V_0
c_1 = N(x_1|C\mu_0,\; C V_0 C^T + \Sigma), \quad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

Interpretation: we are making a prediction and then correcting it with K_n.
The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.

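The predict-correct structure maps directly to code. A minimal sketch of one forward step in this z/x notation follows; it assumes NumPy arrays for A, Gamma, C, Sigma and is not tied to any particular model.

```python
import numpy as np

def kalman_forward_step(mu_prev, V_prev, x_obs, A, Gamma, C, Sigma):
    """One predict-correct step; returns mu_n, V_n and the coefficient c_n."""
    P = A @ V_prev @ A.T + Gamma                 # P_{n-1} = A V_{n-1} A^T + Gamma
    S = C @ P @ C.T + Sigma                      # innovation covariance C P C^T + Sigma
    K = P @ C.T @ np.linalg.inv(S)               # Kalman gain K_n
    innov = x_obs - C @ (A @ mu_prev)            # x_n - C A mu_{n-1}
    mu = A @ mu_prev + K @ innov                 # corrected mean
    V = (np.eye(len(mu_prev)) - K @ C) @ P       # corrected covariance
    d = len(x_obs)                               # c_n = N(x_n | C A mu_{n-1}, S)
    c = np.exp(-0.5 * innov @ np.linalg.solve(S, innov)) \
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return mu, V, c
```
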
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
Nonlinear filtering consists in finding p(x_n|y_{n:0}).
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite-sum approximations (e.g. the Gaussian sum filter), linearization techniques (i.e. the EKF) or numerical approximations (e.g. the particle filter) are usually used.

Extended Kalman filter (EKF)

Recall the equations of motion
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:
F_{n+1,n} = \frac{\partial f(x)}{\partial x}\Big|_{x=\hat{x}_n}, \qquad
G_n = \frac{\partial g(x)}{\partial x}\Big|_{x=\hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).
A more general solution is to rely on numerical approximations.

Numerical approximations

Monte Carlo sampling approximation (i.e. the particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation

Monte Carlo sampling

It is a brute-force technique which, provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), gives
\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f\big(x^{(i)}\big) = \hat{f}_N,
for which E[\hat{f}_N] = E[f] and \mathrm{Var}[\hat{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}.

By the Kolmogorov strong law of large numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.
The convergence rate is assessed by the central limit theorem:
\sqrt{N}\big(\hat{f}_N - E[f]\big) \sim N(0, \sigma^2),
where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.

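The O(N^{-1/2}) behaviour is easy to observe numerically. A small sketch, with the arbitrary choices f(x) = x^2 and x ~ N(0,1), so that the true value E[f] = 1 is known:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                   # E[f] = 1 under N(0, 1)
for N in [10**2, 10**4, 10**6]:
    x = rng.standard_normal(N)         # i.i.d. samples from P(x)
    est = f(x).mean()                  # f_N = (1/N) sum_i f(x_i)
    print(N, est, abs(est - 1.0))      # error shrinks roughly like N**-0.5
```
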
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate
\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f\big(x^{(i)}\big) = \hat{f}_N

There are two fundamental problems:
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\,dP(x)?

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.
Minimal variance.

Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:
\int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx

Figure: Importance sampling. (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
\hat{f} = \frac{1}{N}\sum_{i=1}^N W\big(x^{(i)}\big)\,f\big(x^{(i)}\big),
where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.

Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights are normalized, we use
\hat{f} = \sum_{i=1}^N \tilde{W}\big(x^{(i)}\big)\,f\big(x^{(i)}\big), \qquad
\tilde{W}\big(x^{(i)}\big) = \frac{W\big(x^{(i)}\big)}{\sum_{j=1}^N W\big(x^{(j)}\big)}

The variance of the estimate is given by
\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}[f(x)\,W(x)] = \frac{1}{N}\mathrm{Var}\Big[f(x)\,\frac{p(x)}{q(x)}\Big]
= \frac{1}{N}\int \frac{\big(f(x)\,p(x)\big)^2}{q(x)}\,dx - \frac{\big(E[f(x)]\big)^2}{N}

The variance can be reduced when q(x) is chosen to
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|\,p(x), so as to further reduce the true variance.
The estimator is biased but consistent.

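A minimal self-normalized importance sampling sketch; the unnormalized bimodal target p and the wide Gaussian proposal q are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_unnorm(x):                           # unnormalized target (bimodal)
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x):                              # heavy enough proposal: N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

N = 100_000
x = 3.0 * rng.standard_normal(N)           # samples from q
W = p_unnorm(x) / q_pdf(x)                 # importance weights (up to a constant)
W_tilde = W / W.sum()                      # normalized weights
f_hat = np.sum(W_tilde * x)                # self-normalized estimate of E_p[x]
```
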
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.
Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling proceeds as:

for n = 1 to N do
    repeat
        Sample x \sim q(x) and u \sim U(0, 1)
    until u \le p(x) / (C q(x))
    Accept x as a draw from p(x)
end for

Figure: Importance (left) and rejection (right) sampling. (Chen 03)

The acceptance probability of a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the bound p(x) < C q(x) fails and the samples are not reliable;
if C is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then
p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C' q(x),
and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.

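A runnable sketch with assumed choices: target p = N(0,1) and a uniform proposal on [-5, 5]:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_pdf(x):                                # target density: N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

q_height = 1.0 / 10.0                        # q(x) = 1/10 on [-5, 5]
C = p_pdf(0.0) / q_height * 1.01             # ensures p(x) < C q(x) everywhere

samples = []
while len(samples) < 1000:
    x = rng.uniform(-5.0, 5.0)               # x ~ q
    u = rng.uniform()                        # u ~ U(0, 1)
    if u <= p_pdf(x) / (C * q_height):       # accept with probability p(x)/(C q(x))
        samples.append(x)
```
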
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0}),
then the importance sampling can be performed recursively.

Sequential Importance Sampling II

According to the telescope law of probability, we have
p(x_{n:0}) = p(x_0)\,p(x_1|x_0) \cdots p(x_n|x_0, \ldots, x_{n-1})
q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0) \cdots q_n(x_n|x_0, \ldots, x_{n-1})

The weights can be recursively calculated as
W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}

Remarks on Sequential Importance Sampling

The advantage of SIS is that it does not rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only few or one of the W(x^{(i)}) will be nonzero.
We will now see that, in order to cope with this situation, a resampling step is used after weight normalization.

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and to duplicate the samples with big weights:

Sample N random samples \{x^{(i)}\}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set \{x^{(i)}\}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})

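A sketch of the resampling step as a function (multinomial resampling; systematic resampling is a common lower-variance alternative):

```python
import numpy as np

def resample_multinomial(particles, weights, rng):
    """Resample with replacement, proportional to the normalized weights."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)       # multinomial draw of indices
    return particles[idx], np.full(N, 1.0 / N)   # new equal weights 1/N
```
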
Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:
deterministic resampling is taken at every k-th time step (usually k = 1);
dynamic resampling is taken only when the variance of the importance weights is over a threshold.
The particles and associated importance weights \{x^{(i)}, W^{(i)}\} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides chances for selecting important samples and rejuvenating the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.

Gibbs sampling

It is a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since each conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} \sim p(x_i|x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

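A minimal sketch for a case where both conditionals are available in closed form: a bivariate Gaussian with correlation rho (an assumption made purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                            # target: N(0, [[1, rho], [rho, 1]])
M = 5000
x1, x2 = 0.0, 0.0                    # arbitrary starting state
samples = np.empty((M, 2))
for n in range(M):
    # alternate the (marginal) conditional draws
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)
print(np.corrcoef(samples[1000:].T)[0, 1])   # should approach rho
```
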
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variances) among subregions that divide the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good, since the gradient direction might indicate the way to find states with higher probability.

Numerical approximations

Monte Carlo sampling approximation (i.e. the particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation

Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be carried out with the EKF method.
The Laplace approximation method approximates the integral of a function \int f(x)\,dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:
\int f(x)\,dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, \big|-\nabla\nabla \log f(\hat{x})\big|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

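A 1-D sketch of the Laplace recipe: find the mode by a crude gradient ascent on log f, take the second derivative there by finite differences, and report the Gaussian volume. The example f is an assumption with known integral sqrt(2*pi):

```python
import numpy as np

def laplace_integral_1d(f, x0, h=1e-4, iters=200, lr=0.1):
    """Approximate the integral of f by fitting a Gaussian at its mode."""
    x = x0
    for _ in range(iters):                   # gradient ascent on log f
        g = (np.log(f(x + h)) - np.log(f(x - h))) / (2.0 * h)
        x += lr * g
    # second derivative of log f at the mode (negative at a maximum)
    d2 = (np.log(f(x + h)) - 2.0 * np.log(f(x)) + np.log(f(x - h))) / h ** 2
    return np.sqrt(2.0 * np.pi) * f(x) * (-d2) ** -0.5

f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)  # true integral: sqrt(2*pi) ~ 2.5066
print(laplace_integral_1d(f, x0=0.0))
```
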
Iterative quadrature

A numerical approximation method which was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^m c_k\, f(x_k),
where p(x) is treated as a weighting function and x_k is a quadrature point.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.

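For a Gaussian weighting function this becomes Gauss-Hermite quadrature. A sketch using NumPy's probabilists' Hermite rule to compute E[f(x)] for x ~ N(0,1); the cubic f is an arbitrary test function:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

f = lambda x: x ** 3 + 2.0 * x ** 2 + 1.0   # test integrand; exact E[f] = 3
nodes, weights = hermegauss(10)             # rule for the weight exp(-x^2 / 2)
estimate = np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi)  # normalize to N(0,1)
print(estimate)                             # ~ 3.0
```
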
Multi-grid method and point-mass approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).
If the state space is continuous, we can always discretize it into N_z discrete cell states, and a grid-based method can then be used to approximate the posterior density.
The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly, so it cannot give greater resolution to the regions with high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.

Moment approximation

Moment approximation targets the moments of the density, including the mean, the covariance, and higher-order moments.
We can empirically use the sample moment to approximate the true moment, namely
m_k = E[x^k] = \int_X x^k\, p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N \big(x^{(i)}\big)^k,
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Gaussian sum approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^m c_j\, N(x|\mu_j, \Sigma_j),
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution, by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) True distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

What have we seen?

We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?
Particle filters

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations; let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^N \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0}),
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n
= \frac{1}{N}\sum_{i=1}^N \int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n
= \frac{1}{N}\sum_{i=1}^N f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let us define

W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}

We can then write
E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n
= \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n
= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n}
= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}
= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Sequential Monte Carlo estimation III

We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)}
= \sum_{i=1}^N \tilde{W}\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big) \equiv \hat{f}(x),
where the normalized weights are defined as
\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^N W_n\big(x_n^{(j)}\big)}

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:
q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as
p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
\propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})}
= W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}

Types of filters

Depending on the type of sampling used, we have different types of filters:
Sequential importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.

Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).
Let us assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.
A suggested measure for degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as
N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.

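The practical approximation is a one-liner over the normalized weights; a sketch (resample_multinomial is the helper defined in an earlier sketch):

```python
import numpy as np

def effective_sample_size(w_tilde):
    """N_eff estimate from normalized importance weights (which sum to 1)."""
    return 1.0 / np.sum(w_tilde ** 2)

# usage: resample only when degeneracy is detected
# if effective_sample_size(w_tilde) < P:
#     particles, w_tilde = resample_multinomial(particles, w_tilde, rng)
```
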
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, ..., N do
        Calculate the weight W_n^{(i)} = W_{n-1}^{(i)}\, p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from \{x_{n:0}^{(i)}\}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for

Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.

SIR filter using the transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} \sim p(x_0)
    Set W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})
        Set \hat{x}_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}\}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set \{\hat{x}_n^{(i)}\} according to W_n^{(i)}
end for

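A runnable sketch of this bootstrap filter on an assumed scalar model (random-walk dynamics with Gaussian observation noise); the model and noise levels are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 1000, 50
sigma_w, sigma_v = 0.5, 1.0                      # assumed process/measurement noise

# simulate a trajectory from the assumed model to have data to filter
x_true = np.cumsum(sigma_w * rng.standard_normal(T))
y = x_true + sigma_v * rng.standard_normal(T)

particles = rng.standard_normal(N)               # x_0^(i) ~ p(x_0)
estimates = []
for n in range(T):
    # importance sampling from the transition prior p(x_n | x_{n-1})
    particles = particles + sigma_w * rng.standard_normal(N)
    # weight update: W proportional to p(y_n | x_n), then normalize
    w = np.exp(-0.5 * ((y[n] - particles) / sigma_v) ** 2)
    w /= w.sum()
    estimates.append(np.sum(w * particles))      # MMSE estimate before resampling
    # resampling step (multinomial)
    particles = particles[rng.choice(N, size=N, p=w)]
```
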
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling. (Chen 03)

Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.
To alleviate the sample degeneracy in the SIS filter, we can use
W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)},
where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Popular CONDENSATION

Figure: CONDENSATION.
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.
Minimum free energy: minimizing the free energy maximizes a lower bound on the log-likelihood,
$$\mathcal{F}(Q; P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\Big[\log\frac{Q(x)}{P(x|y)}\Big] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) + H(Q)$$
where $H(Q) = -E_{Q(x)}[\log Q(x)]$ is the entropy.
This minimization can be done with an EM-style algorithm, alternating
$$Q^{(n+1)} \leftarrow \arg\min_Q \mathcal{F}(Q; P), \qquad \hat{x}^{(n+1)} \leftarrow \arg\min_x \mathcal{F}(Q; P)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior, likelihood, and evidence.
MAP requires the computation of the prior and likelihood, but not the denominator (an integration), and is thereby computationally less expensive.
The MAP estimate has a drawback, especially in high-dimensional spaces: high probability density does not imply high probability mass.
A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.
Hence the width of the mode is more important than its height in the high-dimensional case.
The last three criteria are all ML-oriented; they are closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:
$$E\big[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}\big] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n$$
Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).
In general we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete, simplified case
$$x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)$$
When the dynamic system is linear Gaussian, this reduces to
$$x_{n+1} = F_{n+1,n}\,x_n + w_n, \qquad y_n = G_n\,x_n + v_n$$
with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix.
This is the Kalman filter: we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.
In the general case this is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
$$\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)$$
Using the HMM recursion formulas for continuous variables, we have
$$c_n\,\alpha(z_n) = p(x_n|z_n)\int \alpha(z_{n-1})\,p(z_n|z_{n-1})\,dz_{n-1}$$
Substituting the conditionals, we have
$$c_n\,\mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|Cz_n, \Sigma)\int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\,\mathcal{N}(z_n|Az_{n-1}, \Gamma)\,dz_{n-1} = \mathcal{N}(x_n|Cz_n, \Sigma)\,\mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})$$
Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined
$$P_{n-1} = AV_{n-1}A^T + \Gamma$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of $\mu_{n-1}$, $V_{n-1}$, and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$, with mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:
$$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$$
$$V_n = (I - K_n C)\,P_{n-1}$$
$$c_n = \mathcal{N}(x_n|CA\mu_{n-1},\, CP_{n-1}C^T + \Sigma)$$
where the Kalman gain matrix is defined as
$$K_n = P_{n-1}C^T\big(CP_{n-1}C^T + \Sigma\big)^{-1}$$
The initial conditions are given by
$$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \qquad V_1 = (I - K_1C)\,V_0$$
$$c_1 = \mathcal{N}(x_1|C\mu_0,\, CV_0C^T + \Sigma), \qquad K_1 = V_0C^T\big(CV_0C^T + \Sigma\big)^{-1}$$
Interpretation: make a prediction, then correct it with the gain $K_n$.
The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
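As a concrete illustration, a minimal NumPy sketch of this forward recursion, in the slide's notation ($A$ dynamics, $C$ emission, $\Gamma$, $\Sigma$ process and measurement covariances); the function name and interface are ours, not part of the lecture:

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward Kalman recursion: returns the filtered means and covariances."""
    mus, Vs = [], []
    mu, V = mu0, V0
    for n, x in enumerate(X):
        if n == 0:
            mu_pred, P = mu0, V0               # initial step uses the prior directly
        else:
            mu_pred = A @ mu                   # predicted mean A mu_{n-1}
            P = A @ V @ A.T + Gamma            # P_{n-1} = A V_{n-1} A^T + Gamma
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)   # Kalman gain K_n
        mu = mu_pred + K @ (x - C @ mu_pred)               # correction step
        V = (np.eye(len(mu0)) - K @ C) @ P
        mus.append(mu)
        Vs.append(V)
    return np.array(mus), np.array(Vs)
```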
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
Nonlinear filtering consists in finding $p(x_n|y_{n:0})$.
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite-sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF), or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
$$x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)$$
These equations are linearized in the EKF:
$$F_{n+1,n} = \frac{df(x)}{dx}\bigg|_{x=\hat{x}_n}, \qquad G_{n+1,n} = \frac{dg(x)}{dx}\bigg|_{x=\hat{x}_{n|n-1}}$$
Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior $p(x_n|y_{n:0})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).
A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
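A minimal sketch of one EKF predict-correct cycle with additive noise, linearizing $f$ and $g$ by finite differences; the helper names and the additive-noise form are our own assumptions:

```python
import numpy as np

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of h at x."""
    hx = np.atleast_1d(h(x))
    J = np.zeros((hx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(h(x + dx)) - hx) / eps
    return J

def ekf_step(mu, V, y, f, g, Q, R):
    """One EKF cycle: linearize, predict, then correct with the Kalman gain."""
    F = jacobian(f, mu)                      # F_{n+1,n}, evaluated at x = mu
    mu_pred = np.atleast_1d(f(mu))           # predicted state
    P = F @ V @ F.T + Q                      # predicted covariance
    G = jacobian(g, mu_pred)                 # G_{n+1,n}, at the predicted state
    K = P @ G.T @ np.linalg.inv(G @ P @ G.T + R)
    mu_new = mu_pred + K @ (np.atleast_1d(y) - np.atleast_1d(g(mu_pred)))
    V_new = (np.eye(mu.size) - K @ G) @ P
    return mu_new, V_new
```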
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique: provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, then
$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f\big(x^{(i)}\big) = \hat{f}_N$$
for which $E[\hat{f}_N] = E[f]$ and $\mathrm{Var}[\hat{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}$.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ almost surely.
The convergence rate is assessed by the Central Limit Theorem:
$$\sqrt{N}\big(\hat{f}_N - E[f]\big) \sim \mathcal{N}(0, \sigma^2)$$
where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
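To make these rates concrete, a tiny Monte Carlo integration example; the choice $f(x) = x^2$ under $P = \mathcal{N}(0,1)$ (true expectation 1) is our own toy setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(size=N)                   # i.i.d. samples from P = N(0, 1)
fx = x**2                                # f(x) = x^2, with E[f] = 1
f_hat = fx.mean()                        # Monte Carlo estimate of E[f]
std_err = fx.std(ddof=1) / np.sqrt(N)    # O(N^{-1/2}), independent of dimension
print(f"estimate = {f_hat:.4f} +/- {std_err:.4f}")
```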
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f\big(x^{(i)}\big) = \hat{f}_N$$
There are two fundamental problems:
How to draw samples from a probability distribution $P(x)$?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\,dP(x)$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.
Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.
The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:
$$\int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx$$
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses $N$ independent samples drawn from $q(x)$ to approximate
$$\hat{f} = \frac{1}{N}\sum_{i=1}^N W\big(x^{(i)}\big)\,f\big(x^{(i)}\big)$$
where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
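A small runnable sketch of this estimator; the toy setup (target $p = \mathcal{N}(0,1)$, heavy-tailed Student-t proposal, $f(x) = x^2$) is ours:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N = 50_000
x = stats.t.rvs(df=3, size=N, random_state=rng)   # draw samples from proposal q
W = stats.norm.pdf(x) / stats.t.pdf(x, df=3)      # importance weights p(x)/q(x)
f_hat = np.mean(W * x**2)                         # estimates E_p[x^2] = 1
print(f"IS estimate = {f_hat:.4f}")
```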
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure normalization, the importance weights are normalized:
$$\hat{f} = \sum_{i=1}^N \tilde{W}\big(x^{(i)}\big)\,f\big(x^{(i)}\big), \quad \text{with} \quad \tilde{W}\big(x^{(i)}\big) = \frac{W\big(x^{(i)}\big)}{\sum_{j=1}^N W\big(x^{(j)}\big)}$$
The variance of the estimate is given by
$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}\big[f(x)\,W(x)\big] = \frac{1}{N}\mathrm{Var}\Big[f(x)\,\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \frac{\big(f(x)\,p(x)\big)^2}{q(x)}\,dx - \frac{\big(E[f(x)]\big)^2}{N}$$
The variance can be reduced when $q(x)$ is chosen to:
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)|\,p(x)$, so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution $q(x)$ should have a heavy tail, so as to be insensitive to outliers.
If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant $C < \infty$ such that $p(x) < C\,q(x)$ for every $x \in X$; the sampling proceeds as follows:
for n = 1 to N do
    Sample $u \sim U(0,1)$
    Sample $x \sim q(x)$
    if $u > \frac{p(x)}{C\,q(x)}$ then
        Repeat sampling
    end if
end for
Figure: Importance (left) and rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant $C$.
The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because of the low rejection rate (the bound may not hold);
if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior $p(x)$ is used as $q(x)$, and the likelihood $p(y|x) \le C$ with $C$ known, then
$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C'\,q(x)$$
and the acceptance rate for a sample $x$ is $\frac{p(x|y)}{C'\,q(x)} = \frac{p(y|x)}{C}$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
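A runnable sketch of this accept/reject loop; the toy target Beta(2,2), whose density $6x(1-x)$ is bounded by $C = 1.5$ under $q = U(0,1)$, is our own choice:

```python
import numpy as np

rng = np.random.default_rng(3)

def p(x):                        # target density: Beta(2, 2)
    return 6.0 * x * (1.0 - x)

C = 1.5                          # p(x) <= C * q(x) for q = U(0, 1)
samples = []
while len(samples) < 10_000:
    x = rng.uniform()            # sample x ~ q
    u = rng.uniform()            # sample u ~ U(0, 1)
    if u <= p(x) / C:            # accept with probability p(x) / (C q(x))
        samples.append(x)
# the accepted draws are exact samples from p; the acceptance rate is 1/C ~ 0.67
```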
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio $p(x)/(C\,q(x))$ is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})$$
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
$$p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0, \cdots, x_{n-1})$$
The weights can be recursively calculated as
$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.
We will now see that, to cope with this situation, a resampling step is used after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights:
Sample $N$ random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for $i = 1, \cdots, N$ do
    $W^{(i)} \propto p(x^{(i)})/q(x^{(i)})$
end for
for $i = 1, \cdots, N$ do
    Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) \,/\, \sum_{j=1}^N W(x^{(j)})$
end for
Resample with replacement $N$ times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
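In code, the resampling step of this listing is short; a minimal multinomial-resampling sketch (the function name is ours):

```python
import numpy as np

def resample(particles, weights, rng):
    """Multinomial resampling: duplicate heavy particles, drop light ones."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)       # indices drawn ~ normalized weights
    return particles[idx], np.full(N, 1.0 / N)   # survivors get equal weights 1/N
```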
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded necessary:
Deterministic resampling is taken at every $k$-th time step (usually $k = 1$).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.
The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models:
Draw a sample $x_0 \sim p(x_0)$
for $n = 1$ to $M$ do
    for $i = 1$ to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
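A compact illustration of the alternating conditionals; the toy target (zero-mean bivariate Gaussian with correlation $\rho = 0.8$, whose one-dimensional conditionals are exact Gaussians) is our own choice:

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                        # correlation of the bivariate Gaussian target
x1, x2 = 0.0, 0.0
chain = []
for n in range(20_000):
    # alternate the exact one-dimensional conditionals
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # draw from p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # draw from p(x2 | x1)
    chain.append((x1, x2))
chain = np.array(chain)[1000:]   # discard burn-in
print(np.corrcoef(chain.T)[0, 1])   # ~ 0.8, matching the target correlation
```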
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from the previous state, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method that uses gradient information to reduce random-walk behavior.
This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:
$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,\big|{-\nabla\nabla\log f(\hat{x})}\big|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces poor approximations for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
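A one-dimensional sketch of this Laplace volume estimate; the unnormalized density $\exp(-x^4/4 - x^2/2)$ and the finite-difference Hessian are our own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-x**4 / 4.0 - x**2 / 2.0)     # unnormalized, unimodal density

x_hat = minimize_scalar(lambda x: -f(x)).x         # mode of f
eps = 1e-4                                         # finite-difference Hessian of log f
d2 = (np.log(f(x_hat + eps)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - eps))) / eps**2
Z = np.sqrt(2.0 * np.pi) * f(x_hat) * (-d2) ** (-0.5)   # (2 pi)^{1/2} f(x_hat) |-d2|^{-1/2}
print(f"Laplace estimate of the volume: {Z:.4f}")
```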
Iterative Quadrature
Numerical approximation method, widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^m c_k\,f(x_k)$$
where $p(x)$ is treated as a weighting function and the $x_k$ are the quadrature points.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
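For instance, a Gauss-Hermite rule (nodes and weights from NumPy) evaluates a smooth expectation under a Gaussian weighting function; the substitution $x = \sqrt{2}\,x_k$ maps the Hermite weight $e^{-x^2}$ to $\mathcal{N}(0,1)$:

```python
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(20)   # quadrature points and weights
f = np.cos                                             # a smooth integrand
# E[f(x)] for x ~ N(0, 1), via the substitution x = sqrt(2) * node
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(approx)   # exact value is exp(-1/2) ~ 0.6065
```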
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so as to give greater resolution to regions with high density.
The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
$$\hat{m}_k = E[x^k] = \int_X x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N \big(x^{(i)}\big)^k$$
where $\hat{m}_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
$$p(x) = \sum_{j=1}^m c_j\,\mathcal{N}(x|\mu_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.
Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution, by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, and the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^N \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0})$$
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.
By this approximation, we can estimate the mean of a nonlinear function:
$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^N f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{n:0})$. Let's define
$$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$$
We can then write
$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$$
$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$
By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate
$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^N \tilde{W}\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$$
where the normalized weights are defined as
$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^N W_n\big(x_n^{(j)}\big)}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as
$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})$$
As before, the posterior can be written as
$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$
We can then create a recursive rule to update the weights:
$$W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)}|y_{n:0}\big)}{q\big(x_{n:0}^{(i)}|y_{n:0}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\,p\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)\,q\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.
Let's assume that $q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big) = q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)$; then we can write
$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)}$$
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$
In practice this cannot be computed, so we approximate it as
$$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^N \big(\tilde{W}\big(x_{n:0}^{(i)}\big)\big)^2}$$
When $\hat{N}_{\mathrm{eff}}$ is below a threshold $P$, resampling is performed.
$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
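A two-line sketch of this degeneracy check; the threshold $P = N/2$ is a common but arbitrary choice of ours:

```python
import numpy as np

def effective_sample_size(W):
    """N_eff ~ 1 / sum_i W_i^2 for normalized weights W."""
    W = np.asarray(W)
    return 1.0 / np.sum(W**2)

W = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # skewed normalized weights
if effective_sample_size(W) < 0.5 * len(W):    # threshold P = N/2
    print("weights degenerate: resample")      # fires here (N_eff ~ 1.9 < 2.5)
```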
SIS particle filter with resampling
for $n = 0, \cdots, T$ do
    for $i = 1, \cdots, N$ do
        Draw a sample $x_n^{(i)} \sim q\big(x_n|x_{n-1:0}^{(i)}, y_{n:0}\big)$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for $i = 1, \cdots, N$ do
        Calculate the weight $W_n^{(i)} = W_{n-1}^{(i)}\,p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\,/\,q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)$
    end for
    for $i = 1, \cdots, N$ do
        Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) \,/\, \sum_{j=1}^N W(x^{(j)})$
    end for
    Compute $\hat{N}_{\mathrm{eff}} = 1 \,/\, \sum_{i=1}^N \big(\tilde{W}\big(x_{n:0}^{(i)}\big)\big)^2$
    if $\hat{N}_{\mathrm{eff}} < P$ then
        Generate new $x_n^{(j)}$ by resampling with replacement $N$ times from $\{x_{n:0}^{(i)}\}$, with probability $P\big(x_{n:0}^{(j)} = x_{n:0}^{(i)}\big) = \tilde{W}_n^{(i)}$
        Reset the weights $W_n^{(i)} = \frac{1}{N}$
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for $i = 1, \cdots, N$ do
    Sample $x_0^{(i)} \sim p(x_0)$; set $W_0^{(i)} = \frac{1}{N}$
end for
for $n = 0, \cdots, T$ do
    for $i = 1, \cdots, N$ do
        Importance sampling: $x_n^{(i)} \sim p\big(x_n|x_{n-1}^{(i)}\big)$; set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for $i = 1, \cdots, N$ do
        Weight update: $W_n^{(i)} = p\big(y_n|x_n^{(i)}\big)$
    end for
    for $i = 1, \cdots, N$ do
        Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) \,/\, \sum_{j=1}^N W(x^{(j)})$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
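A compact, runnable sketch of this bootstrap filter on a toy nonlinear model; the dynamics f, measurement g, and noise levels are our own illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 50, 1000
sig_w, sig_v = 1.0, 0.5

f = lambda x: 0.5 * x + 25.0 * x / (1.0 + x**2)   # toy nonlinear dynamics
g = lambda x: x**2 / 20.0                         # toy nonlinear measurement

# simulate a ground-truth trajectory and its observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = f(x_true[n - 1]) + sig_w * rng.normal()
y = g(x_true) + sig_v * rng.normal(size=T)

# bootstrap filter: proposal = transition prior, weight = likelihood p(y_n | x_n)
particles = rng.normal(0.0, 1.0, size=N)
est = np.zeros(T)
for n in range(T):
    particles = f(particles) + sig_w * rng.normal(size=N)   # importance sampling
    logw = -0.5 * ((y[n] - g(particles)) / sig_v) ** 2      # Gaussian log-weights
    W = np.exp(logw - logw.max())
    W /= W.sum()                                            # normalize the weights
    est[n] = np.sum(W * particles)                          # posterior-mean estimate
    particles = particles[rng.choice(N, size=N, p=W)]       # resample every step
print("RMSE:", np.sqrt(np.mean((est - x_true) ** 2)))
```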
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially, and resampling is only performed when needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the weights to
$$W_n = W_{n-1}^{\alpha}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Some useful definitions
Definition
Filtering is an operation that involves the extraction of information abouta quantity of interest at time t by using data measured up to andincluding t
Definition
Prediction derives information about what the quantity of interest will beat time t + τ in the future (τ gt 0) by using data measured up to andincluding time t
Definition
Smoothing derives information about what the quantity of interest attime t prime lt t by using data measured up to and including time t (ie in theinterval [0 t])
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 8 69
Stochastic filtering problem
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
where ut is the system input vector xt the state vector yt the observationswt and vt are the process noise and the measurement noise and f and g arefunctions which are potentially time varying
Figure A graphical model of the state space model (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 9 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of μ_{n-1}, V_{n-1}, and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean μ_n and covariance V_n, as well as the normalization coefficient c_n:

μ_n = A μ_{n-1} + K_n (x_n − C A μ_{n-1})
V_n = (I − K_n C) P_{n-1}
c_n = N(x_n | C A μ_{n-1}, C P_{n-1} C^T + Σ)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + Σ)^{-1}

The initial conditions are given by

μ_1 = μ_0 + K_1 (x_1 − C μ_0),  V_1 = (I − K_1 C) V_0
c_1 = N(x_1 | C μ_0, C V_0 C^T + Σ),  K_1 = V_0 C^T (C V_0 C^T + Σ)^{-1}

Interpretation: we are making predictions and doing corrections with K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
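To make the recursion concrete, here is a minimal NumPy sketch of the forward (filtering) pass above; the model matrices A, C, Γ (Gamma), Σ (Sigma) and the initial μ_0, V_0 are assumed given, and the variable names simply mirror the equations on this slide.

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    # Forward filtering pass for the linear-Gaussian model
    # z_n = A z_{n-1} + w, x_n = C z_n + v; returns filtered means/covariances.
    T, d = X.shape[0], mu0.shape[0]
    mus, Vs = np.zeros((T, d)), np.zeros((T, d, d))
    K = V0 @ C.T @ np.linalg.inv(C @ V0 @ C.T + Sigma)      # K_1
    mus[0] = mu0 + K @ (X[0] - C @ mu0)                      # mu_1
    Vs[0] = (np.eye(d) - K @ C) @ V0                         # V_1
    for n in range(1, T):
        P = A @ Vs[n-1] @ A.T + Gamma                        # P_{n-1}
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)     # Kalman gain K_n
        mus[n] = A @ mus[n-1] + K @ (X[n] - C @ A @ mus[n-1])
        Vs[n] = (np.eye(d) - K @ C) @ P
    return mus, Vs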
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF), or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion:

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = df(x)/dx |_{x = x̂_n}
G_{n+1,n} = dg(x)/dx |_{x = x̂_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique that applies provided one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), so that

\int_X f(x) dP(x) ≈ \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{σ²}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N} (\hat{f}_N − E[f]) ~ N(0, σ²)

where σ² is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
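As a quick illustration of these properties, the following sketch (a toy example, not from the lecture) estimates E[x²] = 1 for x ~ N(0, 1) and reports the O(N^{-1/2}) Monte Carlo standard error:

import numpy as np

rng = np.random.default_rng(0)

N = 100_000
x = rng.standard_normal(N)           # i.i.d. samples from P = N(0, 1)
f = x**2                             # integrand; E[f] = 1 exactly
f_hat = f.mean()                     # Monte Carlo estimate \hat{f}_N
stderr = f.std(ddof=1) / np.sqrt(N)  # O(N^{-1/2}) Monte Carlo error
print(f"{f_hat:.4f} +/- {1.96 * stderr:.4f}")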
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

\int_X f(x) dP(x) ≈ \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = \int f(x) dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and the uncertainties of the model.

Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
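A small sketch of the estimator above, assuming SciPy for the densities; the standard-Gaussian target p and the heavier-tailed Student-t proposal q are illustrative choices (see the remark on heavy tails two slides below):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N = 50_000
x = stats.t.rvs(df=3, size=N, random_state=rng)   # draw from the proposal q
W = stats.norm.pdf(x) / stats.t.pdf(x, df=3)      # importance weights p/q
f_hat = np.mean(W * x**2)                          # estimates E_p[x^2] = 1
print(f_hat)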
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure the importance weights are normalized, we use

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)})  with  \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var[f(x) \frac{p(x)}{q(x)}]
             = \frac{1}{N} \int \frac{(f(x) p(x))^2}{q(x)} dx − \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:

  match the shape of p(x), so as to approximate the true variance;
  match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling proceeds as:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

  if C is too small, the samples are not reliable because of the low rejection rate;
  if C is too large, sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) ≤ C with C known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} ≤ \frac{C q(x)}{p(y)} ≡ C' q(x)

and the acceptance rate for a sample x is p(x|y) / (C' q(x)) = p(y|x) / C.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
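A hedged sketch of the loop above; the Student-t proposal and the bound C = 1.5 are assumptions chosen so that p(x) < C q(x) holds for this particular pair (the maximum of p/q is about 1.17 here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_sample(n, p_pdf, q_sample, q_pdf, C):
    # Exact draws from p, assuming the bound p(x) < C q(x) holds everywhere.
    out = []
    while len(out) < n:
        x = q_sample()
        u = rng.uniform()
        if u <= p_pdf(x) / (C * q_pdf(x)):   # accept with probability p/(C q)
            out.append(x)
    return np.array(out)

samples = rejection_sample(
    1000,
    p_pdf=stats.norm.pdf,                                 # target: N(0, 1)
    q_sample=lambda: stats.t.rvs(df=3, random_state=rng), # proposal: Student-t
    q_pdf=lambda x: stats.t.pdf(x, df=3),
    C=1.5,                                                # assumed valid bound
)
print(samples.mean(), samples.std())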
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) ··· p(x_n|x_0, ..., x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) ··· q_n(x_n|x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain. Instead, many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
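A minimal sketch of the resample-with-replacement step above (this is multinomial resampling; as noted on the next slide, many other resampling variants exist):

import numpy as np

rng = np.random.default_rng(0)

def multinomial_resample(particles, weights):
    # Draw N indices with replacement, proportionally to the normalized
    # weights; survivors come back with equal weights 1/N.
    particles = np.asarray(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))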
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded necessary:

  Deterministic resampling is taken at every k-th time step (usually k = 1).
  Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because:

  if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
  when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an Nx-dimensional state vector x = [x_1, x_2, ..., x_{Nx}]^T, we are interested in drawing samples from the marginal density, in the case where the joint density is inaccessible or hard to sample.

Since each conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to Nx do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{Nx,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
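A small self-contained sketch of the sweep above, for a case where the full conditionals are known exactly: a bivariate Gaussian with unit marginals and correlation ρ, where x_1 | x_2 ~ N(ρ x_2, 1 − ρ²) and symmetrically for x_2 (a standard textbook example, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho, n_iter=5000):
    # Alternate the exact conditionals x1 | x2 ~ N(rho*x2, 1 - rho^2)
    # and x2 | x1 ~ N(rho*x1, 1 - rho^2).
    x1 = x2 = 0.0
    s = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_iter, 2))
    for n in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()
        x2 = rho * x1 + s * rng.standard_normal()
        samples[n] = (x1, x2)
    return samples

draws = gibbs_bivariate_gaussian(rho=0.8)
print(np.corrcoef(draws[1000:].T))   # empirical correlation after burn-in, near 0.8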
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high-dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and further computes the volume:

\int f(x) dx ≈ (2π)^{Nx/2} f(x̂) |−∇∇ log f(x̂)|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
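A 1D sketch of the Laplace approximation above on a toy unnormalized density (the density and the finite-difference step h are assumptions for illustration); it finds the mode, takes a numerical second derivative of log f, and compares the Gaussian-volume estimate against a brute-force Riemann sum:

import numpy as np
from scipy import optimize

log_f = lambda x: -0.5 * x**2 - 0.1 * x**4            # toy unnormalized log-density

res = optimize.minimize_scalar(lambda x: -log_f(x))   # find the mode x_hat
x_hat = res.x

h = 1e-4                                              # finite-difference step
d2 = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# Laplace estimate of \int f(x) dx: (2 pi)^{1/2} f(x_hat) |d2|^{-1/2} in 1D.
Z_laplace = np.sqrt(2 * np.pi) * np.exp(log_f(x_hat)) / np.sqrt(-d2)

xs = np.linspace(-10.0, 10.0, 200_001)                # brute-force check
Z_riemann = np.exp(log_f(xs)).sum() * (xs[1] - xs[0])
print(Z_laplace, Z_riemann)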
Iterative Quadrature
A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) dx ≈ \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
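As an example of this idea with a Gaussian weighting function, Gauss-Hermite quadrature (available in NumPy) computes E[f(x)] for x ~ N(μ, σ²) from m weighted evaluations; the change of variables x = μ + √2 σ u maps the rule's weight e^{−u²} onto the Gaussian:

import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(20)   # 20-point rule

def gauss_hermite_expectation(f, mu, sigma):
    # E[f(x)] for x ~ N(mu, sigma^2) via x = mu + sqrt(2)*sigma*u.
    return (weights * f(mu + np.sqrt(2.0) * sigma * nodes)).sum() / np.sqrt(np.pi)

# E[x^2] = mu^2 + sigma^2 = 5 for x ~ N(1, 4); polynomial integrands are exact.
print(gauss_hermite_expectation(lambda x: x**2, mu=1.0, sigma=2.0))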
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize it into Nz discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so as to give greater resolution to the regions of high density.

The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx ≈ \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j N(x | μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) ≈ \frac{1}{N} \sum_{i=1}^N δ(x_n − x_n^{(i)}) ≡ \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n) δ(x_n − x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) ≡ \hat{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] ≈ \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
          = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) ≡ \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          ∝ \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n). Then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

N_eff = \frac{N}{E_{q(·|y_{n:0})}[(W(x_{n:0}))^2]} ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_eff falls below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
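The approximation above is a one-liner in code; a small sketch:

import numpy as np

def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # normalize first
    return 1.0 / np.sum(w**2)        # N_eff ~= 1 / sum_i w_i^2

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))   # 4.0: healthy
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06: degenerate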
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Compute N_eff = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_eff < P then
    Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
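Putting the pieces together, here is a hedged end-to-end sketch of the algorithm above on an assumed linear-Gaussian toy model (so the transition prior is used as the proposal q and the weight update reduces to the likelihood); all model parameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model:
#   x_n = 0.9 x_{n-1} + w_n,  w_n ~ N(0, 0.5^2)
#   y_n = x_n + v_n,          v_n ~ N(0, 0.7^2)

def norm_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def sis_particle_filter(Y, N=500, threshold=None):
    threshold = N / 2 if threshold is None else threshold    # resampling level P
    x = rng.normal(0.0, 1.0, size=N)       # particles from an assumed p(x_0)
    W = np.full(N, 1.0 / N)
    means = []
    for y in Y:
        x = 0.9 * x + rng.normal(0.0, 0.5, size=N)   # draw from q = p(x_n|x_{n-1})
        W = W * norm_pdf(y, x, 0.7)                  # weight update: the likelihood
        W = W / W.sum()                              # normalize
        means.append(np.sum(W * x))                  # filtered posterior mean
        if 1.0 / np.sum(W**2) < threshold:           # N_eff below P: resample
            x = x[rng.choice(N, size=N, p=W)]
            W = np.full(N, 1.0 / N)                  # reset the weights
    return np.array(means)

# Simulate T observations from the same model, then filter them.
T, x_true, Y = 50, 0.0, []
for _ in range(T):
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)
    Y.append(x_true + rng.normal(0.0, 0.7))
print(sis_particle_filter(np.array(Y))[:5])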
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x̂_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
  end for
  Set x̂_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x̂_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n|x̂_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x̂_n^{(i)}}, according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
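A sketch of the bootstrap/SIR filter above on an assumed nonlinear toy model (a benchmark-style growth model; all parameters are illustrative). Resampling is performed at every step, as the algorithm prescribes:

import numpy as np

rng = np.random.default_rng(1)

Q, R, T, N = 10.0, 1.0, 50, 1000    # assumed noise variances and sizes

def step(x):                        # assumed nonlinear transition mean
    return 0.5 * x + 25.0 * x / (1.0 + x**2)

xs, ys, x = [], [], 0.0             # simulate a trajectory and observations
for _ in range(T):
    x = step(x) + rng.normal(0.0, np.sqrt(Q))
    xs.append(x)
    ys.append(x**2 / 20.0 + rng.normal(0.0, np.sqrt(R)))

p = rng.normal(0.0, 1.0, size=N)    # initial particles
est = []
for y in ys:
    p = step(p) + rng.normal(0.0, np.sqrt(Q), size=N)          # propagate with the prior
    w = np.exp(-0.5 * (y - p**2 / 20.0) ** 2 / R) + 1e-300     # likelihood (underflow guard)
    w = w / w.sum()
    est.append(np.sum(w * p))                                  # filtered mean
    p = p[rng.choice(N, size=N, p=w)]                          # resample every step
print(np.mean((np.array(est) - np.array(xs)) ** 2))            # filter MSE on this run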
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^α \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < α < 1 is an annealing factor that controls the impact of previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\big(x^{(i)}\big) = \hat{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x)\,dP(x)$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx$$

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W(x^{(i)})f(x^{(i)})$$

where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant

To ensure that the importance weights are normalized,

$$\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$$

The variance of the estimate is given by

$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}\big[f(x)W(x)\big] = \frac{1}{N}\mathrm{Var}\Big[f(x)\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \left(\frac{f(x)p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:

match the shape of $p(x)$, so as to approximate the true variance;

match the shape of $|f(x)|p(x)$, so as to further reduce the true variance

The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
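A minimal Python sketch of self-normalized importance sampling under toy assumptions (a standard-Gaussian target known only up to its normalizing constant, and a wider Gaussian proposal; neither comes from the slides); the weights $W = \tilde{p}/q$ are normalized exactly as above.

import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target p(x): a standard Gaussian whose constant we pretend not to know.
p_tilde = lambda x: np.exp(-0.5 * x**2)

# Proposal q(x): zero-mean Gaussian with sigma_q = 2 (heavier tails than the target).
sigma_q = 2.0
q_pdf = lambda x: np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, sigma_q, N)   # N independent samples from q
W = p_tilde(x) / q_pdf(x)         # importance weights, up to a constant
W_tilde = W / W.sum()             # normalized weights

f = lambda x: x**2                # estimate E_p[x^2], whose true value is 1
print(np.sum(W_tilde * f(x)))     # biased but consistent estimate, ~1.0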
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
It can be used when it is difficult to sample from the true probability distribution directly

The proposal distribution $q(x)$ should have a heavy tail, so as to be insensitive to outliers

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$; the sampling proceeds as follows:
for n = 1 to N do
    Sample $u \sim U(0, 1)$
    Sample $x \sim q(x)$
    if $u > \frac{p(x)}{Cq(x)}$ then
        Repeat sampling
    end if
end for
Figure Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;

if C is too large, the sampling is inefficient since the acceptance rate will be low

If the prior $p(x)$ is used as $q(x)$, the likelihood $p(y|x) \le C$, and $C$ is known, then

$$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$$

and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
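A minimal Python sketch of the accept/reject loop, assuming a toy target p (a two-component Gaussian mixture) with a Gaussian proposal q and a bound C picked by hand (all choices hypothetical); only the pointwise bound $p(x) < Cq(x)$ is needed.

import numpy as np

rng = np.random.default_rng(2)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))

p = lambda x: 0.5 * gauss(x, -2.0, 1.0) + 0.5 * gauss(x, 2.0, 1.0)  # target
q = lambda x: gauss(x, 0.0, 3.0)                                    # proposal
C = 3.0  # hand-picked so that p(x) < C q(x) everywhere

def rejection_sample(N):
    samples = []
    while len(samples) < N:
        x = rng.normal(0.0, 3.0)           # x ~ q
        u = rng.uniform(0.0, 1.0)          # u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):         # accept with probability p / (C q)
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(10_000)           # exact draws from p
print(draws.mean(), draws.std())           # ~0 and ~sqrt(5) = 2.24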
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratio $p(x)/(Cq(x))$ is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space

A natural way to alleviate this problem is to construct the proposal distribution sequentially: this is sequential importance sampling

If the proposal distribution is chosen in a factorized form

$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

$$p(x_{0:n}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\dots,x_{n-1})$$
$$q(x_{0:n}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\dots,x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\,\frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm only a few (or one) of the $W(x^{(i)})$ will be nonzero

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
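A small numerical sketch of this degeneracy, under assumed toy choices (a Gaussian random-walk target with a deliberately mismatched random-walk proposal; none of the constants come from the slides): the largest normalized weight grows toward 1, so effectively a single particle survives.

import numpy as np

rng = np.random.default_rng(3)

N, T = 500, 30
x = rng.normal(0.0, 1.0, N)    # particles at time 0
logW = np.zeros(N)             # log importance weights (log domain for stability)

def log_gauss(x, mu, s):
    return -0.5 * ((x - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

for t in range(T):
    x_new = x + rng.normal(0.0, 1.5, N)   # proposal q_t: random walk with sigma = 1.5
    # Target transition p(x_t | x_{t-1}): random walk with sigma = 1.0 (mismatched).
    logW += log_gauss(x_new, x, 1.0) - log_gauss(x_new, x, 1.5)
    x = x_new
    W = np.exp(logW - logW.max())
    W /= W.sum()
    print(t, round(W.max(), 3))   # the maximum normalized weight keeps growing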
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights
Sample N random samples $\{x^{(i)}\}_{i=1}^{N}$ from $q(x)$
for i = 1, ..., N do
    $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^{N}$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
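A minimal Python sketch of the resampling step itself: multinomial resampling with replacement according to the normalized weights, after which all weights are reset to 1/N.

import numpy as np

rng = np.random.default_rng(4)

def resample_multinomial(particles, weights):
    # Draw N indices with replacement, proportionally to the normalized weights.
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)   # equal weights after resampling

# Usage: particles with large weights get duplicated, small ones tend to vanish.
particles = np.array([-1.0, 0.0, 1.0, 2.0])
weights = np.array([0.70, 0.20, 0.05, 0.05])
print(resample_multinomial(particles, weights))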
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1)

Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e. $W^{(i)} = 1/N$)

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation

There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \dots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample

Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \dots, x_{i-1,n}, x_{i+1,n-1}, \dots, x_{N_x,n-1})$
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
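A minimal Python sketch of the sampler for a toy case where the conditionals are known exactly: a zero-mean bivariate Gaussian with correlation rho, for which $x_i | x_j \sim \mathcal{N}(\rho x_j, 1 - \rho^2)$ (the model is an assumption for illustration, not from the slides).

import numpy as np

rng = np.random.default_rng(5)

rho = 0.8                     # correlation of the target N(0, [[1, rho], [rho, 1]])
s = np.sqrt(1.0 - rho**2)     # std of each one-dimensional conditional

M = 10_000
x1, x2 = 0.0, 0.0             # arbitrary starting state
samples = np.empty((M, 2))

for n in range(M):
    x1 = rng.normal(rho * x2, s)   # draw x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, s)   # draw x2 ~ p(x2 | x1), using the fresh x1
    samples[n] = (x1, x2)

# Empirical correlation after burn-in should be close to rho.
print(np.corrcoef(samples[2000:].T))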
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space

Stratified sampling works very well and is efficient in spaces of not-too-high dimension

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior

This is good since the gradient direction might indicate the way to find states with higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$ and further computing the volume:

$$\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,\big|{-\nabla\nabla \log f(\hat{x})}\big|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
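A minimal one-dimensional Python sketch of the Laplace approximation, for an assumed toy integrand: locate the mode on a grid, estimate the second derivative of log f there by finite differences, and apply the volume formula above.

import numpy as np

# Toy unnormalized integrand: unimodal but skewed.
f = lambda x: np.exp(-0.5 * x**2) * (1.0 + np.tanh(2.0 * x))
logf = lambda x: np.log(f(x))

# Crude stand-in for an optimizer: find the mode x_hat on a dense grid.
grid = np.linspace(-5.0, 5.0, 200_001)
x_hat = grid[np.argmax(f(grid))]

# Second derivative of log f at the mode via central finite differences.
h = 1e-4
d2 = (logf(x_hat + h) - 2.0 * logf(x_hat) + logf(x_hat - h)) / h**2

# Laplace: int f(x) dx ~ (2 pi)^{1/2} f(x_hat) |d^2 log f(x_hat)/dx^2|^{-1/2}
laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * np.abs(d2) ** -0.5
exact = np.sum(f(grid)) * (grid[1] - grid[0])   # dense Riemann sum as reference
print(laplace, exact)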
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$

This method can produce a good approximation if the nonlinear function is smooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
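A minimal Python sketch of one classical quadrature rule of this form, Gauss-Hermite, where the weighting function is a Gaussian: after a change of variables, $E[f(x)]$ for $x \sim \mathcal{N}(\mu, \sigma^2)$ is a short weighted sum over m points (the test function and constants are hypothetical).

import numpy as np

# Nodes t_k and weights c_k for int f(t) exp(-t^2) dt ~ sum_k c_k f(t_k).
m = 10
t, c = np.polynomial.hermite.hermgauss(m)

def gh_expectation(f, mu, sigma):
    # E[f(x)], x ~ N(mu, sigma^2), via the substitution x = mu + sqrt(2) * sigma * t.
    x = mu + np.sqrt(2.0) * sigma * t
    return np.sum(c * f(x)) / np.sqrt(np.pi)

# Smooth nonlinear test: E[cos(x)] for x ~ N(0.3, 0.5^2) is exp(-sigma^2/2) cos(mu).
mu, sigma = 0.3, 0.5
print(gh_expectation(np.cos, mu, sigma))
print(np.exp(-sigma**2 / 2.0) * np.cos(mu))   # closed-form reference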
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states, and then a grid-based method can be used to approximate the posterior density

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so one cannot give greater resolution to the regions where the density is high

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets approximating the moments of the density, including the mean, covariance, and higher-order moments

We can empirically use the sample moment to approximate the true moment, namely

$$\hat{m}_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N}\big(x^{(i)}\big)^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are the samples from the true distribution

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\,\mu_j, \Sigma_j)$$

where the weighting coefficients $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
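As a small illustration of this tractability, the overall mean and covariance of a Gaussian sum follow in closed form from the component moments; the sketch below uses a hypothetical 2-D, two-component mixture.

import numpy as np

# Hypothetical Gaussian sum p(x) = sum_j c_j N(x; mu_j, Sigma_j), m = 2 components.
c = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0],
               [2.0, 1.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

mean = np.einsum("j,jd->d", c, mu)   # overall mean: sum_j c_j mu_j

# Overall covariance: sum_j c_j (Sigma_j + mu_j mu_j^T) - mean mean^T
second = np.einsum("j,jde->de", c, Sigma + np.einsum("jd,je->jde", mu, mu))
cov = second - np.outer(mean, mean)

print(mean)   # [1.4, 0.7]
print(cov)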
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$

With this approximation we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define

$$W_n(x_n) = \frac{p(y_{0:n}|x_n)p(x_n)}{q(x_n|y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\,q(x_n|y_{0:n})\,dx_n = \int f(x_n)\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n|y_{0:n})\,dx_n$$

$$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)p(x_n)\,dx_n} = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)q(x_n|y_{0:n})\,dx_n} = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N}\tilde{W}\big(x_n^{(i)}\big)f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\,\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p\big(x_{0:n}^{(i)}|y_{0:n}\big)}{q\big(x_{0:n}^{(i)}|y_{0:n}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\,p\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big)\,q\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big)}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$

Let's assume that $q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}\big) = q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases

After some iterations only very few particles have non-zero importance weights; this is often called weight degeneracy or sample impoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step

A suggested measure of degeneracy is the so-called effective sample size:

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}\big[(W(x_{0:n}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{0:n})\big)^2}$$

When $\hat{N}_{\text{eff}}$ is below a threshold P, resampling is performed

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
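This approximation is one line of code; a minimal Python sketch, assuming the unnormalized weights are kept in log form for numerical stability:

import numpy as np

def effective_sample_size(logW):
    # N_eff ~ 1 / sum_i W_tilde_i^2, computed from unnormalized log-weights.
    W = np.exp(logW - np.max(logW))   # subtract the max before exponentiating
    W /= W.sum()                      # normalized weights W_tilde
    return 1.0 / np.sum(W**2)

logW = np.log(np.array([0.5, 0.3, 0.1, 0.1]))
print(effective_sample_size(logW))   # ~2.78, between 1 (degenerate) and N = 4 (uniform)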
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
        Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate the weight $W_n^{(i)} = W_{n-1}^{(i)}\,p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,/\,q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Compute $\hat{N}_{\text{eff}} = 1 / \sum_{i=1}^{N}(\tilde{W}(x_{0:n}))^2$
    if $\hat{N}_{\text{eff}} < P$ then
        Generate new $x_{0:n}^{(j)}$ by resampling with replacement N times from $\{x_{0:n}^{(i)}\}$, with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
        Reset the weights to $W_n^{(i)} = 1/N$
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
        Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
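A minimal end-to-end Python sketch of the bootstrap/SIR filter for an assumed toy model (a 1-D Gaussian random-walk state observed in Gaussian noise; none of the constants come from the slides). Because the transition prior is the proposal, the weight update is just the likelihood, and resampling is performed at every step.

import numpy as np

rng = np.random.default_rng(6)

# Assumed model: x_{n+1} = x_n + w_n, w_n ~ N(0, q^2); y_n = x_n + v_n, v_n ~ N(0, r^2).
q, r, T, N = 1.0, 0.5, 50, 1000

x_true = np.cumsum(rng.normal(0.0, q, T))   # simulated ground-truth trajectory
y = x_true + rng.normal(0.0, r, T)          # simulated observations

particles = rng.normal(0.0, 1.0, N)         # x_0^(i) ~ p(x_0)
estimates = np.empty(T)

for n in range(T):
    # Importance sampling from the transition prior p(x_n | x_{n-1}).
    particles = particles + rng.normal(0.0, q, N)
    # Weight update: W_n^(i) = p(y_n | x_n^(i)), a Gaussian likelihood.
    logW = -0.5 * ((y[n] - particles) / r) ** 2
    W = np.exp(logW - logW.max())
    W /= W.sum()
    # Posterior mean estimate, computed before resampling.
    estimates[n] = np.sum(W * particles)
    # Resampling: draw N new particles according to the weights.
    particles = particles[rng.choice(N, size=N, replace=True, p=W)]

print(np.sqrt(np.mean((estimates - x_true) ** 2)))   # RMSE against the truth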
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist

To alleviate the sample degeneracy in the SIS filter, we can change the weights to

$$W_n = W_{n-1}^{\alpha}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n\big)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Figure: Importance (left) and rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the bound p(x) < C q(x) may fail and the samples are not reliable
if C is too large, the sampling is inefficient since the acceptance rate will be low
If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then
p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)
and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
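A minimal Python sketch of the loop above, under the same illustrative target/proposal pair as before; the constant C = 2.1 is an assumption that is valid only for this particular pair (where max p/q = 2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target p(x) = N(0,1); proposal q(x) = N(0, 2^2).
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
q_sigma = 2.0
q = lambda x: np.exp(-0.5 * (x / q_sigma) ** 2) / (q_sigma * np.sqrt(2.0 * np.pi))
C = 2.1  # here max_x p(x)/q(x) = 2, so p(x) < C q(x) holds everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(0.0, q_sigma)    # x ~ q(x)
        u = rng.uniform()               # u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):      # accept with probability p/(C q)
            samples.append(x)           # otherwise: repeat sampling
    return np.array(samples)

xs = rejection_sample(5_000)            # exact draws from p(x)
print(xs.mean(), xs.std())              # ~0 and ~1
```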
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling (SIS).
If the proposal distribution is chosen in a factorized form
q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
p(x_{n:0}) = p(x_0) p(x_1|x_0) \cdots p(x_n | x_0, \cdots, x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) \cdots q_n(x_n | x_0, \cdots, x_{n-1})
The weights can be recursively calculated as
W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n | x_{n-1:0})}{q_n(x_n | x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
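In code, this recursion amounts to multiplying each particle's trajectory weight by an incremental factor at every step. A schematic sketch; p_trans and q_prop are placeholder callables to be supplied by the model, not functions defined in the lecture.

```python
import numpy as np

def sis_weight_update(weights, new_states, histories, p_trans, q_prop):
    """One SIS step: W_n = W_{n-1} * p(x_n | x_{n-1:0}) / q_n(x_n | x_{n-1:0}).

    p_trans(x_n, hist) and q_prop(x_n, hist) return the target and proposal
    densities of the new state given the particle's history (assumptions)."""
    w = np.array(weights, dtype=float)
    for i, (x_n, hist) in enumerate(zip(new_states, histories)):
        w[i] *= p_trans(x_n, hist) / q_prop(x_n, hist)
    return w
```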
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.
We will now see that, to cope with this situation, a resampling step is used after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights.
Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
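A minimal sketch of the resampling step (multinomial resampling via NumPy's choice; the helper name is mine):

```python
import numpy as np

def resample(particles, weights, rng):
    """Multinomial resampling: draw N indices with replacement, with
    probability proportional to the normalized importance weights."""
    particles = np.asarray(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize the weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    # survivors carry equal weights 1/N afterwards
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(0)
parts, w = resample(np.array([0.0, 1.0, 2.0, 3.0]), [0.7, 0.1, 0.1, 0.1], rng)
```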
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e., W^{(i)} = 1/N).
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It is a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
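A minimal sketch of a Gibbs sampler for a toy two-dimensional Gaussian with correlation rho (my example, not the lecture's), where both full conditionals are one-dimensional Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a zero-mean bivariate Gaussian with unit variances and correlation rho:
#   x1 | x2 ~ N(rho * x2, 1 - rho^2)   and symmetrically for x2 | x1.
rho, M = 0.8, 5_000
cond_sd = np.sqrt(1.0 - rho**2)
x1, x2 = 0.0, 0.0
samples = np.empty((M, 2))

for n in range(M):
    x1 = rng.normal(rho * x2, cond_sd)   # draw x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, cond_sd)   # draw x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)

print(np.corrcoef(samples[1000:].T))     # empirical correlation ~ rho
```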
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
The Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x) and further computing the volume:
\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) | -\nabla\nabla \log f(\hat{x}) |^{-1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
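A sketch of the Laplace approximation on a toy one-dimensional integral; the Gamma-like integrand is my choice, and the Hessian is replaced by a finite-difference second derivative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy unnormalized density f(x) = x^3 exp(-x) on (0, inf);
# the exact integral is Gamma(4) = 6.
f = lambda x: x**3 * np.exp(-x)
log_f = lambda x: 3.0 * np.log(x) - x

# find the mode x_hat of f (equivalently, of log f)
x_hat = minimize_scalar(lambda x: -log_f(x), bounds=(1e-6, 50.0),
                        method="bounded").x

# second derivative of log f at the mode, by central differences
h = 1e-4
d2 = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# int f(x) dx ~ (2 pi)^(1/2) f(x_hat) |-(d^2/dx^2) log f(x_hat)|^(-1/2)
approx = np.sqrt(2.0 * np.pi) * f(x_hat) * abs(d2) ** -0.5
print(approx)   # ~5.8 versus the exact value 6
```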
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)
where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
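A sketch of this idea using Gauss-Hermite quadrature (one standard instance of such a rule), where the weighting function is the standard Gaussian density; the integrand cos(x) is an illustrative choice:

```python
import numpy as np

# Approximate E[f(x)] for x ~ N(0,1) with an m-point quadrature rule.
m = 10
nodes, weights = np.polynomial.hermite.hermgauss(m)  # weight function exp(-t^2)

f = np.cos   # a smooth integrand; exactly, E[cos(x)] = exp(-1/2)

# the change of variables x = sqrt(2) t turns exp(-t^2) into the N(0,1) weight
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(approx, np.exp(-0.5))   # both ~0.6065
```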
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions with high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
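A minimal sketch of the sample-moment approximation on illustrative N(0,1) draws:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)   # samples x^(i) from the true density

# sample moments m_k ~ (1/N) sum_i (x^(i))^k
moments = {k: np.mean(x**k) for k in range(1, 5)}
print(moments)   # for N(0,1): m1 ~ 0, m2 ~ 1, m3 ~ 0, m4 ~ 3
```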
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(x; \mu_j, \Sigma_j)
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
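A minimal sketch of evaluating such a Gaussian mixture; the weights, means, and variances are arbitrary illustrative values:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# weighted sum of m = 3 Gaussians; c_j > 0 and sum_j c_j = 1
c     = np.array([0.5, 0.3, 0.2])
mu    = np.array([-1.0, 0.5, 2.0])
sigma = np.array([0.7, 0.5, 1.0])

mixture = lambda x: np.sum(c * gaussian(x, mu, sigma))  # p(x) at a scalar x

xs = np.linspace(-4.0, 5.0, 10)
print([round(mixture(x), 4) for x in xs])   # mixture density on a grid
```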
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure; the higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
= \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
= \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define
W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}
We can then write
E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
= \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
= \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
= \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)
where the normalized weights are defined as
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as
q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})
As before, the posterior can be written as
p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).
Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N
In practice this cannot be computed, so we approximate it as
N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}
When N_{eff} falls below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
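A minimal sketch of this approximation; the helper name and the example weight vectors are mine:

```python
import numpy as np

def effective_sample_size(weights):
    """N_eff ~ 1 / sum_i w_i^2 for normalized importance weights w_i."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

print(effective_sample_size(np.ones(100)))           # even weights: N_eff = 100
print(effective_sample_size([0.97] + [0.001] * 30))  # one dominant weight: ~1
```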
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})
        Reset the weights W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
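Putting the pieces together, here is a minimal sketch of this bootstrap/SIR filter on a toy one-dimensional linear-Gaussian model (chosen purely for illustration; for such a model a Kalman filter would be exact, which makes the sketch easy to sanity-check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an illustrative assumption, not from the lecture):
#   x_n = a x_{n-1} + w_n,  w_n ~ N(0, sig_w^2)   (state equation)
#   y_n = x_n + v_n,        v_n ~ N(0, sig_v^2)   (measurement equation)
a, sig_w, sig_v, N, T = 0.9, 0.5, 1.0, 1_000, 50

# simulate a ground-truth trajectory and its observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + rng.normal(0.0, sig_w)
y = x_true + rng.normal(0.0, sig_v, size=T)

particles = rng.normal(0.0, 1.0, size=N)    # x_0^(i) ~ p(x_0)
x_est = np.zeros(T)
for n in range(T):
    # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = a * particles + rng.normal(0.0, sig_w, size=N)
    # weight update: W_n^(i) = p(y_n | x_n^(i)), then normalization
    W = np.exp(-0.5 * ((y[n] - particles) / sig_v) ** 2)
    W /= W.sum()
    x_est[n] = np.sum(W * particles)         # MMSE estimate before resampling
    # resampling: N new particles drawn according to W; weights reset to 1/N
    particles = particles[rng.choice(N, size=N, p=W)]

print(np.mean((x_est - x_true) ** 2))        # filtering MSE on this toy run
```

Using the transition prior as the proposal makes the weight update collapse to the likelihood, which is exactly the simplification the algorithm above exploits.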
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is taken only whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution in the SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
Figure Careful today change of notation z is now x and x is now y
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Simplified model discrete case
The generic stochastic filtering problem
xt = f(t xt ut wt) (state equation)
yt = g(t xt ut vt) (measurement equation)
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
This equations are characterized by the state transition probabilityp(xn+1|xn) and the likelihood p(yn|xn)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given yn0 provided f and g are known one needs to find the bestestimate xn
This is an inverse problem Find the inputs sequentially with amapping function which yields the output data
This is an ill-posed problem since the inverse learning problem isone-to-many the mapping from output to input is generallynon-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization Given the prior p(x) and the likelihood p(y|x) theposterior p(x|y) is obtained by dividing by the normalization factorp(y)
p(x|y) =p(y|x)p(x)int
X p(y|x)p(x)dx
Marginalization Given the joint posterior the marginal posterior
p(x|y) =
intZ
p(x z|y)dz
Expectation
Ep(x|y)[f (x)] =
intX
f (x)p(x|y)dy
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It is a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling
Given an $N_x$-dimensional state vector $\mathbf{x} = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models

Draw a sample $\mathbf{x}_0 \sim p(\mathbf{x}_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
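A minimal sketch of this alternating conditional sampling, for a hypothetical zero-mean bivariate Gaussian target where both conditionals are available in closed form (the target and all constants are assumptions chosen for illustration):

```python
import numpy as np

# Gibbs sampling for an assumed bivariate standard Gaussian with correlation rho:
# both full conditionals p(x1|x2) and p(x2|x1) are 1-D Gaussians.
rho = 0.8
rng = np.random.default_rng(2)

M = 10_000
samples = np.zeros((M, 2))
x1, x2 = 0.0, 0.0                        # arbitrary starting state
cond_std = np.sqrt(1.0 - rho ** 2)       # std of each conditional

for n in range(M):
    x1 = rng.normal(rho * x2, cond_std)  # draw x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, cond_std)  # draw x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)

burn = 1000
print("empirical correlation:", np.corrcoef(samples[burn:].T)[0, 1])  # ~0.8
```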
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $\mathbf{x}_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space
Stratified sampling works very well and is efficient in a not-too-high dimensional space
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior
This is good since the gradient direction might indicate the way to find the state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (ie particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability
By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method
The Laplace approximation method approximates the integral of a function $\int f(\mathbf{x})\,d\mathbf{x}$ by fitting a Gaussian at the maximum $\hat{\mathbf{x}}$ of $f(\mathbf{x})$, and further computes the volume
$$\int f(\mathbf{x})\,d\mathbf{x} \approx (2\pi)^{N_x/2}\, f(\hat{\mathbf{x}})\,\big|{-\nabla\nabla \log f(\hat{\mathbf{x}})}\big|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(\mathbf{x})$ at $\hat{\mathbf{x}}$
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate
It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula
$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^m c_k\, f(x_k)$$
where p(x) is treated as a weighting function and $x_k$ is the quadrature point
The values $x_k$ are determined by the weighting function p(x) in the interval [a, b]
This method can produce a good approximation if the nonlinear function is smooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
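For instance, Gauss-Hermite quadrature handles Gaussian weighting functions; a small sketch, where the integrand f is an arbitrary smooth example of my choosing:

```python
import numpy as np

# Gauss-Hermite quadrature: approximate E[f(x)] for x ~ N(0, 1).
# hermgauss gives nodes/weights for the weight exp(-t^2); substituting
# x = sqrt(2) t turns it into an expectation under a standard normal.
nodes, weights = np.polynomial.hermite.hermgauss(10)

f = np.cos                                   # an arbitrary smooth integrand
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

print(approx)           # ~ exp(-1/2) = 0.6065..., the exact E[cos(x)]
```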
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(\mathbf{x}_n|\mathbf{y}_{0:n})$
If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; then a grid-based method can be further used to approximate the posterior density
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions of high density
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
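A point-mass sketch of the recursive Bayesian update on a fixed 1-D grid; the linear-Gaussian model used here is an assumption, chosen so the answer could be sanity-checked against a Kalman filter:

```python
import numpy as np
from scipy.stats import norm

# Point-mass (grid) filter for x_{n+1} = 0.9 x_n + w,  y_n = x_n + v
grid = np.linspace(-10, 10, 401)                 # discretized state space
dx = grid[1] - grid[0]
post = norm.pdf(grid, 0.0, 2.0)                  # prior p(x_0), on the grid
post /= post.sum() * dx

# Transition kernel p(x_n | x_{n-1}) evaluated on the grid (matrix form)
trans = norm.pdf(grid[:, None], 0.9 * grid[None, :], 1.0)

for y in [1.2, 0.4, -0.3]:                       # a few made-up observations
    pred = trans @ post * dx                     # prediction: integral of p(x|x') p(x') dx'
    post = norm.pdf(y, grid, 0.5) * pred         # update: likelihood x prior
    post /= post.sum() * dx                      # normalize (evidence)

print("filtered mean:", np.sum(grid * post) * dx)
```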
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments
We can empirically use the sample moments to approximate the true moments, namely
$$m_k = E[\mathbf{x}^k] = \int_X \mathbf{x}^k\, p(\mathbf{x})\,d\mathbf{x} \approx \frac{1}{N}\sum_{i=1}^N \big(\mathbf{x}^{(i)}\big)^k$$
where $m_k$ denotes the k-th order moment and the $\mathbf{x}^{(i)}$ are samples from the true distribution
The computation cost of these approaches is rather prohibitive, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
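A two-line illustration of matching sample moments to true ones, with a standard normal as the assumed distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=100_000)   # samples from an assumed N(0, 1)

# k-th order sample moments m_k = mean(x^k); true values are 0, 1, 0, 3
for k in (1, 2, 3, 4):
    print(k, np.mean(x ** k))
```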
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)
$$p(\mathbf{x}) = \sum_{j=1}^m c_j\, \mathcal{N}(\hat{\mathbf{x}}_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
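A tiny sketch of evaluating and sampling such a mixture; the component parameters are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import norm

# A 1-D Gaussian mixture: weights c_j, means, stds (arbitrary example values)
c = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
sd = np.array([0.8, 1.2])

def mixture_pdf(x):
    # Weighted sum of Gaussian densities, evaluated pointwise
    return np.sum(c * norm.pdf(x[..., None], mu, sd), axis=-1)

rng = np.random.default_rng(4)
j = rng.choice(len(c), size=10_000, p=c)       # pick a component
draws = rng.normal(mu[j], sd[j])               # then sample within it

print(mixture_pdf(np.array([0.0])), draws.mean())  # density at 0, sample mean
```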
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter
The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated
The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution
$$p(\mathbf{x}_n|\mathbf{y}_{0:n}) \approx \frac{1}{N}\sum_{i=1}^N \delta\big(\mathbf{x}_n - \mathbf{x}_n^{(i)}\big) \equiv \hat{p}(\mathbf{x}_n|\mathbf{y}_{0:n})$$
where the $\mathbf{x}_n^{(i)}$ are assumed to be iid drawn from $p(\mathbf{x}_n|\mathbf{y}_{0:n})$
By this approximation, we can estimate the mean of a nonlinear function
$$E[f(\mathbf{x}_n)] \approx \int f(\mathbf{x}_n)\,\hat{p}(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^N \int f(\mathbf{x}_n)\,\delta\big(\mathbf{x}_n - \mathbf{x}_n^{(i)}\big)\,d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^N f\big(\mathbf{x}_n^{(i)}\big) \equiv \bar{f}_N(\mathbf{x})$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(\mathbf{x}_n|\mathbf{y}_{0:n})$. Let's define
$$W_n(\mathbf{x}_n) = \frac{p(\mathbf{y}_{0:n}|\mathbf{x}_n)\,p(\mathbf{x}_n)}{q(\mathbf{x}_n|\mathbf{y}_{0:n})}$$
We can then write
$$E[f(\mathbf{x}_n)] = \int f(\mathbf{x}_n)\,\frac{p(\mathbf{x}_n|\mathbf{y}_{0:n})}{q(\mathbf{x}_n|\mathbf{y}_{0:n})}\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n = \int f(\mathbf{x}_n)\,\frac{W_n(\mathbf{x}_n)}{p(\mathbf{y}_{0:n})}\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n$$
$$= \frac{\int f(\mathbf{x}_n)\,W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}{\int p(\mathbf{y}_{0:n}|\mathbf{x}_n)\,p(\mathbf{x}_n)\,d\mathbf{x}_n} = \frac{\int f(\mathbf{x}_n)\,W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}{\int W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}$$
$$= \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)\,f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
$$E[f(\mathbf{x}_n)] = \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)\,f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)]}$$
By drawing iid samples $\mathbf{x}_n^{(i)}$ from $q(\mathbf{x}_n|\mathbf{y}_{0:n})$, we can approximate
$$E[f(\mathbf{x}_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n(\mathbf{x}_n^{(i)})\,f(\mathbf{x}_n^{(i)})}{\frac{1}{N}\sum_{i=1}^N W_n(\mathbf{x}_n^{(i)})} = \sum_{i=1}^N \tilde{W}(\mathbf{x}_n^{(i)})\,f(\mathbf{x}_n^{(i)}) \equiv \hat{f}(\mathbf{x})$$
where the normalized weights are defined as
$$\tilde{W}(\mathbf{x}_n^{(i)}) = \frac{W_n(\mathbf{x}_n^{(i)})}{\sum_{j=1}^N W_n(\mathbf{x}_n^{(j)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
$$q(\mathbf{x}_{0:n}|\mathbf{y}_{0:n}) = q(\mathbf{x}_0)\prod_{t=1}^n q(\mathbf{x}_t|\mathbf{x}_{0:t-1}, \mathbf{y}_{0:t})$$
As before, the posterior can be written as
$$p(\mathbf{x}_{0:n}|\mathbf{y}_{0:n}) = p(\mathbf{x}_{0:n-1}|\mathbf{y}_{0:n-1})\,\frac{p(\mathbf{y}_n|\mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{x}_{n-1})}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})}$$
We can then create a recursive rule to update the weights
$$W_n^{(i)} = \frac{p(\mathbf{x}_{0:n}^{(i)}|\mathbf{y}_{0:n})}{q(\mathbf{x}_{0:n}^{(i)}|\mathbf{y}_{0:n})} \propto \frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})\,p(\mathbf{x}_{0:n-1}^{(i)}|\mathbf{y}_{0:n-1})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_{0:n})\,q(\mathbf{x}_{0:n-1}^{(i)}|\mathbf{y}_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_{0:n})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(\mathbf{x}_n|\mathbf{y}_{0:n})$ than in $p(\mathbf{x}_{0:n}|\mathbf{y}_{0:n})$
Let's assume that $q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_{0:n}) = q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_n)$; then we can write
$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_n)}$$
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|\mathbf{y}_{0:n})}\big[(W(\mathbf{x}_{0:n}))^2\big]} \le N$$
In practice this cannot be computed, so we approximate it as
$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^N \big(\tilde{W}(\mathbf{x}_{0:n}^{(i)})\big)^2}$$
When $\hat{N}_{\text{eff}}$ is below a threshold P, resampling is performed
$N_{\text{eff}}$ can also be used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
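A sketch of the approximate effective sample size and the resampling trigger; the threshold value N/2 is an arbitrary but common choice, not prescribed by the slides:

```python
import numpy as np

def effective_sample_size(normalized_weights):
    # N_eff ~ 1 / sum_i w_i^2; equals N for uniform weights, 1 for a single spike
    return 1.0 / np.sum(normalized_weights ** 2)

w = np.array([0.96, 0.01, 0.01, 0.01, 0.01])     # badly skewed weights
print(effective_sample_size(w))                   # ~1.08: degenerate
print(effective_sample_size(np.full(5, 0.2)))     # = 5.0: healthy

# Typical trigger: resample when N_eff < N / 2
```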
SIS particle filter with resampling
for n = 0, ···, T do
    for i = 1, ···, N do
        Draw samples $\mathbf{x}_n^{(i)} \sim q(\mathbf{x}_n|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_{0:n})$
        Set $\mathbf{x}_{0:n}^{(i)} = \{\mathbf{x}_{0:n-1}^{(i)}, \mathbf{x}_n^{(i)}\}$
    end for
    for i = 1, ···, N do
        Calculate the weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_n)}$
    end for
    for i = 1, ···, N do
        Normalize the weights $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)}) / \sum_{j=1}^N W(\mathbf{x}^{(j)})$
    end for
    Compute $\hat{N}_{\text{eff}} = 1 / \sum_{i=1}^N (\tilde{W}(\mathbf{x}_{0:n}^{(i)}))^2$
    if $\hat{N}_{\text{eff}} < P$ then
        Generate new $\mathbf{x}_{0:n}^{(j)}$ by resampling with replacement N times from $\{\mathbf{x}_{0:n}^{(i)}\}$, with probability $P(\mathbf{x}_{0:n}^{(j)} = \mathbf{x}_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
        Reset the weights $W_n^{(i)} = 1/N$
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
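A minimal runnable sketch of this algorithm, under two assumptions of mine: the proposal is the transition prior (so the weight update reduces to the likelihood), and the model is a toy 1-D linear-Gaussian system so the code stays short:

```python
import numpy as np
from scipy.stats import norm

# SIS particle filter with N_eff-triggered resampling on an assumed model:
# x_n = 0.9 x_{n-1} + w_n,  y_n = x_n + v_n, with Gaussian noises.
rng = np.random.default_rng(5)
a, q_std, r_std = 0.9, 1.0, 0.5
T, N = 50, 1000

# Simulate a ground-truth trajectory and observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + rng.normal(0, q_std)
y = x_true + rng.normal(0, r_std, size=T)

particles = rng.normal(0, 2.0, size=N)       # draw from an assumed prior p(x_0)
weights = np.full(N, 1.0 / N)
estimates = []

for n in range(T):
    if n > 0:  # propagate through the transition prior (the proposal)
        particles = a * particles + rng.normal(0, q_std, size=N)
    weights *= norm.pdf(y[n], particles, r_std)    # weight update = likelihood
    weights /= weights.sum()
    estimates.append(np.sum(weights * particles))  # MMSE estimate

    if 1.0 / np.sum(weights ** 2) < N / 2:         # N_eff below threshold?
        idx = rng.choice(N, size=N, replace=True, p=weights)
        particles = particles[idx]
        weights.fill(1.0 / N)

print("RMSE:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))
```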
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ···, N do
    Sample $\mathbf{x}_0^{(i)} \sim p(\mathbf{x}_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ···, T do
    for i = 1, ···, N do
        Importance sampling: $\mathbf{x}_n^{(i)} \sim p(\mathbf{x}_n|\mathbf{x}_{n-1}^{(i)})$
        Set $\mathbf{x}_{0:n}^{(i)} = \{\mathbf{x}_{0:n-1}^{(i)}, \mathbf{x}_n^{(i)}\}$
    end for
    for i = 1, ···, N do
        Weight update: $W_n^{(i)} = p(\mathbf{y}_n|\mathbf{x}_n^{(i)})$
    end for
    for i = 1, ···, N do
        Normalize the weights $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)}) / \sum_{j=1}^N W(\mathbf{x}^{(j)})$
    end for
    Resampling: generate N new particles $\mathbf{x}_n^{(i)}$ from the set $\{\mathbf{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
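Because the transition prior cancels in the weight ratio, one time step of the bootstrap filter collapses to a few lines; a sketch with hypothetical model callbacks `propagate` and `likelihood` that a user would supply:

```python
import numpy as np

def bootstrap_step(particles, y, propagate, likelihood, rng):
    """One bootstrap/SIR step: propagate through the transition prior,
    weight by the likelihood, always resample. `propagate` and
    `likelihood` are model-specific callbacks (assumed interfaces)."""
    particles = propagate(particles, rng)          # x ~ p(x_n | x_{n-1})
    w = likelihood(y, particles)                   # W proportional to p(y_n | x_n)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=w)
    return particles[idx]                          # equally weighted again
```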
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed, so the SIS filter is less computationally expensive
The choice of proposal distribution in the SIS and SIR filters plays a crucial role in their final performance
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used
To alleviate the sample degeneracy in the SIS filter, we can change the update to
$$W_n = W_{n-1}^{\alpha}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)}, \mathbf{y}_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Simplified model: discrete case
The generic stochastic filtering problem
$$\dot{\mathbf{x}}_t = \mathbf{f}(t, \mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t) \quad \text{(state equation)}$$
$$\mathbf{y}_t = \mathbf{g}(t, \mathbf{x}_t, \mathbf{u}_t, \mathbf{v}_t) \quad \text{(measurement equation)}$$
In practice we are interested in the discrete simplified case
$$\mathbf{x}_{n+1} = \mathbf{f}(\mathbf{x}_n, \mathbf{w}_n)$$
$$\mathbf{y}_n = \mathbf{g}(\mathbf{x}_n, \mathbf{v}_n)$$
These equations are characterized by the state transition probability $p(\mathbf{x}_{n+1}|\mathbf{x}_n)$ and the likelihood $p(\mathbf{y}_n|\mathbf{x}_n)$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 10 69
Stochastic filtering is an inverse problem
Given $\mathbf{y}_{0:n}$, provided f and g are known, one needs to find the best estimate $\hat{\mathbf{x}}_n$
This is an inverse problem: find the inputs sequentially with a mapping function which yields the output data
This is an ill-posed problem, since the inverse learning problem is one-to-many: the mapping from output to input is generally non-unique
Definition
A problem is well-posed if it satisfies existence uniqueness and stability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 11 69
Intractable Bayesian problems
Normalization: given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained by dividing by the normalization factor p(y)
$$p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}{\int_X p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}}$$
Marginalization: given the joint posterior, the marginal posterior is
$$p(\mathbf{x}|\mathbf{y}) = \int_Z p(\mathbf{x}, \mathbf{z}|\mathbf{y})\,d\mathbf{z}$$
Expectation:
$$E_{p(\mathbf{x}|\mathbf{y})}[f(\mathbf{x})] = \int_X f(\mathbf{x})\,p(\mathbf{x}|\mathbf{y})\,d\mathbf{x}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 12 69
Recursive Bayesian estimation I
Let $p(\mathbf{x}_n|\mathbf{y}_{0:n})$ be the conditional pdf of $\mathbf{x}_n$. Then
$$p(\mathbf{x}_n|\mathbf{y}_{0:n}) = \frac{p(\mathbf{y}_{0:n}|\mathbf{x}_n)\,p(\mathbf{x}_n)}{p(\mathbf{y}_{0:n})} = \frac{p(\mathbf{y}_n, \mathbf{y}_{0:n-1}|\mathbf{x}_n)\,p(\mathbf{x}_n)}{p(\mathbf{y}_n, \mathbf{y}_{0:n-1})}$$
$$= \frac{p(\mathbf{y}_n|\mathbf{y}_{0:n-1}, \mathbf{x}_n)\,p(\mathbf{y}_{0:n-1}|\mathbf{x}_n)\,p(\mathbf{x}_n)}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})\,p(\mathbf{y}_{0:n-1})} = \frac{p(\mathbf{y}_n|\mathbf{y}_{0:n-1}, \mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{y}_{0:n-1})\,p(\mathbf{y}_{0:n-1})\,p(\mathbf{x}_n)}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})\,p(\mathbf{y}_{0:n-1})\,p(\mathbf{x}_n)}$$
$$= \frac{p(\mathbf{y}_n|\mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{y}_{0:n-1})}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior density is described by three terms
$$p(\mathbf{x}_n|\mathbf{y}_{0:n}) = \frac{p(\mathbf{y}_n|\mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{y}_{0:n-1})}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})}$$
Prior: defines the knowledge of the model
$$p(\mathbf{x}_n|\mathbf{y}_{0:n-1}) = \int p(\mathbf{x}_n|\mathbf{x}_{n-1})\,p(\mathbf{x}_{n-1}|\mathbf{y}_{0:n-1})\,d\mathbf{x}_{n-1}$$
Likelihood: $p(\mathbf{y}_n|\mathbf{x}_n)$ determines the measurement noise model
Evidence: involves the integral
$$p(\mathbf{y}_n|\mathbf{y}_{0:n-1}) = \int p(\mathbf{y}_n|\mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{y}_{0:n-1})\,d\mathbf{x}_n$$
We need to define a criterion for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion
Minimum mean-squared error (MMSE): defined in terms of a prediction or filtering error
$$E\big[\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|_2^2 \,\big|\, \mathbf{y}_{0:n}\big] = \int \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|_2^2\; p(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n$$
which is aimed at finding the conditional mean
$$\hat{\mathbf{x}}_n = E[\mathbf{x}_n|\mathbf{y}_{0:n}] = \int \mathbf{x}_n\, p(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n$$
Maximum a posteriori (MAP): aims at finding the mode of the posterior probability $p(\mathbf{x}_n|\mathbf{y}_{0:n})$
Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected
Minimax: finds the median of the posterior $p(\mathbf{x}_n|\mathbf{y}_{0:n})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
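Given posterior samples, the three point estimates are easy to read off; a sketch where the skewed posterior is an assumed example and the histogram-based mode is a crude stand-in for the true MAP:

```python
import numpy as np

rng = np.random.default_rng(6)
# Assumed posterior samples: a skewed (lognormal) distribution
s = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)

mmse = s.mean()                            # MMSE estimate = posterior mean
minimax = np.median(s)                     # minimax estimate = posterior median
counts, edges = np.histogram(s, bins=200)  # crude MAP: mode of a histogram
map_est = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(mmse, minimax, map_est)              # three different answers when skewed
```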
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion
Minimum conditional inaccuracy: defined as
$$E_{p(\mathbf{x},\mathbf{y})}[-\log \hat{p}(\mathbf{x}|\mathbf{y})] = \int p(\mathbf{x}, \mathbf{y}) \log \frac{1}{\hat{p}(\mathbf{x}|\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y}$$
Minimum conditional KL divergence:
$$KL(p\|\hat{p}) = \int p(\mathbf{x}, \mathbf{y}) \log \frac{p(\mathbf{x}, \mathbf{y})}{\hat{p}(\mathbf{x}|\mathbf{y})\,p(\mathbf{x})}\,d\mathbf{x}\,d\mathbf{y}$$
where the KL is a measure of divergence between distributions, with $KL(p\|\hat{p}) \ge 0$; the KL is 0 only when the two distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion
Minimum free energy: a variational criterion (its negative lower-bounds the log-likelihood), which is aimed to be minimized
$$\mathcal{F}(Q; P) \equiv E_{Q(\mathbf{x})}[-\log P(\mathbf{x}|\mathbf{y})] = E_{Q(\mathbf{x})}\Big[\log \frac{Q(\mathbf{x})}{P(\mathbf{x}|\mathbf{y})}\Big] - E_{Q(\mathbf{x})}[\log Q(\mathbf{x})] = KL(Q\|P) + H(Q)$$
This minimization can be done with an expectation-maximization (EM)-style alternation
$$Q(\mathbf{x}_{n+1}) \leftarrow \arg\min_Q \mathcal{F}(Q; P), \qquad \hat{\mathbf{x}}_{n+1} \leftarrow \arg\min_{\mathbf{x}} \mathcal{F}(Q; P)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior, likelihood and evidence
MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally cheaper
The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small
Hence the width of the mode is more important than its height in the high-dimensional case
The last three criteria are all ML oriented, and they are closely related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE
$$E\big[\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|_2^2 \,\big|\, \mathbf{y}_{0:n}\big] = \int \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|_2^2\; p(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n$$
Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (eg linear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
$$\mathbf{x}_{n+1} = \mathbf{f}(\mathbf{x}_n, \mathbf{w}_n)$$
$$\mathbf{y}_n = \mathbf{g}(\mathbf{x}_n, \mathbf{v}_n)$$
When the dynamic system is linear Gaussian, this reduces to
$$\mathbf{x}_{n+1} = \mathbf{F}_{n+1,n}\,\mathbf{x}_n + \mathbf{w}_n$$
$$\mathbf{y}_n = \mathbf{G}_n\,\mathbf{x}_n + \mathbf{v}_n$$
with $\mathbf{F}_{n+1,n}$ the transition matrix and $\mathbf{G}_n$ the measurement matrix
This is the Kalman filter, and we saw that by propagating sufficient statistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
$$\alpha(\mathbf{z}_n) = \mathcal{N}(\mathbf{z}_n|\boldsymbol{\mu}_n, \mathbf{V}_n)$$
Using the HMM recursion formulas for continuous variables, we have
$$c_n\,\alpha(\mathbf{z}_n) = p(\mathbf{x}_n|\mathbf{z}_n)\int \alpha(\mathbf{z}_{n-1})\,p(\mathbf{z}_n|\mathbf{z}_{n-1})\,d\mathbf{z}_{n-1}$$
Substituting the conditionals, we have
$$c_n\,\mathcal{N}(\mathbf{z}_n|\boldsymbol{\mu}_n, \mathbf{V}_n) = \mathcal{N}(\mathbf{x}_n|\mathbf{C}\mathbf{z}_n, \boldsymbol{\Sigma}) \int \mathcal{N}(\mathbf{z}_{n-1}|\boldsymbol{\mu}_{n-1}, \mathbf{V}_{n-1})\,\mathcal{N}(\mathbf{z}_n|\mathbf{A}\mathbf{z}_{n-1}, \boldsymbol{\Gamma})\,d\mathbf{z}_{n-1} = \mathcal{N}(\mathbf{x}_n|\mathbf{C}\mathbf{z}_n, \boldsymbol{\Sigma})\,\mathcal{N}(\mathbf{z}_n|\mathbf{A}\boldsymbol{\mu}_{n-1}, \mathbf{P}_{n-1})$$
Here we assume that $\boldsymbol{\mu}_{n-1}$ and $\mathbf{V}_{n-1}$ are known, and we have defined
$$\mathbf{P}_{n-1} = \mathbf{A}\mathbf{V}_{n-1}\mathbf{A}^T + \boldsymbol{\Gamma}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of $\boldsymbol{\mu}_{n-1}$, $\mathbf{V}_{n-1}$ and the new observation $\mathbf{x}_n$, we can evaluate the Gaussian marginal for $\mathbf{z}_n$, having mean $\boldsymbol{\mu}_n$ and covariance $\mathbf{V}_n$, as well as the normalization coefficient $c_n$
$$\boldsymbol{\mu}_n = \mathbf{A}\boldsymbol{\mu}_{n-1} + \mathbf{K}_n(\mathbf{x}_n - \mathbf{C}\mathbf{A}\boldsymbol{\mu}_{n-1})$$
$$\mathbf{V}_n = (\mathbf{I} - \mathbf{K}_n\mathbf{C})\,\mathbf{P}_{n-1}$$
$$c_n = \mathcal{N}(\mathbf{x}_n|\mathbf{C}\mathbf{A}\boldsymbol{\mu}_{n-1}, \mathbf{C}\mathbf{P}_{n-1}\mathbf{C}^T + \boldsymbol{\Sigma})$$
where the Kalman gain matrix is defined as
$$\mathbf{K}_n = \mathbf{P}_{n-1}\mathbf{C}^T(\mathbf{C}\mathbf{P}_{n-1}\mathbf{C}^T + \boldsymbol{\Sigma})^{-1}$$
The initial conditions are given by
$$\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \mathbf{K}_1(\mathbf{x}_1 - \mathbf{C}\boldsymbol{\mu}_0), \qquad \mathbf{V}_1 = (\mathbf{I} - \mathbf{K}_1\mathbf{C})\mathbf{V}_0$$
$$c_1 = \mathcal{N}(\mathbf{x}_1|\mathbf{C}\boldsymbol{\mu}_0, \mathbf{C}\mathbf{V}_0\mathbf{C}^T + \boldsymbol{\Sigma}), \qquad \mathbf{K}_1 = \mathbf{V}_0\mathbf{C}^T(\mathbf{C}\mathbf{V}_0\mathbf{C}^T + \boldsymbol{\Sigma})^{-1}$$
The interpretation is making a prediction and then doing a correction with the gain $\mathbf{K}_n$
The likelihood can be computed as $p(\mathbf{X}) = \prod_{n=1}^N c_n$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
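A direct NumPy transcription of these forward equations; the matrices A, C, Γ, Σ and the observations are toy values of my choosing, and the prior is treated as the time-0 posterior rather than using the special K_1 initialization:

```python
import numpy as np

# Kalman forward pass for z_n = A z_{n-1} + w,  x_n = C z_n + v  (toy 1-D setup)
A = np.array([[0.9]]); Gamma = np.array([[1.0]])    # transition and its noise
C = np.array([[1.0]]); Sigma = np.array([[0.25]])   # emission and its noise
mu = np.array([0.0]);  V = np.array([[2.0]])        # prior mean / covariance

X = [0.8, 1.1, 0.2, -0.4]                            # assumed observations
for x in X:
    P = A @ V @ A.T + Gamma                          # predictive covariance
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma) # Kalman gain
    mu = A @ mu + K @ (x - C @ A @ mu)               # predict + correct
    V = (np.eye(1) - K @ C) @ P
    print(float(mu[0]), float(V[0, 0]))
```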
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world
Nonlinear filtering consists in finding the posterior $p(\mathbf{x}_n|\mathbf{y}_{0:n})$
The number of variables is infinite, but not all of them are of equal importance
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation
Local approach: finite sum approximations (eg Gaussian sum filter), linearization techniques (ie EKF) or numerical approximations (eg particle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
$$\mathbf{x}_{n+1} = \mathbf{f}(\mathbf{x}_n, \mathbf{w}_n)$$
$$\mathbf{y}_n = \mathbf{g}(\mathbf{x}_n, \mathbf{v}_n)$$
These equations are linearized in the EKF as
$$\mathbf{F}_{n+1,n} = \frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}=\hat{\mathbf{x}}_n} \qquad \mathbf{G}_n = \frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}=\hat{\mathbf{x}}_{n|n-1}}$$
Then the conventional Kalman filter can be employed
Because the EKF always approximates the posterior $p(\mathbf{x}_n|\mathbf{y}_{0:n})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (eg heavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
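A sketch of one EKF cycle for an assumed scalar model $x_{n+1} = f(x_n) + w$, $y_n = g(x_n) + v$, with the Jacobians computed in closed form; the specific f, g and noise variances are illustrative choices, not from the slides:

```python
import numpy as np

# Assumed scalar nonlinear model and its derivatives
f  = lambda x: 0.5 * x + 25 * x / (1 + x ** 2)            # transition
fp = lambda x: 0.5 + 25 * (1 - x ** 2) / (1 + x ** 2) ** 2  # df/dx
g  = lambda x: x ** 2 / 20.0                               # measurement
gp = lambda x: x / 10.0                                    # dg/dx
Q, R = 1.0, 1.0                                            # noise variances

def ekf_step(mu, P, y):
    # Predict: propagate the mean through f, the covariance through the Jacobian
    mu_pred = f(mu)
    P_pred = fp(mu) ** 2 * P + Q
    # Update: linearize g around the predicted mean
    G = gp(mu_pred)
    K = P_pred * G / (G ** 2 * P_pred + R)    # scalar Kalman gain
    mu_new = mu_pred + K * (y - g(mu_pred))
    P_new = (1 - K * G) * P_pred
    return mu_new, P_new

print(ekf_step(0.1, 5.0, 2.0))
```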
Numerical approximations
Monte Carlo sampling approximation (ie particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It is a brute force technique, provided that one can draw iid samples $\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(N)}$ from the probability distribution P(x), so that
$$\int_X f(\mathbf{x})\,dP(\mathbf{x}) \approx \frac{1}{N}\sum_{i=1}^N f\big(\mathbf{x}^{(i)}\big) = \bar{f}_N$$
for which $E[\bar{f}_N] = E[f]$ and $\text{Var}[\bar{f}_N] = \frac{1}{N}\text{Var}[f] = \frac{\sigma^2}{N}$
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\bar{f}_N(\mathbf{x})$ converges to $E[f(\mathbf{x})]$ with high probability
The convergence rate is assessed by the Central Limit Theorem
$$\sqrt{N}\big(\bar{f}_N - E[f]\big) \sim \mathcal{N}(0, \sigma^2)$$
where $\sigma^2$ is the variance of f(x). The error rate is of order $O(N^{-1/2})$
An important property is that the estimation accuracy is independent of the dimensionality of the state space
The variance of the estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
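A quick numeric check of the $O(N^{-1/2})$ behavior; the integrand and distribution are arbitrary choices with a known exact answer:

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(x) ** 2         # arbitrary test integrand
# E[f] under N(0,1) is (1 - exp(-2)) / 2 ~ 0.43233
exact = 0.5 * (1 - np.exp(-2.0))

for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    est = f(x).mean()
    print(N, est, abs(est - exact))  # error shrinks roughly like 1/sqrt(N)
```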
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
$$\int_X f(\mathbf{x})\,dP(\mathbf{x}) \approx \frac{1}{N}\sum_{i=1}^N f\big(\mathbf{x}^{(i)}\big) = \bar{f}_N$$
There are two fundamental problems
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function wrt the distribution or density, ie $E[f(\mathbf{x})] = \int f(\mathbf{x})\,dP(\mathbf{x})$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency
This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample
$$\int f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} = \int f(\mathbf{x})\,\frac{p(\mathbf{x})}{q(\mathbf{x})}\,q(\mathbf{x})\,d\mathbf{x}$$
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
$$\bar{f} = \frac{1}{N}\sum_{i=1}^N W(\mathbf{x}^{(i)})\,f(\mathbf{x}^{(i)})$$
where $W(\mathbf{x}^{(i)}) = p(\mathbf{x}^{(i)})/q(\mathbf{x}^{(i)})$ are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
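A minimal numeric sketch of the identity above; the target, proposal, and integrand are assumed examples with a known exact answer:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
# Estimate E_p[f] with p = N(0,1), q = N(0,2), f(x) = x^2 (exact answer: 1)
N = 100_000
x = rng.normal(0.0, 2.0, size=N)            # draw from the proposal q
W = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)   # importance weights p/q
print(np.mean(W * x ** 2))                  # ~ 1.0
```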
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant
To ensure that the importance weights are normalized, we use
$$\bar{f} = \sum_{i=1}^N \tilde{W}(\mathbf{x}^{(i)})\,f(\mathbf{x}^{(i)}) \quad \text{with} \quad \tilde{W}(\mathbf{x}^{(i)}) = \frac{W(\mathbf{x}^{(i)})}{\sum_{j=1}^N W(\mathbf{x}^{(j)})}$$
The variance of the estimate is given by
$$\text{Var}[\bar{f}] = \frac{1}{N}\text{Var}[f(\mathbf{x})\,W(\mathbf{x})] = \frac{1}{N}\text{Var}\Big[f(\mathbf{x})\,\frac{p(\mathbf{x})}{q(\mathbf{x})}\Big] = \frac{1}{N}\int \frac{\big(f(\mathbf{x})\,p(\mathbf{x})\big)^2}{q(\mathbf{x})}\,d\mathbf{x} - \frac{(E[f(\mathbf{x})])^2}{N}$$
The variance can be reduced when q(x) is chosen to
match the shape of p(x), so as to approximate the true variance
match the shape of |f(x)|p(x), so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance)
It can be used when it is difficult to sample from the true probability distribution directly
The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers
If q(·) is not close to p(·), the weights are very uneven, and thus many samples are almost useless because of their negligible contributions
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density
Assume there exists a known constant C < ∞ such that $p(\mathbf{x}) < C\,q(\mathbf{x})$ for every $\mathbf{x} \in X$; the sampling procedure is

for n = 1 to N do
    Sample $\mathbf{x} \sim q(\mathbf{x})$ and $u \sim U(0, 1)$
    if $u > \frac{p(\mathbf{x})}{C\,q(\mathbf{x})}$ then
        Repeat the sampling
    end if
end for

Figure: Importance (left) and rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C
The choice of C is critical:
if C is too small, the samples are not reliable because of the (incorrectly) low rejection rate
if C is too large, the sampling is inefficient since the acceptance rate will be low
If the prior p(x) is used as q(x), and the likelihood satisfies $p(\mathbf{y}|\mathbf{x}) \le C$ with C known, then
$$p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{y})} \le \frac{C\,q(\mathbf{x})}{p(\mathbf{y})} \equiv C'\,q(\mathbf{x})$$
and the acceptance rate for a sample $\mathbf{x}$ is $\frac{p(\mathbf{x}|\mathbf{y})}{C'\,q(\mathbf{x})} = \frac{p(\mathbf{y}|\mathbf{x})}{C}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
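A sketch with an assumed linear target density on [0, 1], bounded by a uniform proposal:

```python
import numpy as np

rng = np.random.default_rng(9)
p = lambda x: 2.0 * x          # target density on [0, 1] (integrates to 1)
C = 2.0                        # bound: p(x) <= C * q(x) with q = U(0, 1)

samples = []
while len(samples) < 10_000:
    x = rng.random()           # draw from the proposal q
    u = rng.random()
    if u <= p(x) / C:          # accept with probability p(x) / (C q(x))
        samples.append(x)

print(np.mean(samples))        # ~ 2/3, the mean of the target p(x) = 2x
```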
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable
It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space
A natural way to alleviate this problem is to construct the proposal distribution sequentially: this is sequential importance sampling
If the proposal distribution is chosen in a factorized form
$$q(\mathbf{x}_{0:n}|\mathbf{y}_{0:n}) = q(\mathbf{x}_0)\prod_{t=1}^n q(\mathbf{x}_t|\mathbf{x}_{0:t-1}, \mathbf{y}_{0:t})$$
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which has been widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)p(x)dx \approx \sum_{k=1}^{m} c_k f(x_k)$$
where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$; a sketch with a Gaussian weighting function follows below.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 46 / 69
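A brief sketch of quadrature with a Gaussian weighting function (my example, using NumPy's Gauss-Hermite rule): $E[f(x)]$ for $x \sim \mathcal{N}(\mu, \sigma^2)$ as a weighted sum of $m$ integrand samples.

```python
# Gauss-Hermite quadrature (probabilists' rule, weight exp(-x^2/2)) for
# E[f(x)] with x ~ N(mu, sigma^2).
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, m=10):
    nodes, weights = np.polynomial.hermite_e.hermegauss(m)
    x_k = mu + sigma * nodes              # quadrature points for N(mu, sigma^2)
    c_k = weights / np.sqrt(2 * np.pi)    # normalized weights, sum to 1
    return np.sum(c_k * f(x_k))

# E[sin(x)] for x ~ N(1, 0.5^2); the exact value is sin(1)*exp(-0.125)
print(gauss_hermite_expectation(np.sin, 1.0, 0.5))
print(np.sin(1.0) * np.exp(-0.125))
```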
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(\mathbf{x}_n|\mathbf{y}_{n:0})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to regions of high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data (see the sketch below).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 47 / 69
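A tiny point-mass sketch (my toy example): a single Bayes update on a fixed rectangular grid, where prior masses are multiplied by the likelihood and renormalized.

```python
# Point-mass (grid) Bayes update: posterior mass is proportional to
# prior mass times likelihood, renormalized over the grid.
import numpy as np

grid = np.linspace(-5.0, 5.0, 201)             # discretized 1-D state space
prior = np.exp(-0.5 * grid**2)                 # predicted density on the grid
prior /= prior.sum()                           # point masses sum to 1
y = 1.2                                        # observation, y = x + N(0, 0.5^2)
lik = np.exp(-0.5 * ((y - grid) / 0.5) ** 2)   # likelihood p(y | x) on the grid
post = prior * lik
post /= post.sum()                             # filtered point masses
print(grid @ post)                             # posterior mean on the grid
```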
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including mean, covariance, and higher-order moments.
We can empirically use the sample moment to approximate the true moment (see the sketch below), namely
$$m_k = E[\mathbf{x}^k] = \int_{\mathcal{X}} \mathbf{x}^k p(\mathbf{x})d\mathbf{x} \approx \frac{1}{N}\sum_{i=1}^{N} |\mathbf{x}^{(i)}|^k$$
where $m_k$ denotes the $k$-th order moment and the $\mathbf{x}^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 48 / 69
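A quick sketch of the sample-moment estimate above (my example): approximating the first four absolute moments of $\mathcal{N}(0,1)$ from $N$ i.i.d. draws.

```python
# Sample moments: E[|x|^k] is approximated by (1/N) sum_i |x^(i)|^k.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100_000)   # samples from the true distribution
for k in (1, 2, 3, 4):
    print(k, np.mean(np.abs(x) ** k))    # e.g. k=2 gives ~1.0
```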
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
$$p(\mathbf{x}) = \sum_{j=1}^{m} c_j\,\mathcal{N}(\mathbf{x}_f^{(j)}, \boldsymbol{\Sigma}_f^{(j)})$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities (see the sketch below).
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 49 / 69
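A minimal sketch of evaluating such a Gaussian-sum density pointwise (my example; assumes SciPy is available for the Gaussian pdf):

```python
# Gaussian-sum density: p(x) = sum_j c_j N(x; mu_j, sigma_j^2),
# with c_j > 0 and sum_j c_j = 1.
import numpy as np
from scipy.stats import norm

def gaussian_sum_pdf(x, c, mu, sigma):
    return sum(cj * norm.pdf(x, mj, sj) for cj, mj, sj in zip(c, mu, sigma))

x = np.linspace(-5.0, 5.0, 5)
print(gaussian_sum_pdf(x, c=[0.3, 0.7], mu=[-1.0, 2.0], sigma=[0.5, 1.0]))
```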
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 50 / 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next:
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 51 / 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 52 / 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
$$p(\mathbf{x}_n|\mathbf{y}_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(\mathbf{x}_n - \mathbf{x}_n^{(i)}) \equiv \hat{p}(\mathbf{x}_n|\mathbf{y}_{n:0})$$
where the $\mathbf{x}_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(\mathbf{x}_n|\mathbf{y}_{n:0})$.
By this approximation, we can estimate the mean of a nonlinear function (see the sketch below):
$$E[f(\mathbf{x}_n)] \approx \int f(\mathbf{x}_n)\hat{p}(\mathbf{x}_n|\mathbf{y}_{n:0})d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^{N}\int f(\mathbf{x}_n)\,\delta(\mathbf{x}_n - \mathbf{x}_n^{(i)})\,d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{x}_n^{(i)}) \equiv \hat{f}_N(\mathbf{x})$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 53 / 69
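A short sketch of this empirical representation (my stand-in example, with Gaussian draws playing the role of posterior samples):

```python
# Monte Carlo estimate of E[f(x_n)] as an unweighted average over samples.
import numpy as np

rng = np.random.default_rng(2)
particles = rng.normal(1.5, 0.7, size=10_000)  # stand-in draws from p(x_n | y_{n:0})
f_hat = np.mean(np.tanh(particles))            # (1/N) sum_i f(x^(i))
print(f_hat)
```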
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(\mathbf{x}_n|\mathbf{y}_{n:0})$. Let's define
$$W_n(\mathbf{x}_n) = \frac{p(\mathbf{y}_{n:0}|\mathbf{x}_n)\,p(\mathbf{x}_n)}{q(\mathbf{x}_n|\mathbf{y}_{n:0})}$$
We can then write (a numerical sketch follows below)
$$E[f(\mathbf{x}_n)] = \int f(\mathbf{x}_n)\frac{p(\mathbf{x}_n|\mathbf{y}_{n:0})}{q(\mathbf{x}_n|\mathbf{y}_{n:0})}\,q(\mathbf{x}_n|\mathbf{y}_{n:0})\,d\mathbf{x}_n = \int f(\mathbf{x}_n)\frac{W_n(\mathbf{x}_n)}{p(\mathbf{y}_{n:0})}\,q(\mathbf{x}_n|\mathbf{y}_{n:0})\,d\mathbf{x}_n$$
$$= \frac{\int f(\mathbf{x}_n)W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{n:0})\,d\mathbf{x}_n}{\int p(\mathbf{y}_{n:0}|\mathbf{x}_n)\,p(\mathbf{x}_n)\,d\mathbf{x}_n} = \frac{\int f(\mathbf{x}_n)W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{n:0})\,d\mathbf{x}_n}{\int W_n(\mathbf{x}_n)\,q(\mathbf{x}_n|\mathbf{y}_{n:0})\,d\mathbf{x}_n} = \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{n:0})}[W_n(\mathbf{x}_n)f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{n:0})}[W_n(\mathbf{x}_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 54 / 69
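A sketch of the resulting self-normalized estimator $E_q[W f]/E_q[W]$ (my example: target $\mathcal{N}(0,1)$ known only up to a constant, proposal $\mathcal{N}(0,2^2)$):

```python
# Self-normalized importance sampling: E[f] is approximated by
# sum_i W_i f_i / sum_i W_i, so unknown normalizing constants cancel.
import numpy as np

rng = np.random.default_rng(3)
N = 50_000
x = rng.normal(0.0, 2.0, size=N)               # draws from the proposal q
log_p = -0.5 * x**2                            # unnormalized log target p(x)
log_q = -0.5 * (x / 2.0) ** 2 - np.log(2.0)    # log q(x), shared constants dropped
W = np.exp(log_p - log_q)                      # unnormalized weights
print(np.sum(W * x**2) / np.sum(W))            # ~= E_p[x^2] = 1
```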
Sequential Monte Carlo estimation III
We have written
$$E[f(\mathbf{x}_n)] = \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{n:0})}[W_n(\mathbf{x}_n)f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{n:0})}[W_n(\mathbf{x}_n)]}$$
By drawing i.i.d. samples $\mathbf{x}_n^{(i)}$ from $q(\mathbf{x}_n|\mathbf{y}_{n:0})$, we can approximate
$$E[f(\mathbf{x}_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(\mathbf{x}_n^{(i)})f(\mathbf{x}_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(\mathbf{x}_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(\mathbf{x}_n^{(i)})f(\mathbf{x}_n^{(i)}) \equiv \hat{f}(\mathbf{x})$$
where the normalized weights are defined as
$$\tilde{W}(\mathbf{x}_n^{(i)}) = \frac{W_n(\mathbf{x}_n^{(i)})}{\sum_{i=1}^{N} W_n(\mathbf{x}_n^{(i)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 55 / 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
$$q(\mathbf{x}_{n:0}|\mathbf{y}_{n:0}) = q(\mathbf{x}_0)\prod_{t=1}^{n} q(\mathbf{x}_t|\mathbf{x}_{t-1:0},\mathbf{y}_{t:0})$$
As before, the posterior can be written as
$$p(\mathbf{x}_{n:0}|\mathbf{y}_{n:0}) = p(\mathbf{x}_{n-1:0}|\mathbf{y}_{n-1:0})\,\frac{p(\mathbf{y}_n|\mathbf{x}_n)\,p(\mathbf{x}_n|\mathbf{x}_{n-1})}{p(\mathbf{y}_n|\mathbf{y}_{n-1:0})}$$
We can then create a recursive rule to update the weights:
$$W_n^{(i)} = \frac{p(\mathbf{x}_{n:0}^{(i)}|\mathbf{y}_{n:0})}{q(\mathbf{x}_{n:0}^{(i)}|\mathbf{y}_{n:0})} \propto \frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})\,p(\mathbf{x}_{n-1:0}^{(i)}|\mathbf{y}_{n-1:0})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_{n:0})\,q(\mathbf{x}_{n-1:0}^{(i)}|\mathbf{y}_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_{n:0})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 56 / 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential importance sampling (SIS) filter
Sampling-importance resampling (SIR) filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 57 / 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(\mathbf{x}_n|\mathbf{y}_{n:0})$ than in $p(\mathbf{x}_{n:0}|\mathbf{y}_{n:0})$.
Let's assume that $q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_{n:0}) = q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_n)$; then we can write
$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_n)}$$
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 58 / 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|\mathbf{y}_{n:0})}[(W(\mathbf{x}_{n:0}))^2]} \le N$$
In practice this cannot be computed, so we approximate it as (see the sketch below)
$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(\mathbf{x}_{n:0}^{(i)}))^2}$$
When $N_{\text{eff}}$ is below a threshold $P$, resampling is performed.
$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 59 / 69
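A sketch of the $N_{\text{eff}}$ diagnostic (my example):

```python
# N_eff ~= 1 / sum_i w_i^2 for normalized weights; small N_eff means degeneracy.
import numpy as np

def effective_sample_size(w):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # ensure the weights are normalized
    return 1.0 / np.sum(w**2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # 4.0: no degeneracy
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))  # ~1.06: degenerate
```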
SIS particle filter with resampling
for $n = 0, \cdots, T$ do
  for $i = 1, \cdots, N$ do
    Draw samples $\mathbf{x}_n^{(i)} \sim q(\mathbf{x}_n|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_{n:0})$
    Set $\mathbf{x}_{n:0}^{(i)} = \{\mathbf{x}_{n-1:0}^{(i)}, \mathbf{x}_n^{(i)}\}$
  end for
  for $i = 1, \cdots, N$ do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_n)}$
  end for
  for $i = 1, \cdots, N$ do
    Normalize the weights: $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)})/\sum_{i=1}^{N}W(\mathbf{x}^{(i)})$
  end for
  Compute $N_{\text{eff}} = 1/\sum_{i=1}^{N}(\tilde{W}(\mathbf{x}_{n:0}^{(i)}))^2$
  if $N_{\text{eff}} < P$ then
    Generate new $\mathbf{x}_{n:0}^{(j)}$ by resampling with replacement $N$ times from $\{\mathbf{x}_{n:0}^{(i)}\}$ with probability $P(\mathbf{x}_{n:0}^{(j)} = \mathbf{x}_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
    Reset the weights: $W_n^{(i)} = \frac{1}{N}$
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 60 / 69
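A runnable sketch of the algorithm above (my toy model, not the lecture's code): a scalar linear-Gaussian system, using the transition prior as the proposal so the weight update reduces to $W_n^{(i)} = W_{n-1}^{(i)}\,p(\mathbf{y}_n|\mathbf{x}_n^{(i)})$, with resampling triggered by $N_{\text{eff}} < P$.

```python
# SIS particle filter with conditional resampling on the toy model
#   x_n = 0.9 x_{n-1} + w_n,  y_n = x_n + v_n,  w, v Gaussian.
import numpy as np

rng = np.random.default_rng(4)
T, N, P_thresh = 50, 500, 250          # time steps, particles, N_eff threshold
sig_w, sig_v = 1.0, 0.5                # process / measurement noise std

# simulate a ground-truth trajectory and noisy observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, sig_w)
y = x_true + rng.normal(0.0, sig_v, size=T)

particles = rng.normal(0.0, 1.0, size=N)   # x_0^(i) ~ p(x_0)
weights = np.full(N, 1.0 / N)
estimates = np.zeros(T)

for n in range(T):
    if n > 0:                              # propagate with the transition prior
        particles = 0.9 * particles + rng.normal(0.0, sig_w, size=N)
    # proposal = prior, so W_n = W_{n-1} * p(y_n | x_n^(i))
    weights *= np.exp(-0.5 * ((y[n] - particles) / sig_v) ** 2)
    weights /= weights.sum()
    estimates[n] = np.sum(weights * particles)    # MMSE estimate of x_n
    if 1.0 / np.sum(weights**2) < P_thresh:       # N_eff below threshold P?
        idx = rng.choice(N, size=N, p=weights)    # resample with replacement
        particles = particles[idx]
        weights.fill(1.0 / N)                     # reset weights to 1/N

print("RMSE:", np.sqrt(np.mean((estimates - x_true) ** 2)))
```

Resampling at every time step, rather than only when $N_{\text{eff}}$ drops, turns this sketch into the SIR/bootstrap filter of the next slides.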
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 61 / 69
SIR filter using transition prior as proposal distribution
for $i = 1, \cdots, N$ do
  Sample $\mathbf{x}_0^{(i)} \sim p(\mathbf{x}_0)$
  Compute $W_0^{(i)} = \frac{1}{N}$
end for
for $n = 0, \cdots, T$ do
  for $i = 1, \cdots, N$ do
    Importance sampling: $\hat{\mathbf{x}}_n^{(i)} \sim p(\mathbf{x}_n|\mathbf{x}_{n-1}^{(i)})$
  end for
  Set $\hat{\mathbf{x}}_{n:0}^{(i)} = \{\mathbf{x}_{n-1:0}^{(i)}, \hat{\mathbf{x}}_n^{(i)}\}$
  for $i = 1, \cdots, N$ do
    Weight update: $W_n^{(i)} = p(\mathbf{y}_n|\hat{\mathbf{x}}_n^{(i)})$
  end for
  for $i = 1, \cdots, N$ do
    Normalize weights: $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)})/\sum_{i=1}^{N}W(\mathbf{x}^{(i)})$
  end for
  Resampling: generate $N$ new particles $\mathbf{x}_n^{(i)}$ from the set $\{\hat{\mathbf{x}}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 62 / 69
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 63 / 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate sample degeneracy in the SIS filter, we can change the update to
$$W_n = W_{n-1}^{\alpha}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})\,p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1:0}^{(i)},\mathbf{y}_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 64 / 69
Popular CONDENSATION
Figure: CONDENSATION.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29, 2010 65 / 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$$p(x_n \mid y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n \mid y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n \mid y_{n:0})$.

By this approximation we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n) \, \hat{p}(x_n \mid y_{n:0}) \, dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \, \delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \bar{f}_N(x)$$
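In code this estimate is just a sample average; a minimal sketch with stand-in posterior samples and an arbitrary test function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for i.i.d. samples x_n^(i) from the posterior p(x_n | y_{n:0})
samples = rng.normal(loc=1.0, scale=2.0, size=10_000)

f = np.tanh                       # any nonlinear function of interest
f_estimate = np.mean(f(samples))  # Monte Carlo estimate of E[f(x_n)]
```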
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n \mid y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0} \mid x_n) \, p(x_n)}{q(x_n \mid y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n) \, \frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})} \, q(x_n \mid y_{n:0}) \, dx_n = \int f(x_n) \, \frac{W_n(x_n)}{p(y_{n:0})} \, q(x_n \mid y_{n:0}) \, dx_n$$

$$= \frac{\int f(x_n) \, W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n}{\int p(y_{n:0} \mid x_n) \, p(x_n) \, dx_n} = \frac{\int f(x_n) \, W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n}{\int W_n(x_n) \, q(x_n \mid y_{n:0}) \, dx_n} = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$
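The final ratio is what a self-normalized importance sampling estimator computes. A small NumPy sketch, with an illustrative standard-normal target (known only up to a constant) and a wider normal proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Proposal q = N(0, 3^2); unnormalized target p proportional to N(0, 1)
x = rng.normal(0.0, 3.0, size=N)
q_pdf = np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))
p_unnorm = np.exp(-0.5 * x ** 2)

w = p_unnorm / q_pdf                  # importance weights W(x^(i))
f = x ** 2                            # test function f(x) = x^2
estimate = np.sum(w * f) / np.sum(w)  # E_q[W f] / E_q[W], approx E_p[x^2] = 1
```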
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n \mid y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n \mid y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{n:0})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$$q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0}) \, \frac{p(y_n \mid x_n) \, p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})} \propto \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} \mid y_{n-1:0})} = W_{n-1}^{(i)} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}$$
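A minimal sketch of this one-step weight update, assuming the three densities are supplied as callables (all names here are illustrative, not from the lecture):

```python
import numpy as np

def update_weights(w_prev, x_prev, x_new, y_new, lik, trans, prop):
    # w_prev: (N,) previous weights W_{n-1}^{(i)}
    # lik(y, x):      observation likelihood p(y_n | x_n)
    # trans(x, xp):   transition density p(x_n | x_{n-1})
    # prop(x, xp, y): proposal density q(x_n | x_{n-1}, y_n)
    w = w_prev * lik(y_new, x_new) * trans(x_new, x_prev) / prop(x_new, x_prev, y_new)
    return w / np.sum(w)  # return normalized weights
```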
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n \mid y_{n:0})$ than in $p(x_{n:0} \mid y_{n:0})$.

Let's assume that $q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot \mid y_{n:0})}[(W(x_{n:0}))^2]} \le N$$

In practice this cannot be computed, so we approximate it by

$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When $N_{\text{eff}}$ falls below a threshold $P$, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
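A short sketch of the approximate effective sample size and the resampling trigger (variable names are mine):

```python
import numpy as np

def effective_sample_size(weights):
    # weights: (N,) normalized importance weights, summing to 1
    return 1.0 / np.sum(weights ** 2)

def needs_resampling(weights, threshold):
    # Trigger resampling when N_eff drops below the threshold P
    return effective_sample_size(weights) < threshold
```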
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples $x_n^{(i)} \sim q(x_n \mid x_{n-1:0}^{(i)}, y_{n:0})$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate weights $W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$
    end for
    for i = 1, ..., N do
        Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Compute $N_{\text{eff}} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2$
    if $N_{\text{eff}} < P$ then
        Generate new $x_n^{(j)}$ by resampling with replacement $N$ times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}$
        Reset the weights $W_n^{(i)} = 1/N$
    end if
end for
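The resampling-with-replacement step can be sketched with NumPy's multinomial draw (one simple variant; systematic or stratified resampling are common alternatives):

```python
import numpy as np

def multinomial_resample(particles, weights, rng=np.random.default_rng()):
    # particles: (N,) array; weights: (N,) normalized importance weights.
    # Draw N indices with replacement, proportional to the weights,
    # then reset all weights to 1/N.
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)
```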
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates of themselves, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n \mid x_{n-1}^{(i)})$
    end for
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n \mid x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
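Putting the pieces together, a compact bootstrap filter on a toy scalar model (the model and all parameters are invented for illustration; with the transition prior as proposal, the weight update reduces to the likelihood, as in the algorithm above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 500, 50

# Toy state-space model (illustrative): x_n = 0.9 x_{n-1} + w,  y_n = x_n + v
def transition(x):
    return 0.9 * x + rng.normal(0.0, 1.0, size=x.shape)

def likelihood(y, x, r=0.5):
    return np.exp(-0.5 * (y - x) ** 2 / r ** 2) / (np.sqrt(2.0 * np.pi) * r)

# Simulate some observations from the model
x_true, ys = 0.0, []
for _ in range(T):
    x_true = 0.9 * x_true + rng.normal(0.0, 1.0)
    ys.append(x_true + rng.normal(0.0, 0.5))

particles = rng.normal(0.0, 1.0, size=N)  # samples from p(x_0)
estimates = []
for y in ys:
    particles = transition(particles)               # sample from transition prior
    w = likelihood(y, particles)                    # weight update W = p(y | x)
    w /= np.sum(w)                                  # normalize
    estimates.append(np.sum(w * particles))         # MMSE estimate before resampling
    idx = rng.choice(N, size=N, replace=True, p=w)  # resample every step (SIR)
    particles = particles[idx]
```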
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
N_{\text{eff}} = \frac{N}{E_{q(\cdot \mid y_{0:n})}[(W(x_{0:n}))^2]} \le N
In practice this cannot be computed so we approximate
N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}
When N_{\text{eff}} is below a threshold P, resampling is performed
N_{\text{eff}} can also be used to combine rejection and importance sampling
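A short sketch of this approximation, assuming `w` holds the normalized importance weights:

```python
import numpy as np

def effective_sample_size(w):
    # N_eff is approximated by 1 / sum_i w_i^2 for normalized weights w
    return 1.0 / np.sum(w**2)

# uniform weights give N_eff = N; one dominant weight drives N_eff toward 1
print(effective_sample_size(np.full(100, 0.01)))                   # 100.0
print(effective_sample_size(np.array([0.99] + [0.01 / 99] * 99)))  # ~1.02
```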
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^{(i)} ~ q(x_n \mid x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \, p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)}) / q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_{\text{eff}} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2
    if N_{\text{eff}} < P then
        Generate new x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
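A sketch of this loop in Python for a generic model supplied as callables; the function names, the initialization from the prior, and the choice of the posterior mean as the reported estimate are assumptions filling in details the pseudocode leaves implicit:

```python
import numpy as np

rng = np.random.default_rng(1)

def sis_filter(y, N, prior_sample, q_sample, q_pdf, trans_pdf, lik_pdf, P):
    # y: observations y_0..y_T; N: number of particles
    # prior_sample(N): N draws from p(x_0)
    # q_sample(xp, yn): draws x_n ~ q(. | x_{n-1}, y_n) for each particle in xp
    # q_pdf(x, xp, yn), trans_pdf(x, xp), lik_pdf(yn, x): pointwise densities
    # P: resampling threshold on N_eff
    x = prior_sample(N)                         # initialize from the prior
    W = np.full(N, 1.0 / N)
    means = []
    for n in range(len(y)):
        xp = x
        x = q_sample(xp, y[n])                  # draw x_n^(i) from the proposal
        W = W * lik_pdf(y[n], x) * trans_pdf(x, xp) / q_pdf(x, xp, y[n])
        W_tilde = W / W.sum()                   # normalize the weights
        means.append(np.sum(W_tilde * x))       # estimate before resampling
        if 1.0 / np.sum(W_tilde**2) < P:        # N_eff below the threshold?
            x = x[rng.choice(N, size=N, p=W_tilde)]  # resample with replacement
            W = np.full(N, 1.0 / N)             # reset the weights to 1/N
        else:
            W = W_tilde
    return np.array(means)
```

With the transition prior as proposal, the weight update collapses to the likelihood alone, which gives the bootstrap/SIR filter of the next slides.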
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0) and compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: draw \hat{x}_n^{(i)} ~ p(x_n \mid x_{n-1}^{(i)})
    end for
    Set \hat{x}_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n \mid \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
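The same loop specialized to the transition-prior proposal, on an assumed 1-D random-walk model (the model and its noise levels are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed toy model: x_n = x_{n-1} + N(0, 1),  y_n = x_n + N(0, 0.5^2)
T, N = 50, 500
x_true = np.cumsum(rng.normal(0.0, 1.0, T))
y = x_true + rng.normal(0.0, 0.5, T)

x = rng.normal(0.0, 1.0, N)                   # sample x_0^(i) ~ p(x_0)
est = []
for n in range(T):
    x = x + rng.normal(0.0, 1.0, N)           # importance sampling from p(x_n | x_{n-1})
    W = np.exp(-0.5 * ((y[n] - x) / 0.5)**2)  # weight update: p(y_n | x_n) up to a constant
    W /= W.sum()                              # normalize the weights
    est.append(np.sum(W * x))                 # posterior mean, computed before resampling
    x = x[rng.choice(N, size=N, p=W)]         # resample N new particles by the weights
print(np.mean((np.array(est) - x_true)**2))   # filtering mean-squared error
```

Note that the estimate is taken before the resampling step, in line with the remark two slides below that resampling introduces extra Monte Carlo variation.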
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha} \, \frac{p(y_n \mid x_n^{(i)}) \, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}
where 0 < \alpha < 1 is the annealing factor that controls the impact of the previous importance weights
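Reusing the (assumed) names from the SIS sketch earlier, the annealed update is a one-line change to the weight computation; the value of alpha is an arbitrary illustration:

```python
alpha = 0.8  # annealing factor, 0 < alpha < 1 (assumed value)
W = (W ** alpha) * lik_pdf(y[n], x) * trans_pdf(x, xp) / q_pdf(x, xp, y[n])
```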
Popular CONDENSATION
Figure: The CONDENSATION algorithm
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
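These three terms can be exercised directly once the state is discretized. Below is a minimal sketch of one recursive update (predict with the prior integral, correct with the likelihood, normalize by the evidence) on a 1-D grid; the densities `transition_pdf` and `likelihood` are illustrative stand-ins, not part of the lecture.

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 201)          # discretized state space
dx = grid[1] - grid[0]

def transition_pdf(x_next, x_prev):
    # p(x_n | x_{n-1}): random walk with unit process noise (assumed model)
    return np.exp(-0.5 * (x_next - x_prev) ** 2) / np.sqrt(2 * np.pi)

def likelihood(y, x):
    # p(y_n | x_n): observation = state + Gaussian noise (assumed model)
    return np.exp(-0.5 * (y - x) ** 2 / 0.5**2) / (0.5 * np.sqrt(2 * np.pi))

posterior = np.exp(-0.5 * grid**2)          # p(x_0): standard normal, unnormalized
posterior /= posterior.sum() * dx

def bayes_update(posterior, y):
    # Prior (prediction): p(x_n|y_{n-1:0}) = ∫ p(x_n|x_{n-1}) p(x_{n-1}|y_{n-1:0}) dx_{n-1}
    prior = transition_pdf(grid[:, None], grid[None, :]) @ posterior * dx
    # Likelihood times prior, normalized by the evidence p(y_n|y_{n-1:0})
    unnorm = likelihood(y, grid) * prior
    evidence = unnorm.sum() * dx
    return unnorm / evidence, evidence

posterior, evidence = bayes_update(posterior, y=1.3)
print("posterior mean:", (grid * posterior).sum() * dx)
```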
Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of prediction or filtering error,

\[
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n \mid y_{n:0})\, dx_n
\]

which is aimed at finding the conditional mean

\[
\hat{x}_n = E[x_n \mid y_{n:0}] = \int x_n\, p(x_n \mid y_{n:0})\, dx_n
\]

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n | y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n | y_{n:0}).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II

MMSE finds the mean.
MAP finds the mode.
Minimax finds the median.

Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution. (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
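The three point estimates are easy to compare empirically. The sketch below draws samples from a skewed distribution (a log-normal, chosen here purely for illustration) and reports the MMSE, MAP, and minimax estimates as the sample mean, histogram mode, and median.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)  # skewed toy posterior

mmse = samples.mean()            # MMSE estimate: posterior mean
median = np.median(samples)      # minimax estimate: posterior median
counts, edges = np.histogram(samples, bins=200)
mode = 0.5 * (edges[:-1] + edges[1:])[counts.argmax()]  # MAP estimate: histogram mode

print(f"mean={mmse:.3f}  median={median:.3f}  mode={mode:.3f}")
# For a right-skewed density: mode < median < mean, so the criteria disagree.
```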
Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

\[
E_{p(x,y)}[-\log \hat{p}(x \mid y)] = \int p(x, y) \log \frac{1}{\hat{p}(x \mid y)}\, dx\, dy
\]

Minimum conditional KL divergence:

\[
KL(p \| \hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x \mid y)\, p(x)}\, dx\, dy
\]

where the KL is a measure of divergence between distributions, with KL(p||p̂) ≥ 0 and KL(p||p̂) = 0 only when the distributions are the same.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: a lower bound of the maximum log-likelihood, which is aimed to minimize

\[
\begin{aligned}
F(Q; P) &\equiv E_{Q(x)}[-\log P(x \mid y)]\\
&= E_{Q(x)}\Big[\log \frac{Q(x)}{P(x \mid y)}\Big] - E_{Q(x)}[\log Q(x)]\\
&= KL(Q \| P) - H(Q)
\end{aligned}
\]

This minimization can be done with an EM-type algorithm, alternating

\[
Q(x_{n+1}) \leftarrow \arg\max_Q F(Q, P), \qquad
x_{n+1} \leftarrow \arg\max_x F(Q, P)
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood, but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML-oriented, and they are closely related.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

\[
E[\|x_n - \hat{x}_n\|_2^2 \mid y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n \mid y_{n:0})\, dx_n
\]

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited

In practice we are interested in the discrete, simplified case

\[
x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)
\]

When the dynamic system is linear Gaussian, this reduces to

\[
x_{n+1} = F_{n+1,n}\, x_n + w_n, \qquad y_n = G_n\, x_n + v_n
\]

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter: Forward equations I

We start by defining the messages

\[
\alpha(z_n) = \mathcal{N}(z_n \mid \mu_n, V_n)
\]

Using the HMM recursion formulas for continuous variables, we have

\[
c_n\, \alpha(z_n) = p(x_n \mid z_n) \int \alpha(z_{n-1})\, p(z_n \mid z_{n-1})\, dz_{n-1}
\]

Substituting the conditionals, we have

\[
\begin{aligned}
c_n\, \mathcal{N}(z_n \mid \mu_n, V_n) &= \mathcal{N}(x_n \mid C z_n, \Sigma) \int \mathcal{N}(z_{n-1} \mid \mu_{n-1}, V_{n-1})\, \mathcal{N}(z_n \mid A z_{n-1}, \Gamma)\, dz_{n-1}\\
&= \mathcal{N}(x_n \mid C z_n, \Sigma)\, \mathcal{N}(z_n \mid A \mu_{n-1}, P_{n-1})
\end{aligned}
\]

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

\[
P_{n-1} = A V_{n-1} A^T + \Gamma
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\[
\begin{aligned}
\mu_n &= A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})\\
V_n &= (I - K_n C)\, P_{n-1}\\
c_n &= \mathcal{N}(x_n \mid C A \mu_{n-1},\; C P_{n-1} C^T + \Sigma)
\end{aligned}
\]

where the Kalman gain matrix is defined as

\[
K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}
\]

The initial conditions are given by

\[
\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0), \qquad V_1 = (I - K_1 C) V_0
\]
\[
c_1 = \mathcal{N}(x_1 \mid C \mu_0,\; C V_0 C^T + \Sigma), \qquad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}
\]

Interpretation: making predictions and doing corrections with K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
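A direct transcription of these forward equations is below: a minimal sketch following the slide's notation (A, C, Γ, Σ as the transition, emission, process-noise and measurement-noise matrices). The toy 1-D model and its numbers are assumptions for the demo, not from the lecture.

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward (filtering) pass following the slide's notation.
    X: (T, d_obs) observations; returns filtered means and covariances."""
    d = mu0.shape[0]
    mus, Vs = [], []
    # Initial conditions: K1 = V0 C^T (C V0 C^T + Sigma)^{-1}
    K = V0 @ C.T @ np.linalg.inv(C @ V0 @ C.T + Sigma)
    mu = mu0 + K @ (X[0] - C @ mu0)
    V = (np.eye(d) - K @ C) @ V0
    mus.append(mu); Vs.append(V)
    for x in X[1:]:
        P = A @ V @ A.T + Gamma                           # P_{n-1}
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)  # Kalman gain K_n
        mu = A @ mu + K @ (x - C @ A @ mu)                # predict + correct
        V = (np.eye(d) - K @ C) @ P
        mus.append(mu); Vs.append(V)
    return np.array(mus), np.array(Vs)

# Toy 1-D tracking example (all numbers are assumptions for the demo)
A = np.array([[1.0]]); C = np.array([[1.0]])
Gamma = np.array([[0.1]]); Sigma = np.array([[0.5]])
X = np.cumsum(np.random.default_rng(1).normal(0, 0.3, size=(50, 1)), axis=0)
mus, Vs = kalman_forward(X, A, C, Gamma, Sigma, mu0=np.zeros(1), V0=np.eye(1))
```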
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n | y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)

Recall the equations of motion

\[
x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)
\]

These equations are linearized in the EKF:

\[
F_{n+1,n} = \frac{d f(x)}{dx}\bigg|_{x = \hat{x}_n}, \qquad
G_{n+1,n} = \frac{d g(x)}{dx}\bigg|_{x = \hat{x}_{n|n-1}}
\]

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
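To make the linearization step concrete, here is a minimal sketch of one EKF predict/update cycle with finite-difference Jacobians; the scalar model f, g and the noise levels are illustrative assumptions, not from the lecture.

```python
import numpy as np

def jacobian(fn, x, eps=1e-6):
    """Finite-difference Jacobian of fn at x (numerical stand-in for df/dx)."""
    x = np.atleast_1d(x).astype(float)
    J = np.zeros((fn(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (fn(x + dx) - fn(x - dx)) / (2 * eps)
    return J

# Assumed toy model: mildly nonlinear dynamics and measurement
f = lambda x: np.atleast_1d(0.9 * x + 0.1 * np.sin(x))   # state transition
g = lambda x: np.atleast_1d(x**2 / 2.0)                   # measurement
Q, R = np.array([[0.01]]), np.array([[0.1]])              # noise covariances

def ekf_step(mu, P, y):
    F = jacobian(f, mu)                        # F_{n+1,n} at x = mu
    mu_pred = f(mu); P_pred = F @ P @ F.T + Q  # predict
    G = jacobian(g, mu_pred)                   # G at x = x_hat_{n|n-1}
    K = P_pred @ G.T @ np.linalg.inv(G @ P_pred @ G.T + R)
    mu_new = mu_pred + K @ (y - g(mu_pred))    # correct
    P_new = (np.eye(mu.size) - K @ G) @ P_pred
    return mu_new, P_new

mu, P = ekf_step(np.array([1.0]), np.eye(1), y=np.array([0.6]))
```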
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling

It's a brute-force technique, which requires that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), so that

\[
\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N
\]

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

\[
\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)
\]

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
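The O(N^{-1/2}) rate is easy to verify empirically. This sketch estimates E[f(x)] for f(x) = x² under a standard normal (true value 1) and prints how the error shrinks as N grows; the integrand is an arbitrary choice for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2            # arbitrary test integrand; E[f] = 1 under N(0,1)

for N in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)
    f_hat = f(x).mean()       # Monte Carlo estimate (1/N) sum f(x_i)
    print(f"N={N:>9}  estimate={f_hat:.4f}  abs error={abs(f_hat - 1.0):.4f}")
# Each 100x increase in N shrinks the error by about 10x: O(N^{-1/2}).
```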
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\[
\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N
\]

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\, dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.

Minimal variance.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\[
\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
\]

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\[
\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)})
\]

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
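Here is a minimal sketch of the idea: estimating the mean of a target density p via samples from a broader proposal q. Both densities are arbitrary illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p: N(2, 0.5^2). Proposal q: heavier-tailed N(0, 3^2). (Assumed for the demo.)
p = lambda x: normal_pdf(x, 2.0, 0.5)
q = lambda x: normal_pdf(x, 0.0, 3.0)

x = rng.normal(0.0, 3.0, size=N)   # draw from the proposal q
W = p(x) / q(x)                    # importance weights W(x) = p(x)/q(x)

f = lambda x: x                    # estimate E_p[x]; true value is 2
print("unnormalized IS:   ", np.mean(W * f(x)))
print("self-normalized IS:", np.sum(W * f(x)) / np.sum(W))  # next slide's form
```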
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

\[
\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)}) \quad \text{with} \quad
\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}
\]

The variance of the estimate is given by

\[
\mathrm{Var}[\hat{f}] = \frac{1}{N} \mathrm{Var}[f(x) W(x)]
= \frac{1}{N} \mathrm{Var}\Big[f(x) \frac{p(x)}{q(x)}\Big]
= \frac{1}{N} \int \frac{(f(x)\, p(x))^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}
\]

The variance can be reduced when q(x) is chosen to:

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X. The sampling procedure:

for n = 1 to N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate (the bound p(x) < C q(x) may fail);
if C is too large, sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

\[
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \le \frac{C\, q(x)}{p(y)} \equiv C' q(x)
\]

and the acceptance rate for sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
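A minimal sketch of the accept/reject loop, using a standard normal target and a uniform proposal on a truncated support; the target, proposal and bound C are assumptions chosen so that p(x) < C q(x) holds on the support.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target p: standard normal restricted to [-5, 5].
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
# Assumed proposal q: uniform on [-5, 5], so q(x) = 0.1.
q_pdf = 1.0 / 10.0
C = 4.1   # ensures p(x) < C q(x) = 0.41 everywhere (max of p is ~0.399)

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = rng.uniform(-5, 5)
        u = rng.uniform(0, 1)
        if u <= p(x) / (C * q_pdf):   # accept with probability p(x)/(C q(x))
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)
print("sample mean/std:", samples.mean(), samples.std())  # ~0 and ~1
# Expected acceptance rate is about 1/C, here roughly 24%.
```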
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

\[
q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t \mid x_{t-1:0}, y_{t:0})
\]

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II

According to the telescope law of probability, we have

\[
\begin{aligned}
p(x_{n:0}) &= p(x_0)\, p(x_1 \mid x_0) \cdots p(x_n \mid x_0, \dots, x_{n-1})\\
q(x_{n:0}) &= q_0(x_0)\, q_1(x_1 \mid x_0) \cdots q_n(x_n \mid x_0, \dots, x_{n-1})
\end{aligned}
\]

The weights can be recursively calculated as

\[
W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})}
= W_{n-1}(x_{n-1:0})\, \frac{p(x_n \mid x_{n-1:0})}{q_n(x_n \mid x_{n-1:0})}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to W̃(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
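The resampling step itself is a one-liner once the normalized weights are available. A minimal multinomial-resampling sketch (systematic or stratified variants are common drop-in replacements):

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N indices with probability proportional to the weights,
    then reset the weights to uniform (1/N)."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(0)
particles = rng.normal(size=1000)
weights = rng.random(1000)          # stand-in importance weights for the demo
particles, weights = multinomial_resample(particles, weights, rng)
```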
Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample from x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
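A minimal sketch for a case where the full conditionals are known in closed form: a zero-mean bivariate Gaussian with correlation ρ (a standard textbook example, chosen here for illustration), where each conditional is univariate Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                       # assumed correlation of the bivariate Gaussian
M = 20_000

x1, x2 = 0.0, 0.0               # arbitrary starting state
samples = np.empty((M, 2))
for n in range(M):
    # Alternating conditional sampling: each conditional is N(rho * other, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[n] = (x1, x2)

print("empirical correlation:", np.corrcoef(samples[1000:].T)[0, 1])  # ~0.8
```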
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in a not-too-high dimensional space.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior.

This is good since the gradient direction might indicate the way to find the states with higher probability.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\, dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\[
\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, \big| -\nabla\nabla \log f(\hat{x}) \big|^{-1/2}
\]

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
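A minimal 1-D sketch of this formula: approximate ∫ f(x) dx by fitting a Gaussian at the mode, with the curvature of log f supplied by a finite-difference second derivative. The integrand is an arbitrary unimodal choice for the demo.

```python
import numpy as np

# Arbitrary unimodal, unnormalized integrand for the demo
f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3**2) * (2.0 + np.sin(x))

# Locate the mode by a coarse grid search (good enough for a 1-D sketch)
grid = np.linspace(-3, 5, 20_001)
x_hat = grid[np.argmax(f(grid))]

# Second derivative of log f at the mode, by central finite differences
h = 1e-4
log_f = lambda x: np.log(f(x))
d2 = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# (2*pi)^{1/2} f(x_hat) |-d2 log f(x_hat)|^{-1/2} for N_x = 1
laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5
exact = np.sum(f(grid)) * (grid[1] - grid[0])   # Riemann sum as ground truth
print(f"Laplace: {laplace:.4f}   numeric: {exact:.4f}")
```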
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\[
\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k\, f(x_k)
\]

where p(x) is treated as a weighting function, and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

\[
m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k
\]

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in a high-dimensional space.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

\[
p(x) = \sum_{j=1}^m c_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)
\]

where the weighting coefficients c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen?

We have seen up to now:

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

\[
p(x_n \mid y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n \mid y_{n:0})
\]

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

\[
\begin{aligned}
E[f(x_n)] &\approx \int f(x_n)\, \hat{p}(x_n \mid y_{n:0})\, dx_n\\
&= \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n\\
&= \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
\end{aligned}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution q(x_n | y_{n:0}). Let's define

\[
W_n(x_n) = \frac{p(y_{n:0} \mid x_n)\, p(x_n)}{q(x_n \mid y_{n:0})}
\]

We can then write

\[
\begin{aligned}
E[f(x_n)] &= \int f(x_n)\, \frac{p(x_n \mid y_{n:0})}{q(x_n \mid y_{n:0})}\, q(x_n \mid y_{n:0})\, dx_n\\
&= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{n:0})}\, q(x_n \mid y_{n:0})\, dx_n\\
&= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n}{\int p(y_{n:0} \mid x_n)\, p(x_n)\, dx_n}\\
&= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n}{\int W_n(x_n)\, q(x_n \mid y_{n:0})\, dx_n}\\
&= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
\end{aligned}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III

We have written

\[
E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
\]

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

\[
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
= \sum_{i=1}^N \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)
\]

where the normalized weights are defined as

\[
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

\[
q(x_{n:0} \mid y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t \mid x_{t-1:0}, y_{t:0})
\]

As before, the posterior can be written as

\[
p(x_{n:0} \mid y_{n:0}) = p(x_{n-1:0} \mid y_{n-1:0})\, \frac{p(y_n \mid x_n)\, p(x_n \mid x_{n-1})}{p(y_n \mid y_{n-1:0})}
\]

We can then create a recursive rule to update the weights:

\[
\begin{aligned}
W_n^{(i)} &= \frac{p(x_{n:0}^{(i)} \mid y_{n:0})}{q(x_{n:0}^{(i)} \mid y_{n:0})}
\propto \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})\, p(x_{n-1:0}^{(i)} \mid y_{n-1:0})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})\, q(x_{n-1:0}^{(i)} \mid y_{n-1:0})}\\
&= W_{n-1}^{(i)}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_{n:0})}
\end{aligned}
\]

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

\[
W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
\]

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

\[
N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N
\]

In practice this cannot be computed, so we approximate it as

\[
N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}))^2}
\]

When N_{eff} is below a threshold P, then resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
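The approximation is one line of code over the normalized weights; a quick sketch of the resample-when-needed test (the N/2 threshold is a common heuristic, assumed here):

```python
import numpy as np

def effective_sample_size(norm_weights):
    """N_eff ≈ 1 / sum_i W_i^2, computed from normalized weights."""
    return 1.0 / np.sum(norm_weights ** 2)

w = np.array([0.97, 0.01, 0.01, 0.01])          # skewed weights: degeneracy
print(effective_sample_size(w))                  # ~1.06: nearly one useful particle
print(effective_sample_size(np.full(4, 0.25)))   # 4.0: all particles useful

N, threshold = 4, 2.0    # e.g. resample when N_eff < N/2
if effective_sample_size(w) < threshold:
    pass  # trigger the resampling step here
```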
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{n:0}))^2
    if N_eff < P then
        Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W̃_{n:0}^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high important weights with many replicates of particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^N W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to W̃_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
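A runnable sketch of this algorithm on a toy scalar state-space model; the dynamics, measurement model and noise levels are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 50

# Assumed toy model: x_{n+1} = 0.9 x_n + w_n,  y_n = x_n + v_n
def propagate(x):      # sample from the transition prior p(x_n | x_{n-1})
    return 0.9 * x + rng.normal(0.0, 0.3, size=x.shape)

def likelihood(y, x):  # p(y_n | x_n), Gaussian measurement noise
    return np.exp(-0.5 * (y - x) ** 2 / 0.5**2)

# Simulate a ground-truth trajectory and observations for the demo
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0, 0.3)
y = x_true + rng.normal(0, 0.5, size=T)

particles = rng.normal(0.0, 1.0, size=N)    # x_0^{(i)} ~ p(x_0)
estimates = []
for n in range(T):
    particles = propagate(particles)         # importance sampling from the prior
    W = likelihood(y[n], particles)          # weight update: W_n = p(y_n | x_n)
    W /= W.sum()                             # normalize
    estimates.append(np.sum(W * particles))  # MMSE estimate before resampling
    idx = rng.choice(N, size=N, p=W)         # resample according to the weights
    particles = particles[idx]

print("RMSE:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))
```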
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

\[
W_n = (W_{n-1})^{\alpha}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1:0}^{(i)}, y_n)}
\]

where 0 < α < 1 is the annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, with mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = \mathcal{N}(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0),   V_1 = (I - K_1 C) V_0
c_1 = \mathcal{N}(x_1 | C \mu_0, C V_0 C^T + \Sigma),   K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

Interpretation: the filter makes a prediction and corrects it with the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
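To make the recursion concrete, here is a minimal sketch of the forward pass in Python, following the notation above (A, C, Gamma, Sigma for the transition, emission, and noise models); the interface and variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward pass: filtered means/covariances and the log-likelihood sum(log c_n)."""
    mu, V = mu0, V0
    mus, Vs, loglik = [], [], 0.0
    for n, x in enumerate(X):
        # prediction: P_{n-1} = A V_{n-1} A^T + Gamma (the prior covariance at n = 0)
        P = V0 if n == 0 else A @ V @ A.T + Gamma
        pred = mu0 if n == 0 else A @ mu
        S = C @ P @ C.T + Sigma                    # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)             # Kalman gain K_n
        mu = pred + K @ (x - C @ pred)             # correction of the prediction
        V = (np.eye(len(mu)) - K @ C) @ P
        r = x - C @ pred                           # log c_n = log N(x_n | C pred, S)
        loglik += -0.5 * (r @ np.linalg.solve(S, r)
                          + np.log(np.linalg.det(S)) + len(x) * np.log(2 * np.pi))
        mus.append(mu); Vs.append(V)
    return mus, Vs, loglik
```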
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x | y_{0:n}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite-sum approximation (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \left. \frac{\partial f(x)}{\partial x} \right|_{x = \hat{x}_n}
G_{n+1,n} = \left. \frac{\partial g(x)}{\partial x} \right|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
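A minimal sketch of a single EKF predict/correct step, assuming additive noise and user-supplied functions f, g with Jacobians jac_f, jac_g (all of these names are assumptions for illustration):

```python
import numpy as np

def ekf_step(mu, V, y, f, g, jac_f, jac_g, Q, R):
    """One EKF step for x' = f(x) + w, y = g(x) + v with w ~ N(0,Q), v ~ N(0,R)."""
    F = jac_f(mu)                          # linearize the dynamics at the estimate
    mu_pred = f(mu)
    P = F @ V @ F.T + Q                    # predicted covariance
    G = jac_g(mu_pred)                     # linearize the measurement at the prediction
    S = G @ P @ G.T + R
    K = P @ G.T @ np.linalg.inv(S)         # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))
    V_new = (np.eye(len(mu)) - K @ G) @ P
    return mu_new, V_new
```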
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N} (\hat{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
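A tiny sketch of this O(N^{-1/2}) behaviour on a toy integral, estimating E[x^2] = 1 for x ~ N(0, 1) (the integrand is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                      # E[f] = 1 under the standard Gaussian
for N in (100, 10_000, 1_000_000):
    x = rng.standard_normal(N)          # i.i.d. samples from P(x)
    est = f(x).mean()                   # f_N = (1/N) sum_i f(x_i)
    print(N, est, abs(est - 1.0))       # error shrinks roughly like N**-0.5
```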
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = \int f(x) \, dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This matters in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
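A minimal self-normalized importance sampling sketch, assuming a toy target p = N(0, 1) and a heavier-tailed proposal q = N(0, 2^2) (both choices are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(0.0, 2.0, size=N)             # samples from the proposal q
w = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)    # importance weights W = p/q
w_norm = w / w.sum()                         # normalized weights
print(np.sum(w_norm * x**2))                 # estimate of E_p[x^2], close to 1
```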
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights sum to one, we normalize:

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)}),   with   \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \left( \frac{f(x) p(x)}{q(x)} \right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:

match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling proceeds as:

for n = 1 to N do
    Sample x ~ q(x)
    Sample u ~ U(0, 1)
    if u > p(x) / (C q(x)) then
        Repeat sampling
    end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable, because the bound p(x) < C q(x) may no longer hold;
if C is too large, the sampling is inefficient, since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
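A small sketch of rejection sampling for an unnormalized 1-D target with a uniform proposal on [-4, 4] (the target and the bound are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p_tilde = lambda x: np.exp(-x**2 / 2) * (1 + np.sin(3 * x)**2)  # unnormalized target, <= 2
q_pdf = 1.0 / 8.0                     # density of the uniform proposal on [-4, 4]
C = 2.0 / q_pdf                       # chosen so that p_tilde(x) <= C * q_pdf

samples = []
while len(samples) < 10_000:
    x = rng.uniform(-4.0, 4.0)        # sample from the proposal q
    u = rng.uniform()                 # sample u ~ U(0, 1)
    if u <= p_tilde(x) / (C * q_pdf): # accept with probability p/(C q)
        samples.append(x)
```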
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{0:n}) = p(x_0) p(x_1 | x_0) \cdots p(x_n | x_0, ..., x_{n-1})
q(x_{0:n}) = q_0(x_0) q_1(x_1 | x_0) \cdots q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1}) \frac{p(x_n | x_{0:n-1})}{q_n(x_n | x_{0:n-1})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
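A sketch of the multinomial resampling step (one simple scheme among the several mentioned below; the function name and interface are illustrative assumptions):

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Draw N particles with replacement, with probability proportional to weight."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    return particles[idx], np.full(N, 1.0 / N)   # surviving weights reset to 1/N
```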
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded as necessary:

deterministic resampling is taken at every k-th time step (usually k = 1); dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
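A minimal Gibbs sampler sketch for a toy bivariate Gaussian with correlation rho, where both full conditionals are Gaussians in closed form (the target and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5_000
x1, x2 = 0.0, 0.0                  # arbitrary starting state
samples = []
for n in range(M):
    # alternate draws from the full conditionals of N(0, [[1, rho], [rho, 1]])
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))
print(np.corrcoef(np.array(samples).T)[0, 1])   # should be close to rho
```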
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four iteration steps in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and then computes the volume:

\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
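A 1-D sketch of the Laplace approximation on the toy integrand f(x) = x^3 e^{-x} (true integral 3! = 6), using a numerically located mode and a finite-difference Hessian of log f; the integrand is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize_scalar

log_f = lambda x: 3 * np.log(x) - x             # f(x) = x^3 exp(-x) on x > 0
res = minimize_scalar(lambda x: -log_f(x), bounds=(0.1, 50), method="bounded")
x_hat = res.x                                    # mode (analytically x_hat = 3)
h = 1e-4                                         # finite-difference Hessian of log f
hess = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2
laplace = np.sqrt(2 * np.pi) * np.exp(log_f(x_hat)) * (-hess) ** -0.5
print(laplace)                                   # ~5.84, versus the true value 6
```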
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) \, dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
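A sketch using Gauss-Hermite quadrature, a natural instance of this idea when the weighting function is Gaussian; numpy's hermgauss nodes assume the weight e^{-x^2}, so a change of variables maps them to a general N(mu, sigma^2):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, mu, sigma, m=20):
    """Approximate E[f(x)] for x ~ N(mu, sigma^2) with m quadrature points."""
    nodes, weights = hermgauss(m)              # exact for the weight exp(-x^2)
    x = mu + np.sqrt(2.0) * sigma * nodes      # change of variables
    return np.sum(weights * f(x)) / np.sqrt(np.pi)

print(gauss_hermite_expectation(np.cos, 0.0, 1.0))  # E[cos x] = e^{-1/2} ~ 0.6065
```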
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j \mathcal{N}(x | \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n | y_{0:n}).

By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{0:n}) \, dx_n
           = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) \, dx_n
           = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = \frac{p(y_{0:n} | x_n) p(x_n)}{q(x_n | y_{0:n})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n | y_{0:n}) \, dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) p(x_n) \, dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) q(x_n | y_{0:n}) \, dx_n}
          = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) q(x_{0:n-1}^{(i)} | y_{0:n-1})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).

Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2
    if N_{eff} < P then
        Generate new particles x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
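A runnable sketch of this loop, with the proposal, transition and likelihood densities passed in as user-supplied functions (the interface and names are assumptions for illustration, not the lecture's notation):

```python
import numpy as np

def sis_filter(ys, N, sample_prior, propose, trans_pdf, lik_pdf, prop_pdf,
               P_thresh, rng=np.random.default_rng()):
    """SIS with adaptive resampling; returns the filtered posterior means."""
    x = sample_prior(N, rng)                   # particles x_0^(i)
    w = np.full(N, 1.0 / N)
    means = []
    for y in ys:
        x_new = propose(x, y, rng)             # draw x_n ~ q(. | x_{n-1}, y_n)
        w *= lik_pdf(y, x_new) * trans_pdf(x_new, x) / prop_pdf(x_new, x, y)
        w /= w.sum()                           # normalize the weights
        x = x_new
        means.append(np.sum(w * x))            # posterior estimate before resampling
        if 1.0 / np.sum(w**2) < P_thresh:      # effective sample size test
            idx = rng.choice(N, size=N, p=w)   # resample with replacement
            x, w = x[idx], np.full(N, 1.0 / N)
    return np.array(means)
```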
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        Set \hat{x}_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
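A concrete sketch of this bootstrap loop on a toy scalar model, x_{n+1} = 0.9 x_n + w_n and y_n = x_n + v_n with Gaussian noises (the model and its parameters are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, N = 50, 1_000
x_true, ys = 0.0, []                           # simulate a trajectory to filter
for _ in range(T):
    x_true = 0.9 * x_true + rng.normal(0, 0.5)
    ys.append(x_true + rng.normal(0, 0.3))

x = rng.normal(0.0, 1.0, size=N)               # x_0^(i) ~ p(x_0)
for y in ys:
    x = 0.9 * x + rng.normal(0, 0.5, size=N)   # sample from the transition prior
    w = norm.pdf(y, loc=x, scale=0.3)          # weight update W = p(y_n | x_n)
    w /= w.sum()                               # normalize the weights
    x = x[rng.choice(N, size=N, p=w)]          # resample at every step
print(np.mean(x), x_true)                      # filtered mean vs. true final state
```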
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation I
Let p(xn|yn0) be the conditional pdf of xn
p(xn|yn0) =p(yn0|xn)p(xn)
p(yn0)
=p(yn ynminus10|xn)p(xn)
p(yn ynminus10)
=p(yn|ynminus10 xn)p(ynminus10|xn)p(xn)
p(yn|ynminus10)p(ynminus10)
=p(yn|ynminus10 xn)p(xn|ynminus10)p(ynminus10)p(xn)
p(yn|ynminus10)p(ynminus10)p(xn)
=p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 13 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights are normalized:
$$\bar{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$$
The variance of the estimate is given by
$$\mathrm{Var}[\bar{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \left(\frac{f(x)p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}$$
The variance can be reduced when $q(x)$ is chosen to
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)|p(x)$, so as to further reduce the true variance.
The estimator is biased but consistent.
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution $q(x)$ should have heavy tails, so as to be insensitive to outliers.
If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in \mathcal{X}$. The sampling proceeds as:
for n = 1 to N do
    Sample u ~ U(0, 1) and x ~ q(x)
    if u > p(x) / (C q(x)) then
        Reject x and repeat the sampling
    end if
end for
Figure: Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant $C$.
The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because of the low rejection rate;
if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior $p(x)$ is used as $q(x)$, and the likelihood satisfies $p(y|x) \le C$ with $C$ known, then
$$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$$
and the acceptance rate for a sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
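A small self-contained sketch of the accept/reject loop (vectorized NumPy; the standard-normal target, uniform proposal, and bound C = 10 are illustrative assumptions of mine):

import numpy as np

rng = np.random.default_rng(0)

def p(x):                       # unnormalized target: exp(-x^2/2), max value 1
    return np.exp(-0.5 * x**2)

def q(x):                       # proposal density: uniform on [-5, 5]
    return np.full_like(x, 0.1)

C = 10.0                        # then C*q(x) = 1 >= p(x) on [-5, 5]

x = rng.uniform(-5.0, 5.0, size=100_000)   # draws from q
u = rng.uniform(0.0, 1.0, size=x.shape)
samples = x[u <= p(x) / (C * q(x))]        # accept with probability p/(Cq)
print(len(samples), samples.mean(), samples.std())  # approx N(0, 1) draws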
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is prior knowledge of the constant $C$, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio $p(x)/Cq(x)$ is close to zero.
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$
then the importance sampling can be performed recursively.
Sequential Importance Sampling II
According to the telescope law of probability, we have
$$p(x_{0:n}) = p(x_0)p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1})$$
$$q(x_{0:n}) = q_0(x_0)q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1})$$
The weights can be recursively calculated as
$$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\,\frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}$$
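A sketch of this recursion in code (a hypothetical toy setup of mine: an AR(1) target chain and a fixed Gaussian proposal). Accumulating log-weights makes the incremental update explicit and numerically stable, and the final print already illustrates the weight degeneracy discussed below:

import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 50

def log_gauss(x, mean, std):
    return -0.5 * ((x - mean) / std)**2 - np.log(std) - 0.5 * np.log(2 * np.pi)

# Target chain p(x_t | x_{t-1}) = N(0.9 x_{t-1}, 1); proposal q_t = N(0, 2^2).
x_prev = rng.standard_normal(N)        # x_0^{(i)}, drawn from q_0 = p(x_0)
log_W = np.zeros(N)
for t in range(1, T):
    x_t = rng.normal(0.0, 2.0, size=N)                     # draw from q_t
    log_W += log_gauss(x_t, 0.9 * x_prev, 1.0) - log_gauss(x_t, 0.0, 2.0)
    x_prev = x_t

W = np.exp(log_W - log_W.max())        # normalize the weights
W /= W.sum()
print("largest normalized weight:", W.max())   # near 1: weight degeneracy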
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.
We will see that, in order to cope with this situation, a resampling step is used after weight normalization.
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights:
Sample N random samples {x^(i)}, i = 1, ..., N, from q(x)
for i = 1, ..., N do
    Compute the weights W^(i) ∝ p(x^(i)) / q(x^(i))
end for
for i = 1, ..., N do
    Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
end for
Resample with replacement N times from the discrete set {x^(i)}, i = 1, ..., N, where the probability of resampling each x^(i) is proportional to W̃(x^(i))
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded as necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.
The particles and associated importance weights {x^(i), W^(i)} are replaced by the new samples with equal importance weights (i.e. W^(i) = 1/N).
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
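A minimal multinomial resampling step, one of the many resampling methods mentioned above (the five-particle weight vector is made up for illustration):

import numpy as np

def multinomial_resample(particles, weights, rng):
    # Draw N indices with replacement, with probability proportional to weights.
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)   # equal weights after resampling

rng = np.random.default_rng(0)
particles = rng.standard_normal(5)
weights = np.array([0.70, 0.15, 0.10, 0.04, 0.01])    # skewed weights
new_particles, new_weights = multinomial_resample(particles, weights, rng)
print(new_particles)   # high-weight particles are replicated several times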
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to estimation of hierarchical or structured probabilistic models:
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
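A small sketch of a Gibbs sampler for a case where the conditionals are known in closed form (a zero-mean bivariate Gaussian with correlation rho; a standard textbook example of mine, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5_000

# For x = (x1, x2) ~ N(0, [[1, rho], [rho, 1]]), the full conditionals are
# p(x1 | x2) = N(rho * x2, 1 - rho^2) and p(x2 | x1) = N(rho * x1, 1 - rho^2).
samples = np.empty((M, 2))
x1 = x2 = 0.0
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))   # draw x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))   # draw x2 | x1
    samples[n] = (x1, x2)

print(np.corrcoef(samples[1000:].T))   # empirical correlation close to rho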
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well, and is efficient, in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good since the gradient direction might indicate the way to find the state with a higher probability.
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.
The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:
$$\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,|-\nabla\nabla \log f(\hat{x})|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
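A sketch of the Laplace formula in one dimension ($N_x = 1$), using a hypothetical unnormalized density of my choosing and finite differences for the Hessian of $\log f$:

import numpy as np

# Hypothetical unnormalized density; its mode is at x = 0.
f = lambda x: np.exp(-0.5 * x**2 - 0.25 * x**4)
log_f = lambda x: -0.5 * x**2 - 0.25 * x**4

xs = np.linspace(-4.0, 4.0, 400_001)
x_hat = xs[np.argmax(f(xs))]                      # crude mode search on a grid

h = 1e-3                                          # finite-difference Hessian
hess = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h**2

Z_laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * abs(-hess)**(-0.5)
Z_numeric = np.sum(f(xs)) * (xs[1] - xs[0])       # Riemann-sum reference
print(Z_laplace, Z_numeric)   # the Gaussian fit overestimates the volume here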
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$$
where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
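For instance, Gauss-Hermite quadrature handles the common case $p(x) = \mathcal{N}(x; 0, 1)$ on $(-\infty, \infty)$; the sketch below (NumPy's hermgauss, with the change of variable $x = \sqrt{2}\,t$) is my own illustration:

import numpy as np
from numpy.polynomial.hermite import hermgauss

m = 10
t, w = hermgauss(m)          # nodes/weights for integrals against exp(-t^2)

def gauss_expectation(f):
    # E[f(x)] for x ~ N(0, 1), via the substitution x = sqrt(2) * t.
    return np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

print(gauss_expectation(lambda x: x**2))   # ~1.0
print(gauss_expectation(np.cos))           # ~exp(-1/2) = 0.6065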
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.
If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so as to give greater resolution to the regions with high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher order moments.
We can empirically use the sample moments to approximate the true moments, namely
$$m_k = E[x^k] = \int_{\mathcal{X}} x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$$
where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
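In code, the sample-moment approximation is a one-liner per moment (a sketch; standard-normal samples stand in for draws from the true distribution):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)     # stand-in samples from the distribution

for k in (1, 2, 3, 4):
    m_k = np.mean(np.abs(x)**k)      # m_k ~ (1/N) * sum |x^(i)|^k
    print(k, m_k)
# For N(0,1): E|x| ~ 0.798, E|x|^2 = 1, E|x|^3 ~ 1.596, E|x|^4 = 3.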
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
$$p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(x; \mu_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to some accurate degree by a sufficiently large number of Gaussian mixture densities.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?
We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves over time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$$
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$.
By this approximation, we can estimate the mean of a nonlinear function:
$$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \bar{f}_N(x)$$
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define
$$W_n(x_n) = \frac{p(y_{0:n}|x_n)p(x_n)}{q(x_n|y_{0:n})}$$
We can then write
$$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}q(x_n|y_{0:n})\,dx_n = \int f(x_n)\frac{W_n(x_n)}{p(y_{0:n})}q(x_n|y_{0:n})\,dx_n$$
$$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)p(x_n)\,dx_n} = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)q(x_n|y_{0:n})\,dx_n} = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
Sequential Monte Carlo estimation III
We have written
$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate
$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)$$
where the normalized weights are defined as
$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$
As before, the posterior can be written as
$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\,\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$
We can then create a recursive rule to update the weights:
$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}$$
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.
Let's assume that $q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)$; then we can write
$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure for degeneracy is the so-called effective sample size:
$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$$
In practice this cannot be computed, so we approximate it as
$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{0:n}^{(i)}))^2}$$
When $N_{\text{eff}}$ falls below a threshold $P$, resampling is performed.
$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
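The approximation $\hat{N}_{\text{eff}}$ is cheap to compute from the normalized weights (a small sketch with made-up weight vectors):

import numpy as np

def effective_sample_size(W):
    W = np.asarray(W, dtype=float)
    W = W / W.sum()                 # make sure the weights are normalized
    return 1.0 / np.sum(W**2)

print(effective_sample_size(np.ones(100)))              # uniform -> 100.0
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))  # skewed -> ~1.06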
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^(i) ~ q(x_n | x_{0:n-1}^(i), y_{0:n})
        Set x_{0:n}^(i) = {x_{0:n-1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^(i) = W_{n-1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{0:n-1}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{0:n}^(i)))^2
    if N_eff < P then
        Generate new x_n^(j) by resampling with replacement N times from {x_{0:n}^(i)} with probability P(x_{0:n}^(j) = x_{0:n}^(i)) = W̃_n^(i)
        Reset the weights W_n^(i) = 1/N
    end if
end for
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the corresponding particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x̂_{0:n}^(i) = {x_{0:n-1}^(i), x̂_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x̂_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to the W̃_n^(i)
end for
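Putting the pieces together, here is a compact bootstrap filter on a hypothetical one-dimensional nonlinear model (a widely used benchmark; the specific dynamics, noise levels, and particle count are my own choices, not the lecture's):

import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 1_000
sig_w, sig_v = np.sqrt(10.0), 1.0

def f(x, n):   # nonlinear state transition
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * n)

def g(x):      # nonlinear measurement function
    return x**2 / 20.0

# Simulate a ground-truth trajectory and noisy observations.
x_true = np.zeros(T)
x_true[0] = rng.normal(0.0, 1.0)
for n in range(1, T):
    x_true[n] = f(x_true[n - 1], n) + rng.normal(0.0, sig_w)
y = g(x_true) + rng.normal(0.0, sig_v, size=T)

# Bootstrap (SIR) filter: proposal = transition prior, weight = likelihood.
particles = rng.normal(0.0, 1.0, size=N)
x_est = np.zeros(T)
for n in range(T):
    if n > 0:
        particles = f(particles, n) + rng.normal(0.0, sig_w, size=N)  # predict
    log_w = -0.5 * ((y[n] - g(particles)) / sig_v)**2   # p(y_n | x_n^(i))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                        # normalize weights
    x_est[n] = np.sum(w * particles)                    # posterior-mean estimate
    particles = particles[rng.choice(N, size=N, p=w)]   # resample every step

print("RMSE:", np.sqrt(np.mean((x_est - x_true)**2)))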
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to
$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: CONDENSATION
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It is a particular type of Markov chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample
Since the conditional densities to be sampled are low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
        for i = 1 to N_x do
                Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
        end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
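As a concrete illustration, here is a minimal Python sketch of the loop above for a bivariate Gaussian target with correlation rho, whose full conditionals are available in closed form; the target and rho = 0.8 are illustrative assumptions, not from the slides.

import numpy as np

def gibbs_bivariate_gaussian(M, rho=0.8, rng=np.random.default_rng(0)):
    chain = np.zeros((M, 2))
    x1, x2 = 0.0, 0.0                            # arbitrary initialization x_0
    for n in range(M):
        # draw x1 from the conditional p(x1 | x2) = N(rho * x2, 1 - rho^2)
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
        # draw x2 from the conditional p(x2 | x1) = N(rho * x1, 1 - rho^2)
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
        chain[n] = (x1, x2)
    return chain

chain = gibbs_bivariate_gaussian(10000)          # sample mean ~ (0, 0), sample corr ~ rho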
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space
Stratified sampling works very well and is efficient in spaces of not-too-high dimension
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random walk behavior
This is good since the gradient direction might indicate the way to find the states with higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability
By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:
\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) | -\nabla\nabla \log f(\hat{x}) |^{-1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)
where p(x) is treated as a weighting function and x_k is the quadrature point
The values x_k are determined by the weighting function p(x) on the interval [a, b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
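For instance, with a Gaussian weighting function the quadrature points and coefficients are those of Gauss-Hermite quadrature; the sketch below uses NumPy's hermgauss, with an illustrative nonlinearity f = cos.

import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, m=20):
    t, w = hermgauss(m)                      # nodes/weights for  int exp(-t^2) f(t) dt
    # the change of variables x = sqrt(2) t turns this into E[f(x)] for x ~ N(0, 1)
    return np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

print(gauss_hermite_expectation(np.cos))     # ~ exp(-1/2) = 0.6065...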
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0})
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions with high density
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^m c_j N(x; \mu_j, \Sigma_j)
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1
Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits a tractable solution, obtained by calculating the individual first- and second-order moments
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
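A minimal sketch of evaluating such a Gaussian sum in 1-D; the mixture parameters here are illustrative, not from the slides.

import numpy as np

def gaussian_sum_pdf(x, c, mu, sigma):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1
    comps = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comps @ c

x = np.linspace(-6.0, 6.0, 200)
p = gaussian_sum_pdf(x, c=np.array([0.3, 0.7]), mu=np.array([-2.0, 1.5]),
                     sigma=np.array([0.8, 1.2]))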
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated
The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0})
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
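A minimal sketch of this sample estimate, using simulated stand-ins for the posterior draws; the Gaussian "posterior" and f = sin are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, 10000)      # stand-ins for x_n^(i) ~ p(x_n | y_{n:0})
f_hat = np.mean(np.sin(x))           # f_N = (1/N) sum_i f(x^(i))
print(f_hat)                         # ~ sin(1) * exp(-0.5**2 / 2) = 0.74...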
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define
W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}
We can then write
E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
= \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
= \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
= \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
= \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}
By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N W(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)
where the normalized weights are defined as
W(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
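A minimal sketch of this self-normalized estimate at a single time step; the unnormalized target (proportional to N(1, 1)) and the N(0, 2^2) proposal are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, 10000)                          # x^(i) ~ q = N(0, 2^2)
q = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
p = np.exp(-0.5 * (x - 1.0) ** 2)                        # unnormalized target, prop. to N(1, 1)
W = p / q                                                # unnormalized weights W_n(x^(i))
W /= W.sum()                                             # normalized weights
f_hat = np.sum(W * x ** 2)                               # estimate of E[x^2] = 1^2 + 1 = 2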
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})
As before, the posterior can be written as
p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})} \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0})
Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure of degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N
In practice this cannot be computed, so we approximate it as
N_{eff} \approx \frac{1}{\sum_{i=1}^N (W(x_{n:0}^{(i)}))^2}
When N_{eff} is below a threshold P, resampling is performed
N_{eff} can also be used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
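A minimal sketch of the approximate N_eff test; W is assumed to hold the normalized importance weights, and the threshold P = N/2 is a common but illustrative choice.

import numpy as np

def effective_sample_size(W):
    return 1.0 / np.sum(W ** 2)                 # N_eff ~ 1 / sum_i W_i^2

W = np.array([0.70, 0.10, 0.10, 0.05, 0.05])    # skewed normalized weights
if effective_sample_size(W) < 0.5 * len(W):     # threshold P = N/2
    print("resample")                           # N_eff ~ 1.94 < 2.5, so resample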
SIS particle filter with resampling
for n = 0, ..., T do
        for i = 1, ..., N do
                Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
                Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
        end for
        for i = 1, ..., N do
                Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
        end for
        for i = 1, ..., N do
                Normalize the weights W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
        end for
        Compute N_eff = 1 / \sum_{i=1}^N (W(x_{n:0}^{(i)}))^2
        if N_eff < P then
                Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_n^{(i)}
                Reset the weights W_n^{(i)} = 1/N
        end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
        Sample x_0^{(i)} ~ p(x_0)
        Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
        for i = 1, ..., N do
                Importance sampling: \hat{x}_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
        end for
        Set \hat{x}_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}}
        for i = 1, ..., N do
                Weight update: W_n^{(i)} = p(y_n | \hat{x}_n^{(i)})
        end for
        for i = 1, ..., N do
                Normalize the weights W(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
        end for
        Resampling: generate N new particles x_n^{(i)} from the set {\hat{x}_n^{(i)}} according to the weights W_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
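A minimal, self-contained Python sketch of this bootstrap filter for a hypothetical 1-D linear-Gaussian model (x_n = 0.9 x_{n-1} + w_n, y_n = x_n + v_n); the model and all its parameters are illustrative assumptions, not part of the slides. Note that the posterior mean is computed from the weighted particles before resampling, matching the later remark that posterior statistics should be calculated before the resampling step.

import numpy as np

def bootstrap_filter(y, N=1000, a=0.9, q_std=1.0, r_std=0.5, rng=np.random.default_rng(0)):
    x = rng.normal(0.0, 1.0, N)                       # x_0^(i) ~ p(x_0)
    means = []
    for yn in y:
        x = a * x + rng.normal(0.0, q_std, N)         # propose from the transition prior
        W = np.exp(-0.5 * ((yn - x) / r_std) ** 2)    # weight update W_n^(i) = p(y_n | x_n^(i))
        W /= W.sum()                                  # normalize the weights
        means.append(np.sum(W * x))                   # posterior-mean estimate (before resampling)
        x = x[rng.choice(N, size=N, p=W)]             # resample every step
    return np.array(means)

# Simulate observations from the same model, then run the filter.
rng = np.random.default_rng(1)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal(0.0, 1.0)
    ys.append(x_true + rng.normal(0.0, 0.5))
print(bootstrap_filter(np.array(ys))[-5:])            # filtered means for the last 5 steps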
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed, so the SIS filter is less computationally expensive
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}
where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
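As a concrete sketch of this sample approximation (Python/NumPy; the Gaussian "posterior" below is an illustrative stand-in, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Pretend these are N i.i.d. samples x_n^(i) from the posterior p(x_n | y_{n:0});
# here we fake them with a Gaussian for illustration.
N = 10_000
samples = rng.normal(loc=1.0, scale=0.5, size=N)

# E[f(x_n)] for a nonlinear f is approximated by the sample mean of f(x^(i)).
f = lambda x: np.sin(x) ** 2
f_hat = np.mean(f(samples))
print(f_hat)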
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
          = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
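A minimal sketch of this self-normalized estimator (Python, using SciPy's Gaussian pdf; the unnormalized target and the proposal below are illustrative assumptions):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 10_000

# Proposal q: wide Gaussian; unnormalized target p(y|x)p(x): narrow Gaussian at 1.
x = rng.normal(0.0, 2.0, size=N)                       # x^(i) ~ q
W = norm.pdf(x, 1.0, 0.5) / norm.pdf(x, 0.0, 2.0)      # W_n(x^(i))

W_tilde = W / W.sum()                                  # normalized weights
f = lambda x: x                                        # estimate the posterior mean
f_hat = np.sum(W_tilde * f(x))                         # self-normalized estimate
print(f_hat)                                           # close to 1.0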
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(\tilde{W}(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it by

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
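The N_eff approximation is a one-liner on the normalized weights; a small sketch:

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~ 1 / sum_i w_i^2 for normalized weights w_tilde.
    return 1.0 / np.sum(w_tilde ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])     # skewed normalized weights
print(effective_sample_size(w))               # ~1.9: heavy degeneracy
print(effective_sample_size(np.full(5, 0.2))) # 5.0: no degeneracy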
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
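A rough Python transcription of one iteration of this loop. The callables propose, transition_pdf, likelihood and proposal_pdf are placeholders for the model at hand, not names from the slides:

import numpy as np

def sis_step(particles, weights, y, propose, transition_pdf, likelihood,
             proposal_pdf, rng, threshold):
    # One SIS iteration with adaptive resampling via the N_eff test.
    N = len(particles)
    new = np.array([propose(xp, y, rng) for xp in particles])  # x_n^(i) ~ q
    weights = weights * likelihood(y, new) * transition_pdf(new, particles) \
                      / proposal_pdf(new, particles, y)        # recursive weight update
    weights = weights / weights.sum()                          # normalize
    n_eff = 1.0 / np.sum(weights ** 2)                         # effective sample size
    if n_eff < threshold:
        idx = rng.choice(N, size=N, p=weights)                 # resample with replacement
        new, weights = new[idx], np.full(N, 1.0 / N)           # reset weights to 1/N
    return new, weights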
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
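To make the bootstrap filter concrete, a toy run on an assumed 1-D Gaussian random-walk model (the model itself is illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(2)
T, N = 50, 1000
sig_w, sig_v = 0.3, 0.5

# Simulate a hidden random walk x and noisy observations y.
x_true = np.cumsum(rng.normal(0, sig_w, T))
y = x_true + rng.normal(0, sig_v, T)

particles = rng.normal(0, 1, N)                         # x_0^(i) ~ p(x_0)
estimates = []
for n in range(T):
    particles = particles + rng.normal(0, sig_w, N)     # x_n ~ p(x_n | x_{n-1})
    w = np.exp(-0.5 * ((y[n] - particles) / sig_v) ** 2)  # W_n = p(y_n | x_n)
    w /= w.sum()
    estimates.append(np.sum(w * particles))             # MMSE estimate before resampling
    idx = rng.choice(N, size=N, p=w)                    # resample every step
    particles = particles[idx]

print(np.mean((np.array(estimates) - x_true) ** 2))     # small tracking error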
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution plays a crucial role in the final performance of both SIS and SIR filters.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Recursive Bayesian estimation II
The posterior density is described with three terms:

p(x_n|y_{n:0}) = \frac{p(y_n|x_n) p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}

Prior: defines the knowledge of the model

p(x_n|y_{n-1:0}) = \int p(x_n|x_{n-1}) p(x_{n-1}|y_{n-1:0}) dx_{n-1}

Likelihood: p(y_n|x_n) determines the measurement noise model.

Evidence: which involves

p(y_n|y_{n-1:0}) = \int p(y_n|x_n) p(x_n|y_{n-1:0}) dx_n

We need to define a criterion for optimal filtering.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

E[||x_n - \hat{x}_n||_2^2 \,|\, y_{n:0}] = \int ||x_n - \hat{x}_n||_2^2 \, p(x_n|y_{n:0}) dx_n

which aims to find the conditional mean

\hat{x}_n = E[x_n|y_{n:0}] = \int x_n p(x_n|y_{n:0}) dx_n

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{n:0}).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
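With posterior samples in hand, the three estimates are easy to compare; a sketch on a skewed distribution (the lognormal "posterior" is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(3)
samples = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)  # skewed "posterior"

x_mmse = samples.mean()                                      # conditional mean
counts, edges = np.histogram(samples, bins=200)
x_map = edges[counts.argmax()] + 0.5 * np.diff(edges)[0]     # crude histogram mode
x_minimax = np.median(samples)                               # median

print(x_mmse, x_map, x_minimax)                              # three different answers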
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)} dx dy

Minimum conditional KL divergence:

KL(p||\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y) p(x)} dx dy

where the KL divergence is a non-negative measure of discrepancy between distributions, KL(p||\hat{p}) \ge 0; it is 0 only when the two distributions are the same.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.

Minimum free energy: a variational criterion (minimizing the free energy maximizes a lower bound on the log-likelihood), which aims to minimize

F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
        = E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
        = KL(Q||P) + H(Q)

This minimization can be done with the expectation-maximization (EM) algorithm

Q(x_{n+1}) \leftarrow \arg\min_Q F(Q; P)
\hat{x}_{n+1} \leftarrow \arg\min_x F(Q; P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

MAP has a drawback, especially in high-dimensional spaces: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence, the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented, and they are closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE

E[||x_n - \hat{x}_n||_2^2 \,|\, y_{n:0}] = \int ||x_n - \hat{x}_n||_2^2 \, p(x_n|y_{n:0}) dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice, we are interested in the discrete simplified case

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case this is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages

\alpha(z_n) = \mathcal{N}(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1}) p(z_n|z_{n-1}) dz_{n-1}

Substituting the conditionals, we have

c_n \mathcal{N}(z_n | \mu_n, V_n) = \mathcal{N}(x_n | C z_n, \Sigma) \int \mathcal{N}(z_{n-1} | \mu_{n-1}, V_{n-1}) \mathcal{N}(z_n | A z_{n-1}, \Gamma) dz_{n-1}
                                  = \mathcal{N}(x_n | C z_n, \Sigma) \mathcal{N}(z_n | A \mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + \Gamma
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, with mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = \mathcal{N}(x_n | C A \mu_{n-1}, C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0)     V_1 = (I - K_1 C) V_0
c_1 = \mathcal{N}(x_1 | C \mu_0, C V_0 C^T + \Sigma)     K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

The interpretation is making a prediction and then doing a correction with the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^{N} c_n
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
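A direct transcription of this forward pass (predict with A, correct with the gain K_n), assuming the model matrices A, C, Gamma, Sigma are known:

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    # Forward (filtering) pass: returns the posterior means mu_n and covariances V_n.
    mu, V = mu0, V0
    mus, Vs = [], []
    for n, x in enumerate(X):
        if n == 0:
            P = V0                                  # initial step uses V_0 directly
        else:
            mu = A @ mu                             # predict the mean
            P = A @ V @ A.T + Gamma                 # P_{n-1} = A V_{n-1} A^T + Gamma
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)  # Kalman gain K_n
        mu = mu + K @ (x - C @ mu)                  # correct with the innovation
        V = (np.eye(len(mu)) - K @ C) @ P
        mus.append(mu); Vs.append(V)
    return mus, Vs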
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF), or numerical approximations (e.g. particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \frac{d f(x)}{d x}\Big|_{x = \hat{x}_n}     G_{n+1,n} = \frac{d g(x)}{d x}\Big|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
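The linearization step can be sketched with a finite-difference Jacobian; the toy dynamics f below is an assumption for illustration:

import numpy as np

def jacobian(func, x, eps=1e-6):
    # Numerical Jacobian of func at x (forward finite differences).
    fx = func(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (func(x + dx) - fx) / eps
    return J

f = lambda x: np.array([x[0] + 0.1 * np.sin(x[1]), 0.9 * x[1]])  # toy dynamics
x_hat = np.array([1.0, 0.5])
F = jacobian(f, x_hat)    # F_{n+1,n}: plug into the standard KF equations
print(F)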
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique which, provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), gives

\int_X f(x) dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem

\sqrt{N} (\hat{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
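The O(N^{-1/2}) behavior is easy to observe empirically; a small sketch estimating E[x^2] = 1 for x ~ N(0,1):

import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x ** 2            # E[f] = 1 exactly for x ~ N(0, 1)

for N in [100, 10_000, 1_000_000]:
    est = np.mean(f(rng.normal(size=N)))
    print(N, abs(est - 1.0))    # error shrinks roughly like 1/sqrt(N)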
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

\int_X f(x) dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x) dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^{N} W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
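A sketch of this estimator on the classic use case, estimating a Gaussian tail probability with a proposal centered on the region of importance (SciPy used for the Gaussian pdf/cdf; plain Monte Carlo would waste nearly all its samples here):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
N = 100_000

# Estimate P(x > 4) for x ~ p = N(0,1), i.e. f(x) = 1{x > 4}.
x = rng.normal(4.0, 1.0, size=N)                   # q = N(4,1), the region of importance
W = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 4.0, 1.0)  # importance weights p/q
est = np.mean((x > 4.0) * W)
print(est, 1.0 - norm.cdf(4.0))                    # both ~3.17e-5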
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized, we use

\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)}) f(x^{(i)})   with   \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \left(\frac{f(x) p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(.) is not close to p(.), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling procedure is:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), and the likelihood p(y|x) \le C where C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
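A sketch of the loop above for an illustrative target, a Beta(2,5) density bounded by a uniform proposal (the target and the constant C are our choices, not from the slides):

import numpy as np

rng = np.random.default_rng(6)

p = lambda x: 30.0 * x * (1.0 - x) ** 4   # Beta(2,5) density on [0,1]; max ~2.46 at x=0.2
C = 2.5                                   # bound: p(x) <= C * q(x) with q = U(0,1)

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                     # x ~ q
    u = rng.uniform()                     # u ~ U(0,1)
    if u <= p(x) / C:                     # accept; otherwise repeat sampling
        samples.append(x)

print(np.mean(samples))                   # ~2/7 ~ 0.286, the Beta(2,5) mean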
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) \cdots p(x_n|x_0, ..., x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) \cdots q_n(x_n|x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
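The resampling step itself can be written as multinomial resampling in a few lines (one common scheme; the slides do not prescribe a particular one):

import numpy as np

def multinomial_resample(particles, w_tilde, rng):
    # Draw N indices with replacement, with P(index = i) = w_tilde[i].
    N = len(particles)
    idx = rng.choice(N, size=N, p=w_tilde)
    return particles[idx], np.full(N, 1.0 / N)   # equal weights after resampling

rng = np.random.default_rng(7)
parts = np.array([0.0, 1.0, 2.0, 3.0])
w = np.array([0.05, 0.05, 0.1, 0.8])             # skewed weights
print(multinomial_resample(parts, w, rng))       # mostly replicates of 3.0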
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded necessary:

Deterministic resampling is taken at every k time steps (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power; when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
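A sketch of the alternating scheme for a bivariate Gaussian with correlation rho, where both conditionals are available in closed form:

import numpy as np

rng = np.random.default_rng(8)
rho, M = 0.9, 20_000

x1, x2 = 0.0, 0.0
chain = []
for n in range(M):
    # For a standard bivariate Gaussian, p(x1 | x2) = N(rho * x2, 1 - rho^2),
    # and symmetrically for x2.
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
    chain.append((x1, x2))

chain = np.array(chain[1000:])          # drop burn-in
print(np.corrcoef(chain.T)[0, 1])       # ~0.9, the target correlation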
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
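A 1-D sketch of the Laplace formula on an assumed unimodal integrand (finite differences stand in for the Hessian of -log f):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

f = lambda x: np.exp(-x ** 2 / 2.0 - x ** 4 / 4.0)   # assumed unimodal integrand
neg_log_f = lambda x: x ** 2 / 2.0 + x ** 4 / 4.0    # -log f(x)

# Find the maximum x_hat of f by minimizing -log f.
x_hat = minimize_scalar(neg_log_f, bounds=(-2.0, 2.0), method="bounded").x

# Second derivative of -log f at x_hat by central finite differences.
h = 1e-4
hess = (neg_log_f(x_hat + h) - 2.0 * neg_log_f(x_hat) + neg_log_f(x_hat - h)) / h ** 2

laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * abs(hess) ** -0.5   # N_x = 1 here
print(laplace, quad(f, -np.inf, np.inf)[0])          # Laplace estimate vs numerical truth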
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR (sampling-importance resampling).
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
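A sketch of one bootstrap/SIR step for the same assumed toy model as in the SIS sketch above; the transition prior is the proposal, the likelihood gives the weights, and resampling happens at every iteration:

import numpy as np

rng = np.random.default_rng(2)

def sir_step(x, y_n, a=0.9, q_std=1.0, r_std=0.5):
    # Propagate with the transition prior: x_n^(i) ~ p(x_n | x_{n-1}^(i))
    x = a * x + rng.normal(0.0, q_std, size=x.shape[0])
    # Weight update: W_n^(i) = p(y_n | x_n^(i)), then normalize
    W = np.exp(-0.5 * ((y_n - x) / r_std) ** 2)
    W = W / W.sum()
    # Resample N new particles according to W_n^(i); weights become uniform
    return x[rng.choice(x.shape[0], size=x.shape[0], p=W)]

x = rng.normal(0.0, 1.0, size=1000)
for y_n in (0.3, 0.1, -0.2):          # a few dummy observations
    x = sir_step(x, y_n)
print(x.mean())                       # posterior mean estimate after resampling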
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is hence less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to
$$W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Recursive Bayesian estimation II
The posterior density is described by three terms:
$$p(x_n|y_{n:0}) = \frac{p(y_n|x_n)\, p(x_n|y_{n-1:0})}{p(y_n|y_{n-1:0})}$$
Prior: defines the knowledge of the model,
$$p(x_n|y_{n-1:0}) = \int p(x_n|x_{n-1})\, p(x_{n-1}|y_{n-1:0})\, dx_{n-1}$$
Likelihood: $p(y_n|x_n)$ determines the measurement noise model.
Evidence: the denominator, which involves
$$p(y_n|y_{n-1:0}) = \int p(y_n|x_n)\, p(x_n|y_{n-1:0})\, dx_n$$
We need to define a criterion for optimal filtering.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.
Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error,
$$E[\|x_n - \hat{x}_n\|_2^2\, |\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n$$
which aims to find the conditional mean
$$\hat{x}_n = E[x_n|y_{n:0}] = \int x_n\, p(x_n|y_{n:0})\, dx_n$$
Maximum a posteriori (MAP): aims to find the mode of the posterior probability $p(x_n|y_{n:0})$.
Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.
Minimax: aims to find the median of the posterior $p(x_n|y_{n:0})$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
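A small numpy sketch contrasting the three point estimates on sample-based posteriors; the Gamma distribution here is an illustrative stand-in for a skewed $p(x_n|y_{n:0})$:

import numpy as np

rng = np.random.default_rng(3)
samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # skewed "posterior" samples

mmse = samples.mean()                                     # MMSE: conditional mean
median = np.median(samples)                               # minimax: posterior median
hist, edges = np.histogram(samples, bins=200)
map_est = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])  # MAP: mode

print(mmse, median, map_est)   # roughly 2.0, 1.68 and 1.0 for Gamma(2, 1)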
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.
Minimum conditional inaccuracy: defined as
$$E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\, dx\, dy$$
Minimum conditional KL divergence:
$$KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)\, p(x)}\, dx\, dy$$
where the KL divergence is a non-negative measure of discrepancy between distributions, $KL(p\|\hat{p}) \ge 0$, and is 0 only when the distributions are the same.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.
Minimum free energy: an upper bound on the negative log-likelihood, which is to be minimized:
$$F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) + H(Q)$$
This minimization can be done using the (EM) algorithm:
$$Q(x_{n+1}) \leftarrow \arg\min_Q F(Q; P), \qquad x_{n+1} \leftarrow \arg\min_x F(Q; P)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior, likelihood and evidence.
MAP requires the computation of the prior and likelihood, but not the denominator (an integration), and is thereby computationally less expensive.
The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.
A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.
Hence the width of the mode is more important than its height in the high-dimensional case.
The last three criteria are all ML-oriented and closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:
$$E[\|x_n - \hat{x}_n\|_2^2\, |\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n$$
Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).
In general, we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$
When the dynamic system is linear Gaussian, this reduces to
$$x_{n+1} = F_{n+1,n}\, x_n + w_n$$
$$y_n = G_n\, x_n + v_n$$
with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix.
This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.
In the general case this is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
$$\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)$$
Using the HMM recursion formulas for continuous variables, we have
$$c_n\, \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\, p(z_n|z_{n-1})\, dz_{n-1}$$
Substituting the conditionals, we have
$$c_n\, \mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|C z_n, \Sigma) \int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\, \mathcal{N}(z_n|A z_{n-1}, \Gamma)\, dz_{n-1} = \mathcal{N}(x_n|C z_n, \Sigma)\, \mathcal{N}(z_n|A \mu_{n-1}, P_{n-1})$$
Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined
$$P_{n-1} = A V_{n-1} A^T + \Gamma$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of $\mu_{n-1}$, $V_{n-1}$ and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$, having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:
$$\mu_n = A \mu_{n-1} + K_n(x_n - C A \mu_{n-1})$$
$$V_n = (I - K_n C) P_{n-1}$$
$$c_n = \mathcal{N}(x_n|C A \mu_{n-1},\, C P_{n-1} C^T + \Sigma)$$
where the Kalman gain matrix is defined as
$$K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}$$
The initial conditions are given by
$$\mu_1 = \mu_0 + K_1(x_1 - C \mu_0), \quad V_1 = (I - K_1 C) V_0$$
$$c_1 = \mathcal{N}(x_1|C \mu_0,\, C V_0 C^T + \Sigma), \quad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}$$
Interpretation: we are making predictions and doing corrections with the gain $K_n$.
The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
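A minimal sketch of this forward recursion in numpy, following the notation above ($A$ transition, $C$ emission, $\Gamma$ and $\Sigma$ the noise covariances); the 1-D model at the bottom is only an illustrative smoke test:

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    mu, V = mu0, V0
    out = []
    for n, x in enumerate(X):
        if n == 0:
            P, pred = V0, mu0                  # initial step uses mu_0, V_0
        else:
            P = A @ V @ A.T + Gamma            # P_{n-1} = A V_{n-1} A^T + Gamma
            pred = A @ mu                      # predicted mean A mu_{n-1}
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)   # Kalman gain K_n
        mu = pred + K @ (x - C @ pred)         # correction with the innovation
        V = (np.eye(mu0.shape[0]) - K @ C) @ P # V_n = (I - K_n C) P_{n-1}
        out.append((mu, V))
    return out

A = np.array([[0.9]]); C = np.array([[1.0]])
Gamma = np.array([[1.0]]); Sigma = np.array([[0.25]])
X = np.array([[0.5], [0.7], [0.4]])
print(kalman_forward(X, A, C, Gamma, Sigma, np.array([0.0]), np.array([[1.0]]))[-1][0])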
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
Nonlinear filtering consists in finding $p(x|y_{n:0})$.
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
$$x_{n+1} = f(x_n, w_n)$$
$$y_n = g(x_n, v_n)$$
These equations are linearized in the EKF:
$$F_{n+1,n} = \left.\frac{d f(x)}{d x}\right|_{x=\hat{x}_n}, \qquad G_{n+1,n} = \left.\frac{d g(x)}{d x}\right|_{x=\hat{x}_{n|n-1}}$$
Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior $p(x_n|y_{n:0})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique which, provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, approximates
$$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$
for which $E[\hat{f}_N] = E[f]$ and $Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}$.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ almost surely.
The convergence rate is assessed by the Central Limit Theorem:
$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$
where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
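A tiny illustration of the $O(N^{-1/2})$ behavior, estimating $E[x^2] = 1$ for $x \sim \mathcal{N}(0, 1)$:

import numpy as np

rng = np.random.default_rng(4)
for N in (100, 10_000, 1_000_000):
    f_N = np.mean(rng.normal(size=N) ** 2)   # Monte Carlo estimate of E[x^2] = 1
    print(N, f_N, abs(f_N - 1.0))            # error shrinks roughly 10x per 100x samples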
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
$$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$
There are two fundamental problems:
How to draw samples from a probability distribution $P(x)$?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\, dP(x)$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and uncertainties of the model.
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.
The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:
$$\int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)}\, q(x)\, dx$$
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from $q(x)$ to approximate
$$\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)})$$
where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights sum to one, we normalize:
$$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)}), \quad \text{with } \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$$
The variance of the estimate is given by
$$Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \frac{(f(x)\, p(x))^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}$$
The variance can be reduced when $q(x)$ is chosen to:
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)|\, p(x)$, so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution $q(x)$ should have a heavy tail, so as to be insensitive to outliers.
If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant $C < \infty$ such that $p(x) < C q(x)$ for every $x \in X$; the sampling procedure is:
for n = 1 to N do
  Sample $x \sim q(x)$
  Sample $u \sim U(0, 1)$
  if $u > \frac{p(x)}{C q(x)}$ then
    Repeat sampling
  end if
end for
Figure: Importance (left) and rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant $C$.
The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because of the low rejection rate (the bound $p(x) < C q(x)$ may be violated);
if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior $p(x)$ is used as $q(x)$, the likelihood $p(y|x) \le C$, and $C$ is known, then
$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)$$
and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
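A runnable sketch of the procedure above for an assumed target, Beta(2, 2) with density $p(x) = 6x(1-x)$ on $[0, 1]$, and a uniform proposal; $C = 1.5$ (the maximum of $p$) makes the bound hold:

import numpy as np

rng = np.random.default_rng(5)
C = 1.5
p = lambda x: 6.0 * x * (1.0 - x)      # Beta(2, 2) density; p(x) <= C * 1 on [0, 1]

samples, tries = [], 0
while len(samples) < 10_000:
    x, u = rng.uniform(), rng.uniform()   # x ~ q = U(0, 1), u ~ U(0, 1)
    tries += 1
    if u <= p(x) / C:                     # accept with probability p(x) / (C q(x))
        samples.append(x)

print(np.mean(samples))                   # ~0.5, the Beta(2, 2) mean; draws are exact
print(len(samples) / tries)               # acceptance rate ~ 1 / C = 2 / 3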
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio $p(x)/(C q(x))$ is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form,
$$q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0})$$
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
$$p(x_{n:0}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})$$
The weights can be recursively calculated as
$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.
We will now see that, to cope with this situation, a resampling step is used after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights:
Sample N random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for i = 1, ..., N do
  Compute $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
  Normalize weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
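The slides do not mandate a particular resampling scheme; below is a sketch of one common choice, systematic resampling, with multinomial resampling (numpy's choice with probabilities) being the direct alternative:

import numpy as np

def systematic_resample(W_tilde, rng):
    # One uniform draw spaces N stratified positions over [0, 1); low-variance scheme
    N = W_tilde.shape[0]
    positions = (rng.uniform() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(W_tilde), positions)

rng = np.random.default_rng(6)
W = np.array([0.1, 0.2, 0.6, 0.1])
print(systematic_resample(W, rng))   # indices: heavy-weight particles appear often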
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded as necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.
The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenating the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.
Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
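A sketch of the sampler for an assumed bivariate Gaussian target with correlation $\rho$, where both full conditionals are 1-D Gaussians:

import numpy as np

rng = np.random.default_rng(7)
rho, M = 0.8, 5_000
x1 = x2 = 0.0
chain = np.zeros((M, 2))
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # draw x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # draw x2 | x1
    chain[n] = (x1, x2)

print(np.corrcoef(chain[500:].T)[0, 1])   # ~0.8 after discarding burn-in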
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contours).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good, since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.
The Laplace approximation method approximates the integral of a function $\int f(x)\, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$ and further computing the volume:
$$\int f(x)\, dx \approx (2\pi)^{N_x/2} f(\hat{x})\, |-\nabla\nabla \log f(\hat{x})|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k\, f(x_k)$$
where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
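A concrete instance of this rule with a Gaussian weighting function, as a sketch; numpy ships Gauss-Hermite nodes for the weight $e^{-x^2}$, so a change of variables maps them to $\mathcal{N}(0, 1)$:

import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(10)   # for weight exp(-x^2)
f = lambda x: x ** 4                                   # a smooth integrand
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(approx)                                          # E[x^4] = 3 for x ~ N(0, 1)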
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; then a grid-based method can be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions of high density.
The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
$$m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k$$
where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
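A short sample-moment sketch with an assumed standard normal target:

import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=100_000)
for k in (1, 2, 3, 4):
    print(k, np.mean(x ** k))   # ~0, 1, 0, 3: the standard-normal moments m_k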
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
$$p(x) = \sum_{j=1}^m c_j\, \mathcal{N}(x|\mu_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.
Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
$$p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$$
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.
With this approximation, we can estimate the mean of a nonlinear function:
$$E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{n:0})\, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Recursive Bayesian estimation II
The posterior densisty is described with three terms
p(xn|yn0) =p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
Prior defines the knowledge of the model
p(xn|ynminus10) =
intp(xn|xnminus1)p(xnminus1|ynminus10)dxnminus1
Likelihood p(yn|xn) determines the measurement noise model
Evidence which involves
p(yn|ynminus10) =
intp(yn|xn)p(xn|ynminus10)dxn
We need to define a criteria for optimal filtering
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 14 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of $\mu_{n-1}$ and $V_{n-1}$, and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$, having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:

$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$
$V_n = (I - K_n C)\, P_{n-1}$
$c_n = \mathcal{N}(x_n \mid CA\mu_{n-1},\, C P_{n-1} C^T + \Sigma)$

where the Kalman gain matrix is defined as

$K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}$

The initial conditions are given by

$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \quad V_1 = (I - K_1 C)V_0$
$c_1 = \mathcal{N}(x_1 \mid C\mu_0,\, C V_0 C^T + \Sigma), \quad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}$

Interpretation: make a prediction, then correct it with the gain $K_n$.

The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
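As a concrete illustration of the forward recursion above, here is a minimal NumPy sketch; the function name and argument layout are my own, and the model follows the slide's notation ($z_n = A z_{n-1} + w$, $x_n = C z_n + v$):

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Filtered means/covariances and log-likelihood for the linear-Gaussian model."""
    mu, V = mu0, V0
    mus, Vs, loglik = [], [], 0.0
    for n, x in enumerate(X):
        if n == 0:
            P = V0                                  # initial step uses V0 directly
        else:
            mu, P = A @ mu, A @ V @ A.T + Gamma     # predict: P_{n-1} = A V A^T + Gamma
        S = C @ P @ C.T + Sigma                     # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)              # Kalman gain K_n
        r = x - C @ mu                              # innovation x_n - C A mu_{n-1}
        mu = mu + K @ r                             # corrected mean mu_n
        V = (np.eye(len(mu)) - K @ C) @ P           # corrected covariance V_n
        loglik += -0.5 * (r @ np.linalg.solve(S, r)
                          + np.log(np.linalg.det(2 * np.pi * S)))  # log c_n
        mus.append(mu); Vs.append(V)
    return np.array(mus), np.array(Vs), loglik

The running sum of the log c_n terms gives the log-likelihood log p(X) from the slide.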
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

The nonlinear filtering problem consists in finding $p(x_n|y_{0:n})$.

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
$x_{n+1} = f(x_n, w_n)$
$y_n = g(x_n, v_n)$
These equations are linearized in the EKF
$F_{n+1,n} = \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}_n}, \qquad G_{n+1,n} = \left.\frac{\partial g(x)}{\partial x}\right|_{x=\hat{x}_{n|n-1}}$

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior $p(x_n|y_{0:n})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
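A minimal sketch of one EKF step, assuming additive noise ($x' = f(x) + w$, $y = g(x) + v$) and using finite differences in place of analytic Jacobians; the helper names and signatures are illustrative, not from the source:

import numpy as np

def num_jacobian(h, x, eps=1e-6):
    """Central finite-difference Jacobian dh/dx."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((len(h(x)), len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (h(x + dx) - h(x - dx)) / (2 * eps)
    return J

def ekf_step(mu, V, y, f, g, Q, R):
    """One predict/correct cycle of the extended Kalman filter."""
    F = num_jacobian(f, mu)                  # F_{n+1,n}, linearized at the estimate
    mu_pred = f(mu)                          # predicted state
    P = F @ V @ F.T + Q                      # predicted covariance
    G = num_jacobian(g, mu_pred)             # measurement Jacobian at the prediction
    S = G @ P @ G.T + R                      # innovation covariance
    K = P @ G.T @ np.linalg.inv(S)           # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))  # correct with the innovation
    V_new = (np.eye(len(mu)) - K @ G) @ P
    return mu_new, V_new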
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique which, provided that one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, approximates

$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f\big(x^{(i)}\big) = \bar{f}_N$

for which $E[\bar{f}_N] = E[f]$ and $\mathrm{Var}[\bar{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\bar{f}_N(x)$ converges to $E[f(x)]$ almost surely.

The convergence rate is assessed by the Central Limit Theorem:

$\sqrt{N}\big(\bar{f}_N - E[f]\big) \sim \mathcal{N}(0, \sigma^2)$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
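A short NumPy check of the $O(N^{-1/2})$ behavior; the integrand and distribution are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) ** 2          # any integrand f

# Estimate E[f(x)] for x ~ N(0,1); the standard error shrinks like 1/sqrt(N),
# independently of the dimension of x.
for N in [10**2, 10**4, 10**6]:
    x = rng.standard_normal(N)
    print(N, f(x).mean(), f(x).std() / np.sqrt(N))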
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f\big(x^{(i)}\big) = \bar{f}_N$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x)\, dP(x)$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.

Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx$

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses $N$ independent samples drawn from $q(x)$ to approximate

$\bar{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)})$

where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$

The variance of the estimate is given by

$\mathrm{Var}[\bar{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \frac{\big(f(x)p(x)\big)^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}$

The variance can be reduced when $q(x)$ is chosen to

match the shape of $p(x)$, so as to approximate the true variance

match the shape of $|f(x)|p(x)$, so as to further reduce the true variance

The (normalized) estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
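A minimal sketch of self-normalized importance sampling; the target, proposal and seed are arbitrary choices for illustration (the unnormalized target is a N(1,1) density):

import numpy as np

rng = np.random.default_rng(1)
log_p_tilde = lambda x: -0.5 * (x - 1.0) ** 2           # unnormalized target: N(1,1)
N = 100_000

x = 3.0 * rng.standard_normal(N)                        # proposal q = N(0, 3^2): heavy enough tails
log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0 * np.sqrt(2 * np.pi))

logW = log_p_tilde(x) - log_q                           # log importance weights (up to a constant)
W = np.exp(logW - logW.max())                           # subtract max for numerical stability
W /= W.sum()                                            # normalized weights, constants cancel

print("E[x] estimate:", np.sum(W * x))                  # close to the true mean, 1

Because the weights are normalized, the unknown constant of the target cancels, exactly as on the slide.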
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution $q(x)$ should have heavy tails, so as to be insensitive to outliers.
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$. The sampling procedure is:

for n = 1 to N do
  Sample $x \sim q(x)$
  Sample $u \sim U(0, 1)$
  if $u > \frac{p(x)}{Cq(x)}$ then
    Reject and repeat the sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant $C$.

The choice of $C$ is critical:

if $C$ is too small, the samples are not reliable because the bound fails and too few samples are rejected

if $C$ is too large, the sampling is inefficient since the acceptance rate will be low

If the prior $p(x)$ is used as $q(x)$, and the likelihood $p(y|x) \le C$ where $C$ is known, then

$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)$

and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
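A small sketch of the procedure; the target, envelope and constant are chosen only for illustration (the target's support is truncated to [-5, 5], a negligible approximation):

import numpy as np

rng = np.random.default_rng(2)

def rejection_sample(p, q_sample, q_pdf, C, n):
    """Draw n exact samples from p using an envelope C*q(x) >= p(x)."""
    out = []
    while len(out) < n:
        x = q_sample()
        if rng.uniform() <= p(x) / (C * q_pdf(x)):   # accept with probability p/(Cq)
            out.append(x)
    return np.array(out)

# Target p = N(0,1); envelope q = U(-5,5) with density 0.1, and C = 4,
# so C*q = 0.4 dominates max p ~ 0.399.
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
samples = rejection_sample(p, lambda: rng.uniform(-5, 5), lambda x: 0.1, C=4.0, n=1000)
print(samples.mean(), samples.std())   # roughly 0 and 1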
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{0:t-1}, y_{0:t})$

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

$p(x_{0:n}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})$
$q(x_{0:n}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})$

The weights can be recursively calculated as

$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\, \frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
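The recursion translates directly into a per-particle weight update; the sketch below works in the log domain for numerical stability, and log_p_trans / log_q_trans are hypothetical callables for $\log p(x_n|x_{0:n-1})$ and $\log q_n(x_n|x_{0:n-1})$:

import numpy as np

def sis_weight_update(W_prev, x_new, x_prev, log_p_trans, log_q_trans):
    """One step of the SIS recursion: W_n ∝ W_{n-1} p(x_n|·)/q_n(x_n|·)."""
    logW = np.log(W_prev) + log_p_trans(x_new, x_prev) - log_q_trans(x_new, x_prev)
    W = np.exp(logW - logW.max())     # stabilize before exponentiating
    return W / W.sum()                # return normalized weights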
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in an inaccurate estimate.

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.

We will see next that, in order to cope with this situation, a resampling step is used after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights:

Sample N random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for i = 1, ..., N do
  $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
  Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
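A minimal multinomial-resampling sketch (one of several possible schemes; systematic and stratified variants also exist):

import numpy as np

def resample(particles, W, rng):
    """Multinomial resampling: duplicate heavy particles, drop light ones.
    After resampling, all weights are reset to 1/N."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=W)        # indices drawn with prob. proportional to W
    return particles[idx], np.full(N, 1.0 / N)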
Remarks on Sampling Importance Resampling
Resampling can be performed at every step, or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e. $W^{(i)} = 1/N$).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It is a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional densities being sampled are low-dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i \mid x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
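A minimal Gibbs sampler for a bivariate Gaussian with correlation rho, a model chosen for illustration because both conditionals are exact one-dimensional Gaussians:

import numpy as np

rng = np.random.default_rng(3)
rho, M = 0.8, 5000
x1, x2 = 0.0, 0.0
samples = np.empty((M, 2))
for n in range(M):
    # Alternate draws from the 1-D conditionals of N(0, [[1, rho], [rho, 1]])
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # p(x2 | x1)
    samples[n] = x1, x2

print(np.corrcoef(samples.T))   # empirical correlation approaches rho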
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be tackled with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$\int f(x)\, dx \approx (2\pi)^{N_x/2} f(\hat{x})\, \big|-\nabla\nabla \log f(\hat{x})\big|^{-1/2}$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation result for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
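A sketch of the Laplace estimate of the volume, assuming the mode has already been found; the finite-difference Hessian is my own illustrative shortcut for the analytic one:

import numpy as np

def laplace_integral(log_f, x_hat, eps=1e-4):
    """Approximate ∫ f(x) dx ≈ (2π)^{d/2} f(x̂) |-∇∇ log f(x̂)|^{-1/2}."""
    d = len(x_hat)
    H = np.zeros((d, d))                        # Hessian of log f at the mode
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (log_f(x_hat + ei + ej) - log_f(x_hat + ei - ej)
                       - log_f(x_hat - ei + ej) + log_f(x_hat - ei - ej)) / (4 * eps**2)
    cov = np.linalg.inv(-H)                     # covariance of the fitted Gaussian
    Z = np.exp(log_f(x_hat)) * np.sqrt((2 * np.pi)**d * np.linalg.det(cov))
    return Z, cov

# Sanity check: for the unnormalized N(0,1) density, the integral is sqrt(2π) ≈ 2.507.
Z, cov = laplace_integral(lambda x: -0.5 * x @ x, np.zeros(1))
print(Z)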
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k f(x_k)$

where $p(x)$ is treated as a weighting function and the $x_k$ are the quadrature points.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
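For a Gaussian weighting function this is Gauss-Hermite quadrature, available directly in NumPy; the change of variables x = μ + √2 σt maps the Hermite nodes to N(μ, σ²) (the integrand is an arbitrary example):

import numpy as np

# m-point Gauss-Hermite rule: exact for polynomial integrands up to degree 2m-1.
mu, sigma, m = 1.0, 2.0, 10
nodes, weights = np.polynomial.hermite.hermgauss(m)

f = lambda x: x**2
est = np.sum(weights * f(mu + np.sqrt(2.0) * sigma * nodes)) / np.sqrt(np.pi)
print(est)   # E[x^2] for x ~ N(1, 4) is mu^2 + sigma^2 = 5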
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so greater resolution cannot be given to regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$m_k = E[x^k] = \int_X x^k p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N \big(x^{(i)}\big)^k$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$p(x) = \sum_{j=1}^m c_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use them!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, and the particles are placed according to some probability measure: the higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with the evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$p(x_n|y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{0:n})$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$.

By this approximation, we can estimate the mean of a nonlinear function:

$E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{0:n})\, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta\big(x_n - x_n^{(i)}\big)\, dx_n = \frac{1}{N} \sum_{i=1}^N f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define

$W_n(x_n) = \frac{p(y_{0:n}|x_n)\, p(x_n)}{q(x_n|y_{0:n})}$

We can then write

$E[f(x_n)] = \int f(x_n)\, \frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\, q(x_n|y_{0:n})\, dx_n$
$= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{0:n})}\, q(x_n|y_{0:n})\, dx_n$
$= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int p(y_{0:n}|x_n)\, p(x_n)\, dx_n}$
$= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}{\int W_n(x_n)\, q(x_n|y_{0:n})\, dx_n}$
$= \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate

$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^N \tilde{W}\big(x_n^{(i)}\big)\, f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$

where the normalized weights are defined as

$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{j=1}^N W_n\big(x_n^{(j)}\big)}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{0:t-1}, y_{0:t})$

As before, the posterior can be written as

$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$

We can then derive a recursive rule to update the weights:

$W_n^{(i)} = \frac{p\big(x_{0:n}^{(i)}|y_{0:n}\big)}{q\big(x_{0:n}^{(i)}|y_{0:n}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\, p\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)}{q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n}\big)\, q\big(x_{0:n-1}^{(i)}|y_{0:n-1}\big)} = W_{n-1}^{(i)}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n}\big)}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.

Let's assume that $q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n}\big) = q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n\big)$; then we can write

$W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n\big)}$

The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}\big[(W(x_{0:n}))^2\big]} \le N$

In practice this cannot be computed, so we approximate it as

$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^N \big(\tilde{W}(x_{0:n}^{(i)})\big)^2}$

When $N_{\mathrm{eff}}$ falls below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
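The approximation above is essentially one line of code (W is assumed to hold the particle weights):

import numpy as np

def effective_sample_size(W):
    """N_eff ≈ 1 / Σ W_i²: equals N for uniform weights, and tends to 1
    when a single particle carries all the weight."""
    W = np.asarray(W) / np.sum(W)     # make sure the weights are normalized
    return 1.0 / np.sum(W**2)

# e.g. trigger resampling when effective_sample_size(W) < N / 2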
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q\big(x_n \mid x_{0:n-1}^{(i)}, y_{0:n}\big)$
    Set $x_{0:n}^{(i)} = \big\{x_{0:n-1}^{(i)}, x_n^{(i)}\big\}$
  end for
  for i = 1, ..., N do
    Calculate the weights $W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$
  end for
  Compute $\hat{N}_{\mathrm{eff}} = 1 / \sum_{i=1}^N \big(\tilde{W}(x_{0:n}^{(i)})\big)^2$
  if $\hat{N}_{\mathrm{eff}} < P$ then
    Generate new $x_{0:n}^{(j)}$ by resampling with replacement N times from $\big\{x_{0:n}^{(i)}\big\}$, with probability $P\big(x_{0:n}^{(j)} = x_{0:n}^{(i)}\big) = \tilde{W}_n^{(i)}$
    Reset the weights: $W_n^{(i)} = 1/N$
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the corresponding particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\hat{x}_n^{(i)} \sim p\big(x_n \mid x_{n-1}^{(i)}\big)$
    Set $\hat{x}_{0:n}^{(i)} = \big\{x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}\big\}$
  end for
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p\big(y_n \mid \hat{x}_n^{(i)}\big)$
  end for
  for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\big\{\hat{x}_n^{(i)}\big\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
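A self-contained sketch of this bootstrap/SIR filter for a 1-D model; the random-walk model, noise levels and helper names are illustrative choices, not from the source:

import numpy as np

def bootstrap_pf(Y, sample_x0, sample_trans, log_lik, N, rng):
    """Bootstrap/SIR filter: propose from the transition prior p(x_n|x_{n-1}),
    weight by the likelihood p(y_n|x_n), resample at every step."""
    x = sample_x0(N, rng)                        # initial particle set
    means = []
    for y in Y:
        x = sample_trans(x, rng)                 # importance sampling from the prior
        logW = log_lik(y, x)                     # weight update: W ∝ p(y_n|x_n)
        W = np.exp(logW - logW.max()); W /= W.sum()
        means.append(np.sum(W * x))              # posterior-mean estimate, before resampling
        x = x[rng.choice(N, size=N, p=W)]        # resample with replacement
    return np.array(means)

# Example: random walk observed in Gaussian noise.
rng = np.random.default_rng(4)
T, N = 50, 500
x_true = np.cumsum(rng.normal(0.0, 0.5, T))
Y = x_true + rng.normal(0.0, 1.0, T)
est = bootstrap_pf(
    Y,
    sample_x0=lambda n, r: r.normal(0.0, 1.0, n),
    sample_trans=lambda x, r: x + r.normal(0.0, 0.5, len(x)),
    log_lik=lambda y, x: -0.5 * (y - x) ** 2,
    N=N, rng=rng)
print(np.mean(np.abs(est - x_true)))             # tracking error of the filter

Note that the posterior mean is computed before resampling, in line with the remark on the next slide that estimates should be calculated before the resampling stage.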
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies exist.

To alleviate sample degeneracy in the SIS filter, we can use

$W_n = \big(W_{n-1}\big)^{\alpha}\, \frac{p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n\big)}$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
The acceptance probability of a random variable is inversely proportional to the constant C.

The choice of C is critical:
  if C is too small, the samples are not reliable because of the low rejection rate (the bound may fail);
  if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$

and the acceptance rate for sample x is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get samples when the ratio p(x)/(Cq(x)) is close to zero.
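The following minimal Python sketch implements the loop above for a hypothetical unnormalized target with a standard normal proposal; the target and the bound C are illustrative assumptions (C was checked to satisfy p(x) ≤ Cq(x) for this particular pair):

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):                      # hypothetical target, p(x) ∝ exp(-x^4/4)
    return np.exp(-x**4 / 4.0)

def q_pdf(x):                         # proposal q = N(0, 1)
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)

C = 3.3                               # satisfies p_unnorm(x) <= C q(x) for all x here

samples = []
while len(samples) < 10_000:
    x = rng.normal()                  # sample x ~ q
    u = rng.uniform()                 # sample u ~ U(0, 1)
    if u <= p_unnorm(x) / (C * q_pdf(x)):   # accept with probability p/(Cq)
        samples.append(x)
print(np.mean(samples), np.var(samples))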
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

$p(x_{n:0}) = p(x_0)p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1})$
$q(x_{n:0}) = q_0(x_0)q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1})$

The weights can be recursively calculated as

$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$
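In code, this recursion is a one-line multiplicative update per time step; the sketch below assumes user-supplied callables for the factorized transition and proposal densities (all names are hypothetical):

import numpy as np

def sis_weight_update(weights, x_new, x_prev, trans_pdf, prop_pdf):
    # W_n = W_{n-1} * p(x_n | x_{n-1:0}) / q_n(x_n | x_{n-1:0})
    return weights * trans_pdf(x_new, x_prev) / prop_pdf(x_new, x_prev)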
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many iid replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, to cope with this situation, a resampling step is suggested after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights.

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
    W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^{N} W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling each x^{(i)} is proportional to W̃(x^{(i)})
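A minimal Python sketch of the resampling step (multinomial resampling, one of several possible schemes):

import numpy as np

def multinomial_resample(particles, weights, rng):
    N = len(particles)
    w = weights / weights.sum()                 # normalize the weights
    idx = rng.choice(N, size=N, p=w)            # draw indices with prob ∝ weights
    return particles[idx], np.full(N, 1.0 / N)  # equal weights after resampling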
Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:
  Deterministic resampling is taken at every k-th time step (usually k = 1).
  Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because
  if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
  when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
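For instance, a two-dimensional Gaussian with correlation ρ has one-dimensional Gaussian full conditionals, so the inner loop above is trivial; a minimal Python sketch (the model is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)

rho, M = 0.8, 5000
x = np.zeros(2)
samples = np.empty((M, 2))
for n in range(M):
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))
    # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))
    samples[n] = x
print(np.corrcoef(samples.T))   # off-diagonal ≈ rho after burn-in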
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is useful since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)dx$ by fitting a Gaussian at the maximum $\hat{x}$ of f(x), and further computes the volume:

$\int f(x)dx \approx (2\pi)^{N_x/2} f(\hat{x})\,|-\nabla\nabla\log f(\hat{x})|^{-1/2}$

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$\int_a^b f(x)p(x)dx \approx \sum_{k=1}^{m} c_k f(x_k)$

where p(x) is treated as a weighting function and x_k is the quadrature point.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
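For example, with a Gaussian weighting function the quadrature points and coefficients are given by Gauss-Hermite rules; a small Python sketch (the integrand is an illustrative assumption):

import numpy as np
from numpy.polynomial.hermite import hermgauss

# hermgauss gives nodes/weights for ∫ f(x) exp(-x^2) dx, so rescale for N(0, 1)
nodes, weights = hermgauss(10)          # m = 10 quadrature points
f = np.cos                              # a smooth test integrand
approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(approx, np.exp(-0.5))             # exact: E[cos(x)] = e^{-1/2} for x ~ N(0,1)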
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation

Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

$m_k = E[x^k] = \int_X x^k p(x)dx \approx \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
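A tiny Python sketch of the sample-moment approximation, assuming (for illustration only) samples from a standard normal:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)           # samples from the "true" distribution
for k in (1, 2, 3, 4):
    print(k, np.mean(np.abs(x)**k))    # sample approximation of m_k = E[|x|^k]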
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$\hat{p}(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\mu_j,\Sigma_j)$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$

where the x_n^{(i)} are assumed to be iid samples drawn from p(x_n|y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function:

$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$
Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

$W_n(x_n) = \frac{p(y_{n:0}|x_n)p(x_n)}{q(x_n|y_{n:0})}$

We can then write

$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}q(x_n|y_{n:0})\,dx_n = \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})}q(x_n|y_{n:0})\,dx_n$

$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)p(x_n)\,dx_n} = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$
Sequential Monte Carlo estimation III

We have written

$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$

By drawing iid samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)$

where the normalized weights are defined as

$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}$
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$

As before, the posterior can be written as

$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$

We can then create a recursive rule to update the weights:

$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N$

In practice this cannot be computed, so we approximate it by

$\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{n:0}))^2}$

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.
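The approximation is essentially one line of code (a minimal sketch, for normalized or unnormalized weights):

import numpy as np

def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalized weights
    return 1.0 / np.sum(w**2)    # N_eff ≈ 1 / Σ_i W̃_i^2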
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^{N} W(x^{(i)})
    end for
    Compute N_eff = 1 / Σ_{i=1}^{N} (W̃(x_{n:0}))^2
    if N_eff < P then
        Generate new x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W̃_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
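A compact runnable Python sketch of this loop for a hypothetical 1-D model, using the transition prior as proposal (so the weight update reduces to multiplying by the likelihood); the model and all parameter values are illustrative assumptions, not from the lecture:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: x_n = 0.9 x_{n-1} + N(0, 0.5^2),  y_n = x_n + N(0, 1)
def transition_sample(x):
    return 0.9 * x + rng.normal(0.0, 0.5, size=x.shape)

def likelihood(y, x):
    return np.exp(-0.5 * (y - x) ** 2)      # N(y; x, 1), constants cancel

def sis_filter(ys, N=1000, threshold=0.5):
    x = rng.normal(0.0, 1.0, size=N)        # particles drawn from the prior
    w = np.full(N, 1.0 / N)
    means = []
    for y in ys:
        x = transition_sample(x)            # proposal = transition prior
        w = w * likelihood(y, x)            # weight update W_n = W_{n-1} p(y_n|x_n)
        w = w / w.sum()                     # normalize
        means.append(np.sum(w * x))         # posterior mean, computed before resampling
        n_eff = 1.0 / np.sum(w ** 2)        # effective sample size
        if n_eff < threshold * N:           # resample only when degenerate
            idx = rng.choice(N, size=N, p=w)
            x, w = x[idx], np.full(N, 1.0 / N)
    return np.array(means)

# Simulate data from the same model and run the filter
T = 50
x_true, ys = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, 0.5)
    ys[n] = x_true[n] + rng.normal(0.0, 1.0)
print(sis_filter(ys)[:5])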
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    end for
    Set x̂_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x̂_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x̂_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^{(i)}) = W(x^{(i)}) / Σ_{i=1}^{N} W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x̂_n^{(i)}} according to W̃_n^{(i)}
end for
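The same loop, with resampling at every step and the weights implicitly reset to 1/N, gives the bootstrap/SIR filter; a generic Python sketch reusing the interfaces of the previous example (all callables are assumptions supplied by the user):

import numpy as np

def bootstrap_filter(ys, sample_prior, transition_sample, likelihood, N, rng):
    x = sample_prior(N)
    means = []
    for y in ys:
        x = transition_sample(x)            # importance sampling from the prior
        w = likelihood(y, x)                # weight update is just the likelihood
        w = w / w.sum()                     # normalize
        means.append(np.sum(w * x))
        idx = rng.choice(N, size=N, p=w)    # resample at every step
        x = x[idx]                          # weights implicitly reset to 1/N
    return np.array(means)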
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular: CONDENSATION

Figure: CONDENSATION
Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error

$E[\|x_n - \hat{x}_n\|_2^2\,|\,y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\,p(x_n|y_{n:0})\,dx_n$

which is aimed at finding the conditional mean

$\hat{x}_n = E[x_n|y_{n:0}] = \int x_n\,p(x_n|y_{n:0})\,dx_n$

Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{n:0}).

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: finds the median of the posterior p(x_n|y_{n:0}).
Criteria for optimal filtering II

MMSE finds the mean.

MAP finds the mode.

Minimax finds the median.

Figure: (left) Three optimal criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03).
Criteria for optimal filtering III

Minimum conditional inaccuracy: defined as

$E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x,y)\log\frac{1}{\hat{p}(x|y)}\,dx\,dy$

Minimum conditional KL divergence:

$KL(p\|\hat{p}) = \int p(x,y)\log\frac{p(x,y)}{\hat{p}(x|y)p(x)}\,dx\,dy$

where the KL divergence measures the discrepancy between two distributions: KL(p‖p̂) ≥ 0, with equality only when the distributions are the same.
Criteria for optimal filtering IV

Minimum free energy: a lower bound on the maximum log-likelihood, which is aimed to be minimized:

$F(Q;P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\left[\log\frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) - H(Q)$

This minimization can be done using the expectation-maximization (EM) algorithm:

$Q(x_{n+1}) \leftarrow \arg\max_Q F(Q,P), \qquad x_{n+1} \leftarrow \arg\max_x F(Q,P)$
Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally cheaper.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are very related.
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

$E[\|x_n - \hat{x}_n\|_2^2\,|\,y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\,p(x_n|y_{n:0})\,dx_n$

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited

In practice we are interested in the discrete simplified case:

$x_{n+1} = f(x_n, w_n)$
$y_n = g(x_n, v_n)$

When the dynamic system is linear Gaussian, this reduces to

$x_{n+1} = F_{n+1,n}x_n + w_n$
$y_n = G_n x_n + v_n$

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case this is not tractable, and we will rely on approximations.
Kalman filter: Forward equations I

We start by defining the messages

$\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)$

Using the HMM recursion formulas for continuous variables, we have

$c_n\alpha(z_n) = p(x_n|z_n)\int \alpha(z_{n-1})\,p(z_n|z_{n-1})\,dz_{n-1}$

Substituting the conditionals, we have

$c_n\mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|Cz_n, \Sigma)\int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\,\mathcal{N}(z_n|Az_{n-1}, \Gamma)\,dz_{n-1} = \mathcal{N}(x_n|Cz_n, \Sigma)\,\mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})$

Here we assume that μ_{n-1} and V_{n-1} are known, and we have defined

$P_{n-1} = AV_{n-1}A^T + \Gamma$
Kalman filter: Forward equations II

Given the values of μ_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean μ_n and covariance V_n, as well as the normalization coefficient c_n:

$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$
$V_n = (I - K_nC)P_{n-1}$
$c_n = \mathcal{N}(x_n|CA\mu_{n-1},\, CP_{n-1}C^T + \Sigma)$

where the Kalman gain matrix is defined as

$K_n = P_{n-1}C^T(CP_{n-1}C^T + \Sigma)^{-1}$

The initial conditions are given by

$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \quad V_1 = (I - K_1C)V_0$
$c_1 = \mathcal{N}(x_1|C\mu_0,\, CV_0C^T + \Sigma), \quad K_1 = V_0C^T(CV_0C^T + \Sigma)^{-1}$

The interpretation is making a prediction and then correcting it with the gain K_n.

The likelihood can be computed as $p(X) = \prod_{n=1}^{N} c_n$.
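These equations translate directly into code; a minimal Python sketch of the forward recursion in the slide's notation (A, C, Γ, Σ are the model matrices, X the observation sequence):

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    T, d = len(X), len(mu0)
    mus, Vs = np.zeros((T, d)), np.zeros((T, d, d))
    # Initial conditions
    K = V0 @ C.T @ np.linalg.inv(C @ V0 @ C.T + Sigma)
    mus[0] = mu0 + K @ (X[0] - C @ mu0)
    Vs[0] = (np.eye(d) - K @ C) @ V0
    for n in range(1, T):
        P = A @ Vs[n - 1] @ A.T + Gamma                    # P_{n-1}
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)   # Kalman gain K_n
        pred = A @ mus[n - 1]                              # prediction
        mus[n] = pred + K @ (X[n] - C @ pred)              # correction
        Vs[n] = (np.eye(d) - K @ C) @ P
    return mus, Vs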
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF), or numerical approximations (e.g., particle filter) are usually used.
Extended Kalman filter (EKF)

Recall the equations of motion:

$x_{n+1} = f(x_n, w_n)$
$y_n = g(x_n, v_n)$

These equations are linearized in the EKF:

$F_{n+1,n} = \left.\frac{df(x)}{dx}\right|_{x=\hat{x}_n} \qquad G_{n+1,n} = \left.\frac{dg(x)}{dx}\right|_{x=\hat{x}_{n|n-1}}$

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
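A scalar sketch of one EKF predict/correct step for a hypothetical nonlinear model, with the Jacobians computed as above (the model, noise levels and function names are illustrative assumptions):

import numpy as np

# Hypothetical model: x_{n+1} = sin(x_n) + w_n,  y_n = x_n^2 + v_n
f, g = np.sin, lambda x: x**2
df, dg = np.cos, lambda x: 2 * x     # Jacobians of f and g

def ekf_step(mu, P, y, Q=0.1, R=0.5):
    mu_pred = f(mu)                  # nonlinear prediction
    F = df(mu)                       # linearize the dynamics at the estimate
    P_pred = F * P * F + Q
    G = dg(mu_pred)                  # linearize the observation model
    K = P_pred * G / (G * P_pred * G + R)       # Kalman gain
    return mu_pred + K * (y - g(mu_pred)), (1 - K * G) * P_pred

print(ekf_step(0.5, 1.0, 0.3))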
Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Monte Carlo sampling

It's a brute-force technique, provided that one can draw iid samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), so that

$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N$

for which $E[\hat{f}_N] = E[f]$ and $\mathrm{Var}[\hat{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), f̂_N(x) converges to E[f(x)] with high probability.

The convergence rate is assessed by the Central Limit Theorem:

$\sqrt{N}(\hat{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)$

where σ² is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
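A tiny Python sketch showing the O(N^{-1/2}) behavior on a toy integrand (an illustrative assumption, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
# Estimate E[x^2] = 1 for x ~ N(0, 1); the error shrinks like O(N^{-1/2})
for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    print(N, np.mean(x**2))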
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N$

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\,dP(x)$?
Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^m c_k f(x_k)
where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
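Gauss-Hermite quadrature is one such rule, with a Gaussian as the weighting function; the test integrand below is an arbitrary smooth choice for illustration:

import numpy as np

# Gauss-Hermite quadrature: \int f(x) N(x; 0, 1) dx ~ sum_k c_k f(x_k),
# where the points x_k and weights c_k are fixed by the Gaussian weighting function.
nodes, weights = np.polynomial.hermite_e.hermegauss(10)  # probabilists' Hermite rule
c = weights / np.sqrt(2 * np.pi)                          # normalize for N(0, 1)

f = lambda x: np.cos(x)                                   # a smooth test integrand
approx = np.sum(c * f(nodes))
print(approx, np.exp(-0.5))   # E[cos(x)] = e^{-1/2} for x ~ N(0, 1)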
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so resolution cannot be concentrated on the regions of high density.
The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
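A minimal point-mass sketch of one Bayes update on a 1-D grid (the toy prior, Gaussian likelihood, and grid resolution are all assumptions):

import numpy as np

# Point-mass (grid) Bayes update: the filtered density is a vector of masses
# on a rectangular grid of cell centers.
x = np.linspace(-5, 5, 201)                 # grid cell centers
prior = np.exp(-0.5 * x**2)                 # p(x_n | y_{0:n-1}) on the grid
prior /= prior.sum()

y = 1.2                                     # new observation, y = x + noise
lik = np.exp(-0.5 * (y - x) ** 2 / 0.5**2)  # p(y_n | x_n) at each cell

posterior = prior * lik                     # pointwise Bayes rule
posterior /= posterior.sum()                # renormalize the point masses
print(x[np.argmax(posterior)])              # MAP cell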
Moment Approximation
Moment approximation targets approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N |x^{(i)}|^k
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
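For example, the sample moments of draws from a standard normal (a toy illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)            # samples from the "true" distribution

# Sample moments (1/N) sum_i |x^(i)|^k approximating E[|x|^k].
for k in (1, 2, 3, 4):
    print(k, np.mean(np.abs(x) ** k))
# For N(0,1): E[|x|] ~ 0.798, E[x^2] = 1, E[|x|^3] ~ 1.596, E[x^4] = 3.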
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^m c_j\,\mathcal{N}(x; \mu_j, \Sigma_j)
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
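A small sketch of evaluating such a 1-D Gaussian-sum density (the mixture parameters are arbitrary choices):

import numpy as np

# Gaussian-sum (mixture) density p(x) = sum_j c_j N(x; mu_j, s_j^2) in 1-D;
# the weights c_j are positive and sum to one.
c = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
s = np.array([0.5, 1.0, 0.8])

def p(x):
    comps = np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return np.sum(c * comps)

print(p(0.0))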
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many regions, which are filled with particles according to some probability measure; the higher the probability, the denser the particle concentration.
The particle system evolves over time according to the state equation, with the evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})
where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n|y_{0:n}).
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{0:n}). Let's define
W_n(x_n) = \frac{p(y_{0:n}|x_n)p(x_n)}{q(x_n|y_{0:n})}
We can then write
E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}q(x_n|y_{0:n})\,dx_n
          = \int f(x_n)\frac{W_n(x_n)}{p(y_{0:n})}q(x_n|y_{0:n})\,dx_n
          = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)p(x_n)\,dx_n}
          = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)q(x_n|y_{0:n})\,dx_n}
          = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}
By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)
where the normalized weights are defined as
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
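In code, this self-normalized estimate is a one-liner (a generic numpy sketch; f_vals and W are assumed given):

import numpy as np

# Self-normalized estimate of E[f(x_n)] from samples x^(i) ~ q and
# unnormalized weights W^(i) = W_n(x^(i)).
def snis_estimate(f_vals, W):
    W_tilde = W / W.sum()              # normalized weights
    return np.sum(W_tilde * f_vals)

print(snis_estimate(np.array([1.0, 2.0, 3.0]), np.array([0.1, 0.5, 0.4])))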
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^n q(x_t|x_{0:t-1}, y_{0:t})
As before, the posterior can be written as
p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\,\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
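A sketch of one such recursive weight update, with the likelihood, transition, and proposal densities passed in as vectorized callables (all names here are hypothetical):

import numpy as np

# One step of the recursive weight update:
#   W_n ~ W_{n-1} * p(y_n|x_n) * p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
def update_weights(W_prev, x_new, x_prev, y, lik, trans, prop):
    W = W_prev * lik(y, x_new) * trans(x_new, x_prev) / prop(x_new, x_prev, y)
    return W / W.sum()                 # normalize after each step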
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).
Let's assume that q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N
In practice this cannot be computed, so we approximate it as
\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}
When \hat{N}_{eff} is below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
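The effective-sample-size approximation from normalized weights, as a small sketch (the example weights are arbitrary):

import numpy as np

# Effective sample size from normalized weights; resample when it falls
# below a chosen threshold P.
def effective_sample_size(W_tilde):
    return 1.0 / np.sum(W_tilde ** 2)

W_tilde = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(effective_sample_size(W_tilde))   # ~1.9 out of N = 5: heavily skewed weights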
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw sample x_n^{(i)} ~ q(x_n|x_{0:n-1}^{(i)}, y_{0:n})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}^{(i)} = W^{(i)} / \sum_{j=1}^N W^{(j)}
  end for
  Compute \hat{N}_{eff} = 1 / \sum_{i=1}^N (\tilde{W}^{(i)})^2
  if \hat{N}_{eff} < P then
    Generate new particles x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the corresponding particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}^{(i)} = W^{(i)} / \sum_{j=1}^N W^{(j)}
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
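Putting the pieces together, here is a minimal runnable bootstrap filter for a toy 1-D linear-Gaussian model (the model, seed, and sizes are illustrative assumptions, not part of the lecture):

import numpy as np

# Bootstrap (SIR) particle filter for the toy model
#   x_n = 0.9 x_{n-1} + w_n,  y_n = x_n + v_n,  w, v ~ N(0, 1),
# using the transition prior as proposal, so W_n^(i) = p(y_n | x_n^(i)).
rng = np.random.default_rng(0)
T, N = 50, 500

# simulate a ground-truth trajectory and observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.standard_normal()
y = x_true + rng.standard_normal(T)

particles = rng.standard_normal(N)           # x_0^(i) ~ p(x_0)
estimates = []
for n in range(T):
    # importance sampling from the transition prior
    particles = 0.9 * particles + rng.standard_normal(N)
    # weight update: likelihood of the new observation
    W = np.exp(-0.5 * (y[n] - particles) ** 2)
    W /= W.sum()
    estimates.append(np.sum(W * particles))  # MMSE estimate before resampling
    # resampling with replacement according to the normalized weights
    particles = rng.choice(particles, size=N, p=W)

print(np.mean((np.array(estimates) - x_true) ** 2))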
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; hence the SIS filter is less computationally expensive.
The choice of proposal distribution plays a crucial role in the final performance of SIS and SIR filters.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to
W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}
where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Criteria for optimal filtering I
An optimal filter is "optimal" under a particular criterion.
Minimum mean-squared error (MMSE): defined in terms of the prediction or filtering error
E[\|x_n - \hat{x}_n\|_2^2\,|\,y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2\,p(x_n|y_{0:n})\,dx_n
which aims to find the conditional mean
\hat{x}_n = E[x_n|y_{0:n}] = \int x_n\,p(x_n|y_{0:n})\,dx_n
Maximum a posteriori (MAP): aims to find the mode of the posterior probability p(x_n|y_{0:n}).
Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.
Minimax: aims to find the median of the posterior p(x_n|y_{0:n}).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
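As a sample-based illustration of the three point estimates (the skewed Gamma "posterior" and the histogram-based mode are toy assumptions):

import numpy as np

# MMSE = sample mean, minimax = sample median, MAP = highest-density sample
# (here located with a crude histogram of the samples).
rng = np.random.default_rng(0)
samples = rng.gamma(2.0, 1.0, 100_000)       # a skewed "posterior"

mmse = samples.mean()                         # conditional mean
minimax = np.median(samples)                  # posterior median
counts, edges = np.histogram(samples, bins=200)
map_est = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(mmse, minimax, map_est)                 # ~2.0, ~1.68, ~1.0 for Gamma(2, 1)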
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for the multimodal distribution (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is "optimal" under a particular criterion.
Minimum conditional inaccuracy: defined as
E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x,y)\log\frac{1}{\hat{p}(x|y)}\,dx\,dy
Minimum conditional KL divergence:
KL(p\|\hat{p}) = \int p(x,y)\log\frac{p(x,y)}{\hat{p}(x|y)p(x)}\,dx\,dy
where the KL divergence is a measure of discrepancy between distributions, with KL(p\|\hat{p}) \ge 0; it is 0 only when the distributions are the same.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is "optimal" under a particular criterion.
Minimum free energy: its negative is a lower bound on the log-likelihood, and the free energy itself is minimized:
F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\left[\log\frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) + H(Q)
This minimization can be done with an expectation-maximization (EM)-style alternating scheme:
Q^{(n+1)} \leftarrow \arg\min_Q F(Q; P), \qquad x^{(n+1)} \leftarrow \arg\min_x F(Q; P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation.
MMSE requires the computation of the prior, likelihood, and evidence.
MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally cheaper.
The MAP estimate has a drawback, especially in high-dimensional spaces: high probability density does not imply high probability mass.
A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.
Hence the width of the mode is more important than its height in the high-dimensional case.
The last three criteria are all ML-oriented, and they are closely related.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:
E[\|x_n - \hat{x}_n\|_2^2\,|\,y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2\,p(x_n|y_{0:n})\,dx_n
Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.
As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).
In general, we can only seek a suboptimal or locally optimal solution.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice, we are interested in the discrete simplified case:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
When the dynamic system is linear Gaussian, this reduces to
x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n
with F_{n+1,n} the transition matrix and G_n the measurement matrix.
This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.
In the general case it is not tractable, and we will rely on approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)
Using the HMM recursion formulas for continuous variables, we have
c_n\,\alpha(z_n) = p(x_n|z_n)\int \alpha(z_{n-1})\,p(z_n|z_{n-1})\,dz_{n-1}
Substituting the conditionals, we have
c_n\,\mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|Cz_n, \Sigma)\int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\,\mathcal{N}(z_n|Az_{n-1}, \Gamma)\,dz_{n-1} = \mathcal{N}(x_n|Cz_n, \Sigma)\,\mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})
Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined
P_{n-1} = AV_{n-1}A^T + \Gamma
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of \mu_{n-1}, V_{n-1}, and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:
\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})
V_n = (I - K_nC)P_{n-1}
c_n = \mathcal{N}(x_n|CA\mu_{n-1}, CP_{n-1}C^T + \Sigma)
where the Kalman gain matrix is defined as
K_n = P_{n-1}C^T(CP_{n-1}C^T + \Sigma)^{-1}
The initial conditions are given by
\mu_1 = \mu_0 + K_1(x_1 - C\mu_0),  V_1 = (I - K_1C)V_0
c_1 = \mathcal{N}(x_1|C\mu_0, CV_0C^T + \Sigma),  K_1 = V_0C^T(CV_0C^T + \Sigma)^{-1}
The interpretation is making a prediction and then correcting it with gain K_n.
The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
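A compact sketch of one forward step in this notation (matrix shapes and names are assumed; this follows the equations above directly):

import numpy as np

# One Kalman forward step: propagate (mu, V), compute the gain K_n, and
# correct with the new observation x_n.
def kalman_step(mu, V, x_n, A, C, Gamma, Sigma):
    P = A @ V @ A.T + Gamma                        # P_{n-1}
    S = C @ P @ C.T + Sigma                        # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)                 # Kalman gain K_n
    mu_new = A @ mu + K @ (x_n - C @ (A @ mu))     # prediction + correction
    V_new = (np.eye(len(mu)) - K @ C) @ P
    return mu_new, V_new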
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
Nonlinear filtering consists in finding p(x|y_{0:n}).
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF), or numerical approximations (e.g., particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
These equations are linearized in the EKF:
F_{n+1,n} = \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}_n}, \qquad G_n = \left.\frac{\partial g(x)}{\partial x}\right|_{x=\hat{x}_{n|n-1}}
Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).
A more general solution is to rely on numerical approximations.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
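A sketch of the linearization step using numerical Jacobians (the toy dynamics f is an assumption; the resulting F would be plugged into the standard Kalman recursions):

import numpy as np

# Numerical Jacobian of a vector function, evaluated at the current estimate.
def jacobian(func, x, eps=1e-6):
    fx = func(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - fx) / eps
    return J

f = lambda x: np.array([np.sin(x[0]) + x[1], 0.5 * x[1]])   # toy dynamics
x_hat = np.array([0.3, 1.0])
F = jacobian(f, x_hat)        # F_{n+1,n} at the current state estimate
print(F)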
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique: provided one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x),
\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f(x^{(i)}) = \hat{f}_N
for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N}Var[f] = \frac{\sigma^2}{N}.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.
The convergence rate is assessed by the Central Limit Theorem:
\sqrt{N}(\hat{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)
where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
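A quick numerical check of the O(N^{-1/2}) rate, independent of dimension (a toy sketch in d = 10 dimensions):

import numpy as np

# Estimate E[||x||^2] = d for x ~ N(0, I_d) at several sample sizes.
rng = np.random.default_rng(0)
d = 10
for N in (100, 10_000, 1_000_000):
    x = rng.standard_normal((N, d))
    est = np.mean(np.sum(x**2, axis=1))
    print(N, abs(est - d))    # the error shrinks roughly like 1/sqrt(N)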
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^N f(x^{(i)}) = \hat{f}_N
There are two fundamental problems:
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = \int f(x)\,dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties in the model.
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample from:
\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx
Figure: Importance sampling (Chen 03).
Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
\hat{f} = \frac{1}{N}\sum_{i=1}^N W(x^{(i)})f(x^{(i)})
where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights are normalized, we use
\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})f(x^{(i)}) \quad\text{with}\quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}
The variance of the estimate is given by
Var[\hat{f}] = \frac{1}{N}Var[f(x)W(x)] = \frac{1}{N}Var\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\left[\int \frac{(f(x)p(x))^2}{q(x)}\,dx - (E[f(x)])^2\right]
The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|p(x), so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
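A small sketch of how the proposal choice affects the estimate (the Gaussian target and proposals are toy assumptions):

import numpy as np

# Estimate E_p[x] for p = N(0, 1) with a well-matched and a badly-matched
# Gaussian proposal q = N(0, scale^2); constants cancel in the weight ratio.
rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2

def is_estimate(scale, N=10_000):
    x = scale * rng.standard_normal(N)
    log_q = -0.5 * (x / scale) ** 2 - np.log(scale)
    W = np.exp(log_p(x) - log_q)                    # unnormalized weights
    return np.sum(W * x) / np.sum(W)

print(is_estimate(1.5))   # heavier-tailed proposal: stable estimate near 0
print(is_estimate(0.3))   # too-narrow proposal: very uneven weights, unreliable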
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant C < \infty such that p(x) < Cq(x) for every x \in X. The sampling procedure is:
for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x)/(Cq(x)) then
    Repeat sampling
  end if
end for
Figure: Importance (left) and rejection (right) sampling (Chen 03).
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical: if C is too small, the samples are not reliable because the bound is violated; if C is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then
p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)
and the acceptance rate for sample x is \frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
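A runnable sketch with a known bound: target half-normal, proposal Exp(1), and C = sup_x p(x)/q(x) (a textbook pairing chosen for illustration, not from the slides):

import numpy as np

# Rejection sampler: p = N(0,1) truncated to [0, inf), q = Exp(1),
# with p(x) <= C q(x) for all x >= 0.
rng = np.random.default_rng(0)
p = lambda x: np.sqrt(2 / np.pi) * np.exp(-0.5 * x**2)   # half-normal density
q = lambda x: np.exp(-x)                                  # Exp(1) density
C = np.sqrt(2 * np.exp(1) / np.pi)                        # sup_x p(x)/q(x), at x = 1

samples = []
while len(samples) < 10_000:
    x = rng.exponential(1.0)
    if rng.uniform() <= p(x) / (C * q(x)):
        samples.append(x)                                 # accepted draws are exact
print(np.mean(samples))   # ~ sqrt(2/pi) ~ 0.798, the half-normal mean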
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get samples when the ratio p(x)/(Cq(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^n q(x_t|x_{0:t-1}, y_{0:t})
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
p(x_{0:n}) = p(x_0)p(x_1|x_0)\cdots p(x_n|x_0, \ldots, x_{n-1})
q(x_{0:n}) = q_0(x_0)q_1(x_1|x_0)\cdots q_n(x_n|x_0, \ldots, x_{n-1})
The weights can be recursively calculated as
W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\,\frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.
We will now see that, to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights:
Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)})/q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
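The resampling step itself, as a small numpy sketch (multinomial resampling; names are illustrative):

import numpy as np

# Multinomial resampling: draw N indices with replacement according to the
# normalized weights, then reset all weights to 1/N.
def resample(particles, W_tilde, rng):
    idx = rng.choice(len(particles), size=len(particles), p=W_tilde)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))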
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded as necessary.
Deterministic resampling is taken at every k-th time step (usually k = 1); dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e., W^{(i)} = 1/N).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, \ldots, x_{N_x}]^T, we are interested in drawing samples from the marginal density when the joint density is inaccessible or hard to sample from.
Since each conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i|x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
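A classic runnable example: a Gibbs sampler for a bivariate Gaussian with correlation rho, where both conditionals are exact 1-D Gaussians (the parameters are toy choices):

import numpy as np

# Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]):
# x1 | x2 ~ N(rho*x2, 1 - rho^2) and symmetrically for x2 | x1.
rng = np.random.default_rng(0)
rho, M = 0.8, 20_000
x1, x2 = 0.0, 0.0
chain = np.zeros((M, 2))
for n in range(M):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()  # sample p(x1 | x2)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()  # sample p(x2 | x1)
    chain[n] = x1, x2

print(np.corrcoef(chain.T)[0, 1])   # ~0.8, recovering the target correlation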
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from the previous state, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four iteration steps in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N$$

In practice this cannot be computed, so we approximate

$$N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}$$

When $N_{\mathrm{eff}}$ falls below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw sample $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\, p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
  end for
  Compute $N_{\mathrm{eff}} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2$
  if $N_{\mathrm{eff}} < P$ then
    Generate new $x_{n:0}^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for
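The pseudocode above maps almost line-for-line onto a short implementation. The following sketch runs the SIS filter with $N_{\mathrm{eff}}$-triggered resampling on a made-up 1-D linear-Gaussian model, using a deliberately broadened Gaussian proposal so the full weight ratio is exercised; the model, proposal, and threshold are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss(x, mu, sigma):
    """Gaussian pdf N(x; mu, sigma^2), used for transition, likelihood, proposal."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed toy model: x_n = 0.9 x_{n-1} + w_n, w_n ~ N(0,1);  y_n = x_n + v_n, v_n ~ N(0,0.5^2)
# Assumed proposal:  q(x_n | x_{n-1}, y_n) = N(0.9 x_{n-1}, 1.5^2)  (a broadened prior)
T, N, P_thresh = 50, 500, 250
x_true, ys = 0.0, []
for _ in range(T):                                   # simulate observations
    x_true = 0.9 * x_true + rng.normal(0.0, 1.0)
    ys.append(x_true + rng.normal(0.0, 0.5))

x = rng.normal(0.0, 1.0, N)                          # initial particles
W = np.full(N, 1.0 / N)                              # initial weights
for n in range(T):
    x_prev = x
    x = 0.9 * x_prev + rng.normal(0.0, 1.5, N)       # draw x_n^(i) from the proposal q
    W = W * gauss(ys[n], x, 0.5) * gauss(x, 0.9 * x_prev, 1.0) / gauss(x, 0.9 * x_prev, 1.5)
    W = W / W.sum()                                  # normalize the weights
    if 1.0 / np.sum(W ** 2) < P_thresh:              # N_eff below the threshold P?
        idx = rng.choice(N, size=N, p=W)             # resample with replacement
        x, W = x[idx], np.full(N, 1.0 / N)           # reset weights to 1/N

print("MMSE estimate:", np.sum(W * x), " true state:", x_true)
```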
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\hat{x}_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $\hat{x}_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, \hat{x}_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{\hat{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
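With the transition prior as proposal, the $q$-term cancels the dynamics term and the weight update collapses to the likelihood, as in the listing above. Here is a minimal sketch of one such bootstrap step, under the same assumed toy model as in the earlier sketch (all densities and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_step(x, y, rng):
    """One SIR/bootstrap step: propose from the transition prior,
    weight by the likelihood, then resample N new particles."""
    x = 0.9 * x + rng.normal(0.0, 1.0, x.size)      # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = np.exp(-0.5 * ((y - x) / 0.5) ** 2)         # W_n^(i) = p(y_n | x_n^(i)), unnormalized
    W = W / W.sum()                                 # normalize
    return x[rng.choice(x.size, size=x.size, p=W)]  # resampling: N new equally weighted particles

x = rng.normal(0.0, 1.0, 500)                       # particles from an assumed prior
for y in (0.3, 1.1, 0.8):                           # a few made-up observations
    x = bootstrap_step(x, y, rng)
    print(x.mean())                                 # posterior-mean estimate after resampling
```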
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is therefore less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Popular: CONDENSATION

Figure: CONDENSATION.
Criteria for optimal filtering I

An optimal filter is "optimal" under a particular criterion.

Minimum mean-squared error (MMSE): defined in terms of prediction or filtering error,

$$E[\|x_n - \hat{x}_n\|_2^2 \,|\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n$$

which is aimed at finding the conditional mean

$$\hat{x}_n = E[x_n|y_{n:0}] = \int x_n\, p(x_n|y_{n:0})\, dx_n$$

Maximum a posteriori (MAP): aimed at finding the mode of the posterior probability $p(x_n|y_{n:0})$.

Maximum likelihood (ML): reduces to a special case of MAP where the prior is neglected.

Minimax: aimed at finding the median of the posterior $p(x_n|y_{n:0})$.
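A quick numeric illustration (not from the lecture) of why these criteria disagree: on a skewed density the mean, median, and mode are three different points, so MMSE, minimax, and MAP return three different estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
samples = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)  # a skewed "posterior"

mean = samples.mean()                         # MMSE estimate: the mean
median = np.median(samples)                   # minimax estimate: the median
hist, edges = np.histogram(samples, bins=200)
mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])  # rough MAP: the mode

print(mean, median, mode)   # three clearly different answers on a skewed density
```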
Criteria for optimal filtering II

MMSE finds the mean.

MAP finds the mode.

Minimax finds the median.

Figure: (left) Three optimality criteria that seek different solutions for a skewed unimodal distribution; (right) MAP is misleading for a multimodal distribution (Chen 03).
Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy: defined as

$$E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y)\, \log\frac{1}{\hat{p}(x|y)}\, dx\, dy$$

Minimum conditional KL divergence:

$$KL(p\|\hat{p}) = \int p(x, y)\, \log\frac{p(x, y)}{\hat{p}(x|y)\, p(x)}\, dx\, dy$$

where the KL divergence is a measure of discrepancy between distributions, with $KL(p\|\hat{p}) \ge 0$; the KL is 0 only when the distributions are the same.
Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: minimizing it corresponds to maximizing a lower bound on the log-likelihood. It is aimed at minimizing

$$F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)] = E_{Q(x)}\left[\log\frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)] = KL(Q\|P) + H(Q)$$

This minimization can be done using the (EM) algorithm, alternating

$$Q(x_{n+1}) \leftarrow \arg\min_Q F(Q, P), \qquad \hat{x}_{n+1} \leftarrow \arg\min_x F(Q, P)$$
Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally cheaper.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence, the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML-oriented and closely related.
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

$$E[\|x_n - \hat{x}_n\|_2^2 \,|\, y_{n:0}] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\, dx_n$$

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g., linear Gaussian).

In general, we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited

In practice we are interested in the discrete simplified case

$$x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)$$

When the dynamic system is linear Gaussian, this reduces to

$$x_{n+1} = F_{n+1,n}\, x_n + w_n, \qquad y_n = G_n x_n + v_n$$

with $F_{n+1,n}$ the transition matrix and $G_n$ the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e., mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
Kalman filter: Forward equations I

We start by defining the messages

$$\alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n)$$

Using the HMM recursion formulas for continuous variables, we have

$$c_n\, \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\, p(z_n|z_{n-1})\, dz_{n-1}$$

Substituting the conditionals, we have

$$c_n\, \mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|C z_n, \Sigma) \int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\, \mathcal{N}(z_n|A z_{n-1}, \Gamma)\, dz_{n-1} = \mathcal{N}(x_n|C z_n, \Sigma)\, \mathcal{N}(z_n|A\mu_{n-1}, P_{n-1})$$

Here we assume that $\mu_{n-1}$ and $V_{n-1}$ are known, and we have defined

$$P_{n-1} = A V_{n-1} A^T + \Gamma$$
Kalman filter: Forward equations II

Given the values of $\mu_{n-1}$, $V_{n-1}$ and the new observation $x_n$, we can evaluate the Gaussian marginal for $z_n$, having mean $\mu_n$ and covariance $V_n$, as well as the normalization coefficient $c_n$:

$$\mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1})$$
$$V_n = (I - K_n C)\, P_{n-1}$$
$$c_n = \mathcal{N}(x_n|CA\mu_{n-1},\, C P_{n-1} C^T + \Sigma)$$

where the Kalman gain matrix is defined as

$$K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}$$

The initial conditions are given by

$$\mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \qquad V_1 = (I - K_1 C)\, V_0$$
$$c_1 = \mathcal{N}(x_1|C\mu_0,\, C V_0 C^T + \Sigma), \qquad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}$$

The interpretation is making a prediction and doing corrections with $K_n$.

The likelihood can be computed as $p(X) = \prod_{n=1}^N c_n$.
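A compact sketch of the forward recursion above, keeping the slide's notation ($A$, $C$, $\Gamma$, $\Sigma$; observations $x_n$ of latent states $z_n$); the matrices and data below are illustrative assumptions:

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward Kalman recursion: returns the filtered means mu_n."""
    mu, V = mu0, V0
    means = []
    for n, x in enumerate(X):
        if n == 0:
            P, pred = V0, C @ mu0               # initial step uses the prior directly
        else:
            P = A @ V @ A.T + Gamma             # P_{n-1} = A V_{n-1} A^T + Gamma
            mu = A @ mu                         # predicted mean A mu_{n-1}
            pred = C @ mu
        S = C @ P @ C.T + Sigma                 # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)          # Kalman gain K_n
        mu = mu + K @ (x - pred)                # mu_n: predict, then correct
        V = (np.eye(len(mu)) - K @ C) @ P       # V_n
        means.append(mu)
    return means

A = np.array([[1.0]]); C = np.array([[1.0]])    # toy 1-D system (assumed)
Gamma = np.array([[0.1]]); Sigma = np.array([[0.5]])
X = [np.array([0.2]), np.array([0.5]), np.array([0.9])]
print(kalman_forward(X, A, C, Gamma, Sigma, np.array([0.0]), np.array([[1.0]])))
```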
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding $p(x_n|y_{n:0})$.

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g., Gaussian sum filter), linearization techniques (i.e., EKF) or numerical approximations (e.g., particle filter) are usually used.
Extended Kalman filter (EKF)

Recall the equations of motion

$$x_{n+1} = f(x_n, w_n), \qquad y_n = g(x_n, v_n)$$

These equations are linearized in the EKF:

$$F_{n+1,n} = \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}_n}, \qquad G_n = \left.\frac{\partial g(x)}{\partial x}\right|_{x=\hat{x}_{n|n-1}}$$

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior $p(x_n|y_{n:0})$ as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g., heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
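A sketch of a single EKF predict-correct step, with the Jacobians $F$ and $G$ obtained by finite differences; the functions f, g and the noise covariances are made-up stand-ins, not the lecture's model:

```python
import numpy as np

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of h at x (central differences)."""
    x = np.asarray(x, float)
    J = np.empty((len(h(x)), len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (h(x + dx) - h(x - dx)) / (2 * eps)
    return J

def ekf_step(mu, V, y, f, g, Q, R):
    F = jacobian(f, mu)                  # F_{n+1,n} = df/dx at the current estimate
    mu_pred = f(mu)                      # nonlinear state prediction
    P = F @ V @ F.T + Q                  # predicted covariance
    G = jacobian(g, mu_pred)             # G_n = dg/dx at the predicted state
    S = G @ P @ G.T + R
    K = P @ G.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))
    V_new = (np.eye(len(mu)) - K @ G) @ P
    return mu_new, V_new

f = lambda x: np.array([np.sin(x[0])])   # assumed nonlinear dynamics
g = lambda x: np.array([x[0] ** 2])      # assumed nonlinear observation map
mu, V = np.array([0.5]), np.array([[1.0]])
print(ekf_step(mu, V, np.array([0.3]), f, g, Q=np.eye(1) * 0.1, R=np.eye(1) * 0.2))
```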
Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Monte Carlo sampling

It's a brute-force technique: provided that one can draw iid samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$,

$$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

for which $E[\hat{f}_N] = E[f]$ and $\mathrm{Var}[\hat{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ with high probability.

The convergence rate is assessed by the Central Limit Theorem:

$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
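A few lines suffice to see the $O(N^{-1/2})$ behavior; the target expectation below is an illustrative choice (for $x \sim \mathcal{N}(0,1)$ and $f(x) = x^2$, the true value is 1):

```python
import numpy as np

rng = np.random.default_rng(5)
for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)            # iid samples x^(i) ~ P = N(0, 1)
    est = np.mean(x ** 2)             # (1/N) sum f(x^(i))
    print(N, est, abs(est - 1.0))     # error shrinks roughly like sigma / sqrt(N)
```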
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

$$\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\, dP(x)$?
Important properties of an estimator

Consistency: an estimator is consistent if the estimator converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and the uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx$$

Figure: Importance sampling (Chen 03).

Monte Carlo importance sampling uses N independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)})\, f(x^{(i)})$$

where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.
Importance Sampling II

If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

$$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\, f(x^{(i)}), \qquad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$$

The variance of the estimate is given by

$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \frac{(f(x)\, p(x))^2}{q(x)}\, dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:

match the shape of $p(x)$, so as to approximate the true variance;

match the shape of $|f(x)|\, p(x)$, so as to further reduce the true variance.

The estimator is biased but consistent.
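The variance formula explains why the proposal's shape matters. The sketch below (target, integrand, and both proposals are illustrative choices) compares the spread of the plain IS estimate of $E[x^2]$ under a well-matched, heavier proposal versus a light-tailed one; the light-tailed $q$ yields wildly varying estimates:

```python
import numpy as np

rng = np.random.default_rng(6)

def gauss(x, s):
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def is_estimate(sigma_q, N=2_000):
    x = rng.normal(0.0, sigma_q, N)           # samples from the proposal q = N(0, sigma_q^2)
    W = gauss(x, 1.0) / gauss(x, sigma_q)     # importance weights p/q, target p = N(0, 1)
    return np.mean(W * x ** 2)                # IS estimate of E[x^2] (true value 1)

for s in (1.5, 0.4):                          # heavy-ish tails vs. too-light tails
    runs = [is_estimate(s) for _ in range(200)]
    print(s, np.mean(runs), np.std(runs))     # light-tailed q: much larger spread
```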
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution $q(x)$ should have heavy tails, so as to be insensitive to outliers.

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$. The sampling proceeds as:

for n = 1, ..., N do
  Sample $u \sim U(0, 1)$
  Sample $x \sim q(x)$
  if $u > \frac{p(x)}{Cq(x)}$ then
    Repeat sampling
  end if
end for

Figure: Importance (left) and rejection (right) sampling (Chen 03).

The acceptance probability for a random variable is inversely proportional to the constant $C$.

The choice of $C$ is critical:

if $C$ is too small, the samples are not reliable because of the low rejection rate;

if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior $p(x)$ is used as $q(x)$, the likelihood satisfies $p(y|x) \le C$, and $C$ is known, then

$$p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$$

and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant $C$, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio $p(x)/Cq(x)$ is close to zero.
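A minimal rejection sampler, assuming an illustrative Beta(2,2) target whose density is known pointwise and bounded by $C = 1.5$ times a uniform proposal:

```python
import numpy as np

rng = np.random.default_rng(7)

def p(x):
    return 6.0 * x * (1.0 - x)        # Beta(2,2) density on [0, 1], max value 1.5

C, samples = 1.5, []                  # bound C chosen so p(x) <= C * q(x) with q = U(0,1)
while len(samples) < 10_000:
    x = rng.random()                  # x ~ q = U(0, 1)
    u = rng.random()                  # u ~ U(0, 1)
    if u <= p(x) / C:                 # accept with probability p(x) / (C q(x))
        samples.append(x)             # accepted draws are exact samples from p

print(np.mean(samples))               # ~0.5, the Beta(2,2) mean
```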
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(x_{n:0}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\, \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many iid replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights:

Sample N random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for i = 1, ..., N do
  $W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})$
end for
for i = 1, ..., N do
  Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).

Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
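Among the "many types of resampling methods", systematic resampling is a common low-variance scheme; here is one standard sketch (a generic implementation, not specific to the lecture):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: one uniform offset, N evenly spaced positions.
    Returns indices such that particle i is replicated roughly N * W_i times."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N   # stratified positions in [0, 1)
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0                                # guard against floating-point round-off
    return np.searchsorted(cumsum, positions)       # index of the bin each position falls in

rng = np.random.default_rng(2)
W = np.array([0.70, 0.15, 0.10, 0.04, 0.01])        # skewed normalized weights
print(systematic_resample(W, rng))                  # mostly copies of index 0
```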
Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample from $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
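A classic toy instance of the loop above: for a bivariate Gaussian with correlation $\rho$, both full conditionals are univariate Gaussians, so each coordinate draw is trivial; $\rho$ and the chain length are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)
rho, M = 0.8, 20_000
x1, x2 = 0.0, 0.0
chain = np.empty((M, 2))
for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    chain[n] = x1, x2

print(np.corrcoef(chain[1000:].T))    # empirical correlation ~ rho after burn-in
```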
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) to the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.
Numerical approximations

Monte Carlo sampling approximation (i.e., particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Gauss/Laplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$$\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, \left|-\nabla\nabla \log f(\hat{x})\right|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
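A 1-D sketch of the volume formula above, with the Hessian of $\log f$ at the mode taken by finite differences; the integrand is a made-up non-Gaussian toy whose mode is known to be 0:

```python
import numpy as np

def f(x):
    return np.exp(-x ** 4 / 4.0 - x ** 2 / 2.0)    # toy non-Gaussian integrand

x_hat, h = 0.0, 1e-4                               # known mode; finite-difference step
hess = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2
laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * (-hess) ** -0.5   # (2 pi)^{1/2} f(x) |-dd log f|^{-1/2}

xs = np.linspace(-6.0, 6.0, 200_001)               # brute-force reference integral
brute = f(xs).sum() * (xs[1] - xs[0])
print(laplace, brute)                              # close, exact only if f were Gaussian
```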
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^m c_k\, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
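For a Gaussian weighting function, the quadrature points $x_k$ and coefficients $c_k$ are given by the Gauss-Hermite rule. A sketch using numpy's hermgauss (the change of variables $x = \sqrt{2}\,t$ adapts the $e^{-t^2}$ weight to a standard normal); the integrand is an illustrative choice:

```python
import numpy as np

# hermgauss gives nodes/weights for integrals against e^{-t^2}
t, w = np.polynomial.hermite.hermgauss(20)      # m = 20 quadrature points
f = lambda x: np.cos(x)                         # smooth test integrand

approx = np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)   # E[f(x)] for x ~ N(0, 1)
print(approx, np.exp(-0.5))                     # E[cos(x)] = e^{-1/2} exactly
```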
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize the state space into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so resolution cannot be concentrated on the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$\hat{m}_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k$$

where $\hat{m}_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation

The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$\hat{p}(x) = \sum_{j=1}^m c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be iid samples drawn from $p(x_n|y_{n:0})$.

By this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\, \hat{p}(x_n|y_{n:0})\, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\, \delta(x_n - x_n^{(i)})\, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Criteria for optimal filtering I
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum mean-squared error (MMSE) defined in terms of prediction orfiltering error
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
which is aimed to find the conditional mean
xn = E [xn|yn0] =
intxnp(xn|yn0)dxn
Maximum a posteriori (MAP) It is aimed to find the mode of posteriorprobability p(xn|yn0)
Maximum likelihood (ML) which reduces to a special case of MAP wherethe prior is neglected
Minimax which is to find the median of posterior p(xn|yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 15 69
Criteria for optimal filtering II
MMSE finds the mean
MAP finds the mode
Minimax finds the median
Figure (left) Three optimal criteria that seek different solutions for a skewedunimodal distribution (right) MAP is misleading for the multimodal distribution(Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 16 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering III
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum conditional inaccuracy defined as
Ep(xy)[minus log p(x|y)] =
intp(x y) log
1
p(x|y)dxdy
Minimum conditional KL divergence
KL(p||p) =
intp(x y) log
p(x y)
p(x|y)p(x)dxdy
where the KL is a measure of divergence between distributions such that0 le KL(p||p) le 1 The KL is 0 only when the distributions are the same
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 17 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
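A minimal sketch of the estimator, using an assumed test integrand f(x) = x^2 under P = N(0, 1), for which E[f] = 1; the printed error shrinks roughly like N^{-1/2}:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                      # assumed test integrand; E[f] = 1 under N(0,1)

for N in (10 ** 2, 10 ** 4, 10 ** 6):
    x = rng.standard_normal(N)            # i.i.d. samples from P(x)
    f_hat = f(x).mean()                   # \hat{f}_N
    print(N, f_hat, abs(f_hat - 1.0))     # error decays roughly as O(N^{-1/2})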
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
\int_X f(x)\,dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big) = \hat{f}_N
There are two fundamental problems:
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e., E[f(x)] = \int f(x)\,dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.
Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:
\int f(x) p(x)\,dx = \int f(x) \frac{p(x)}{q(x)} q(x)\,dx
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
\hat{f} = \frac{1}{N} \sum_{i=1}^{N} W(x^{(i)}) f(x^{(i)})
where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
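A small sketch of the estimator; the standard-normal target and the heavy-tailed Student-t proposal are assumptions chosen for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 10_000
x = stats.t.rvs(df=3, size=N, random_state=rng)     # N independent samples from the proposal q
W = stats.norm.pdf(x) / stats.t.pdf(x, df=3)        # importance weights p/q
f = x ** 2
print((W * f).mean())                               # plain IS estimate of E_p[x^2] = 1
print((W * f).sum() / W.sum())                      # self-normalized variant (see the next slide)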
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure normalization, the importance weights are normalized:
\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)}) f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}
The variance of the estimate is given by
Var[\hat{f}] = \frac{1}{N} Var[f(x)W(x)] = \frac{1}{N} Var\Big[f(x)\frac{p(x)}{q(x)}\Big] = \frac{1}{N} \int \frac{(f(x)p(x))^2}{q(x)}\,dx - \frac{(E[f(x)])^2}{N}
The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance
match the shape of |f(x)|p(x), so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus, many samples are almost useless because of their negligible contributions.
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant C < \infty such that p(x) < Cq(x) for every x \in X. The sampling proceeds as follows:
for n = 1 to N do
    Sample u \sim U(0, 1)
    Sample x \sim q(x)
    if u > \frac{p(x)}{Cq(x)} then
        Repeat sampling
    end if
end for
Figure: Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate
if C is too large, the sampling is inefficient since the acceptance rate will be low
If the prior p(x) is used as q(x), the likelihood p(y|x) \le C, and C is known, then
p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)
and the acceptance rate for sample x is \frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
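A minimal sketch under assumed choices: target p = N(0, 1), proposal q = Student-t with 3 degrees of freedom, and a bound C = 1.6 that was checked numerically for this particular pair:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_sample(n, C=1.6):
    # p(x) <= C q(x) must hold everywhere; C = 1.6 covers this target/proposal pair
    out = []
    while len(out) < n:
        x = stats.t.rvs(df=3, random_state=rng)     # sample x ~ q(x)
        u = rng.uniform()                           # sample u ~ U(0, 1)
        if u <= stats.norm.pdf(x) / (C * stats.t.pdf(x, df=3)):
            out.append(x)                           # accept; otherwise repeat sampling
    return np.array(out)

s = rejection_sample(5000)
print(s.mean(), s.std())                            # approximately 0 and 1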
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
p(x_{n:0}) = p(x_0)\,p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})
q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})
The weights can be recursively calculated as
W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) are nonzero.
We will see that, to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights:
Sample N random samples \{x^{(i)}\}_{i=1}^{N} from q(x)
for i = 1, \cdots, N do
    W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}
end for
for i = 1, \cdots, N do
    Normalize the weights: \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}
end for
Resample with replacement N times from the discrete set \{x^{(i)}\}_{i=1}^{N}, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
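A possible implementation of the resampling step above, in its simplest (multinomial) variant; the function name and the example weights are illustrative:

import numpy as np

def resample_multinomial(particles, weights, rng):
    # Draw N times with replacement; the chance of picking x^(i) is its normalized weight
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)      # weights reset to 1/N after resampling

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0])
W = np.array([0.05, 0.05, 0.2, 0.7])
print(resample_multinomial(x, W, rng)[0])           # mostly replicates of 3.0 and 2.0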
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded as necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights \{x^{(i)}, W^{(i)}\} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
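A minimal sketch of the sampler for a case where both conditionals are available in closed form: a bivariate Gaussian with an assumed correlation of 0.8 (for this target, x_1|x_2 ~ N(rho x_2, 1 - rho^2), and symmetrically for x_2|x_1):

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5000                        # assumed target correlation and chain length
x = np.zeros(2)                           # starting state x_0
chain = np.empty((M, 2))
for n in range(M):
    # alternate the two conditional draws
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho ** 2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho ** 2))
    chain[n] = x
print(np.corrcoef(chain.T)[0, 1])         # close to 0.8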
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variances) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good, since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation method approximates the integral of a function \int f(x)\,dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:
\int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,\big| -\nabla\nabla \log f(\hat{x}) \big|^{-1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
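A small numerical sketch of this approximation on an assumed one-dimensional unimodal test function, compared against direct quadrature:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-x ** 4 / 4.0 - x ** 2 / 2.0)   # assumed unimodal test function

res = minimize_scalar(lambda t: -np.log(f(t)))        # mode \hat{x} of f
x_hat = res.x

eps = 1e-4                                            # second derivative of log f at the mode
d2 = (np.log(f(x_hat + eps)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - eps))) / eps ** 2

laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5   # (2pi)^{1/2} f(x_hat) |-d2 log f|^{-1/2}
exact, _ = quad(f, -np.inf, np.inf)
print(laplace, exact)                                 # close, since f is unimodal and smooth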
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x) p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)
where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
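A sketch for the common case where the weighting function p(x) is a standard Gaussian, using Gauss-Hermite quadrature points; the smooth test nonlinearity is an assumption:

import numpy as np

# Points/weights for the probabilists' weight exp(-x^2/2); rescale by 1/sqrt(2*pi)
xk, ck = np.polynomial.hermite_e.hermegauss(10)       # m = 10 quadrature points
f = lambda x: np.sin(x) ** 2                          # assumed smooth nonlinearity
approx = (ck * f(xk)).sum() / np.sqrt(2 * np.pi)
exact = 0.5 * (1 - np.exp(-2.0))                      # E[sin^2 X] for X ~ N(0, 1)
print(approx, exact)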
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \big|x^{(i)}\big|^k
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
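A minimal sketch of the sample-moment idea for an assumed standard normal target (raw moments shown; the slide's absolute moments follow by inserting np.abs):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # samples from the (assumed) true distribution
for k in (1, 2, 3, 4):
    print(k, np.mean(x ** k))             # approximately 0, 1, 0, 3 for N(0, 1)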
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(x; \hat{x}_j, \Sigma_j)
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
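A small sketch of a two-component Gaussian sum and its tractable first two moments; the weights, means, and variances are illustrative assumptions:

import numpy as np
from scipy import stats

c = np.array([0.3, 0.7])                  # c_j > 0, summing to 1 (assumed)
mu = np.array([-2.0, 1.5])
sigma = np.array([0.5, 1.0])

def p(x):
    # weighted sum of Gaussian densities
    return sum(cj * stats.norm.pdf(x, m, s) for cj, m, s in zip(c, mu, sigma))

mean = (c * mu).sum()                     # mixture mean from the component moments
var = (c * (sigma ** 2 + mu ** 2)).sum() - mean ** 2
print(p(0.0), mean, var)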
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define
W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}
We can then write
E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0})\,dx_n
= \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0})\,dx_n
= \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n}
= \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n) q(x_n|y_{n:0})\,dx_n}
= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)
where the normalized weights are defined as
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^{N} W_n(x_n^{(i)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})
As before, the posterior can be written as
p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).
Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure for degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N
In practice this cannot be computed, so we approximate it as
N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}
When N_{eff} falls below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
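The approximation is immediate to compute from the normalized weights; the example weights below are illustrative:

import numpy as np

def effective_sample_size(W):
    # N_eff ~= 1 / sum_i W_i^2 for normalized importance weights W
    return 1.0 / np.sum(W ** 2)

W = np.array([0.7, 0.1, 0.1, 0.05, 0.05])     # skewed normalized weights
print(effective_sample_size(W))               # about 1.9 effective particles out of 5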
SIS particle filter with resampling
for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Draw samples x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, \cdots, N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
    end for
    for i = 1, \cdots, N do
        Normalize the weights: \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}
    end for
    Compute N_{eff} = \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}
    if N_{eff} < P then
        Generate new x_n^{(j)} by resampling with replacement N times from \{x_{n:0}^{(i)}\}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = \frac{1}{N}
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, \cdots, N do
    Sample x_0^{(i)} \sim p(x_0)
    Compute W_0^{(i)} = \frac{1}{N}
end for
for n = 0, \cdots, T do
    for i = 1, \cdots, N do
        Importance sampling: x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
    end for
    for i = 1, \cdots, N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, \cdots, N do
        Normalize the weights: \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set \{x_n^{(i)}\} according to \tilde{W}_n^{(i)}
end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
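Putting the pieces together, here is a compact sketch of this bootstrap SIR filter; the benchmark nonlinear model, the noise levels, and the particle count are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 50                                # particles and time steps (assumed)
sq, sr = np.sqrt(10.0), 1.0                   # process / measurement noise std (assumed)

def f(x, n):
    # a standard nonlinear benchmark transition (an assumption for this sketch)
    return 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * n)

# Simulate a ground-truth trajectory with observations y_n = x_n^2 / 20 + v_n
x_true, truth, ys = 0.1, [], []
for n in range(T):
    x_true = f(x_true, n) + sq * rng.standard_normal()
    truth.append(x_true)
    ys.append(x_true ** 2 / 20 + sr * rng.standard_normal())

# Bootstrap SIR filter: propose from the transition prior, weight by the likelihood
x = rng.standard_normal(N)                    # x_0^(i) ~ p(x_0)
est = []
for n, y in enumerate(ys):
    x = f(x, n) + sq * rng.standard_normal(N)         # importance sampling step
    logw = -0.5 * ((y - x ** 2 / 20) / sr) ** 2       # W_n^(i) proportional to p(y_n | x_n^(i))
    w = np.exp(logw - logw.max())
    w /= w.sum()                                      # normalize the weights
    est.append((w * x).sum())                         # posterior-mean estimate
    x = x[rng.choice(N, size=N, p=w)]                 # resample with replacement

print(np.sqrt(np.mean((np.array(est) - np.array(truth)) ** 2)))   # tracking RMSE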
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is taken only whenever needed; the SIS filter is therefore less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $\mathbf{x} = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models.
Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
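As a minimal illustration of this alternating conditional sampling, here is a Gibbs sampler for a bivariate Gaussian with correlation $\rho$, where both full conditionals are available in closed form; the model and names are our own toy example, not from the lecture.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, M, rng):
    # Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    # Each full conditional is Gaussian: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically.
    x1, x2 = 0.0, 0.0                    # arbitrary initial state x_0
    std = np.sqrt(1.0 - rho**2)
    samples = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, std)   # draw x1 ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, std)   # draw x2 ~ p(x2 | x1)
        samples[n] = (x1, x2)
    return samples

rng = np.random.default_rng(0)
s = gibbs_bivariate_gaussian(rho=0.8, M=20000, rng=rng)
print(np.corrcoef(s[5000:].T))           # empirical correlation ~= 0.8 after burn-in
```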
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over subregions that divide the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is useful since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be carried out with the EKF method.
The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:
$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,|-\nabla\nabla \log f(\hat{x})|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
Works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
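A small 1-D sketch of this recipe: locate the maximum $\hat{x}$ of $f$, estimate the second derivative of $\log f$ at $\hat{x}$ by finite differences, and plug into the volume formula (with $N_x = 1$). The integrand is a toy choice of ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-0.5 * x**2 - 0.25 * x**4)   # toy unnormalized density

# 1. Find the maximum x_hat of f (equivalently of log f).
res = minimize_scalar(lambda x: -np.log(f(x)), bounds=(-5.0, 5.0), method="bounded")
x_hat = res.x

# 2. Finite-difference second derivative of log f at x_hat (the 1-D "Hessian").
h = 1e-4
logf = lambda x: np.log(f(x))
hess = (logf(x_hat + h) - 2.0 * logf(x_hat) + logf(x_hat - h)) / h**2

# 3. Laplace estimate: (2*pi)^(1/2) * f(x_hat) * |-hess|^(-1/2).
print(np.sqrt(2.0 * np.pi) * f(x_hat) * abs(-hess) ** -0.5)

# Brute-force Riemann sum for comparison; Laplace slightly overestimates here,
# since the quartic term makes the tails lighter than the fitted Gaussian's.
xs = np.linspace(-5.0, 5.0, 200001)
print(np.trapz(f(xs), xs))
```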
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\,f(x_k)$$
where $p(x)$ is treated as a weighting function and the $x_k$ are the quadrature points.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
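For instance, when $p(x)$ is a Gaussian weighting function, the quadrature points $x_k$ and coefficients $c_k$ can be taken from the Gauss-Hermite rule. A small sketch (our own example, using NumPy's `hermgauss` and the standard change of variables):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, mu, sigma, m=10):
    # Approximate E[f(x)] for x ~ N(mu, sigma^2) with an m-point Gauss-Hermite rule.
    z, c = hermgauss(m)                    # nodes/weights for the weight exp(-z^2)
    x = mu + np.sqrt(2.0) * sigma * z      # change of variables to N(mu, sigma^2)
    return np.sum(c * f(x)) / np.sqrt(np.pi)

f = lambda x: np.sin(x) ** 2               # a smooth nonlinear test function
print(gauss_hermite_expectation(f, mu=0.0, sigma=1.0))
# Exact value for x ~ N(0,1): (1 - exp(-2)) / 2 ~= 0.43233
```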
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states, and a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so one cannot give greater resolution to regions where the density is high.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
$$m_k = E[x^k] = \int_X x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$$
where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
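In code, the sample-moment approximation is a one-liner per moment; a sketch with samples we draw ourselves from a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)   # samples x^(i) from the "true" distribution
moments = {k: np.mean(np.abs(x) ** k) for k in (1, 2, 3, 4)}
print(moments)  # for N(0,1) the absolute moments are ~0.798, 1.0, ~1.596, 3.0
```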
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
$$p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\, \hat{x}_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to any desired degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
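A tiny sketch of the idea: approximate a bimodal target pointwise with a two-component Gaussian sum whose weights $c_j$ and component parameters we pick by hand (a real Gaussian sum filter would instead propagate them through a bank of EKFs):

```python
import numpy as np
from scipy.stats import norm

# A bimodal "true" density (itself a mixture, standing in for a non-Gaussian posterior).
p = lambda x: 0.5 * norm.pdf(x, -2.0, 0.7) + 0.5 * norm.pdf(x, 2.0, 0.7)

# Gaussian sum approximation: sum_j c_j * N(x; mu_j, sigma_j^2), c_j > 0, sum_j c_j = 1.
c, mu, sig = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([0.8, 0.8])
p_hat = lambda x: np.sum(c[:, None] * norm.pdf(x[None, :], mu[:, None], sig[:, None]), axis=0)

x = np.linspace(-5.0, 5.0, 101)
print(np.max(np.abs(p(x) - p_hat(x))))  # small pointwise error with only two components
```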
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many regions, which are filled with particles according to some probability measure: the higher the probability, the denser the concentration of particles.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$$
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$.
By this approximation, we can estimate the mean of a nonlinear function:
$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define
$$W_n(x_n) = \frac{p(y_{0:n}|x_n)\,p(x_n)}{q(x_n|y_{0:n})}$$
We can then write
$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\,q(x_n|y_{0:n})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n|y_{0:n})\,dx_n$$
$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{0:n})\,dx_n} = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate
$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$
where the normalized weights are defined as
$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
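A minimal NumPy sketch of this self-normalized importance sampling estimate; the target (known only up to a constant), the proposal, and the test function are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

p_unnorm = lambda x: np.exp(-0.5 * x**2)   # target p known only up to its normalizing constant
q_sigma = 2.0                              # proposal q: a wider Gaussian, easy to sample

x = rng.normal(0.0, q_sigma, size=N)       # x^(i) ~ q
q_pdf = np.exp(-0.5 * (x / q_sigma)**2) / (q_sigma * np.sqrt(2.0 * np.pi))

W = p_unnorm(x) / q_pdf                    # unnormalized weights W(x^(i))
W_tilde = W / W.sum()                      # normalized weights

f = lambda x: x**2
print(np.sum(W_tilde * f(x)))              # ~= E_p[x^2] = 1 for the standard Gaussian
```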
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$
As before, the posterior can be written as
$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$
We can then create a recursive rule to update the weights:
$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.
Let's assume that $q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)$; then we can write
$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After a few iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$$
In practice this cannot be computed, so we approximate it as
$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{0:n}^{(i)}))^2}$$
When $N_{\text{eff}}$ falls below a threshold $P$, resampling is performed.
$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
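In code, the effective-sample-size test is one line on the normalized weights; a sketch where the threshold $P = N/2$ is a common heuristic of our choosing, not prescribed by the slide:

```python
import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i w_tilde_i^2 for normalized weights w_tilde.
    return 1.0 / np.sum(w_tilde**2)

w = np.array([0.96, 0.01, 0.01, 0.01, 0.01])  # highly skewed normalized weights
N = len(w)
if effective_sample_size(w) < N / 2:          # threshold P
    print("weights degenerate: resample")     # N_eff ~= 1.08 here, so we resample
```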
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})$
  end for
  Compute $N_{\text{eff}} = 1 / \sum_{i=1}^{N}(\tilde{W}(x_{0:n}^{(i)}))^2$
  if $N_{\text{eff}} < P$ then
    Generate new $x_{0:n}^{(j)}$ by resampling with replacement N times from $\{x_{0:n}^{(i)}\}$, with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
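A compact NumPy rendering of this loop for a 1-D model, using the transition prior as the proposal $q$ (so the weight update reduces to the likelihood); the state-space model is a placeholder of our own choosing, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 50

# Toy model (ours): x_n = 0.5*x_{n-1} + process noise,  y_n = x_n^2/20 + measurement noise.
transition = lambda x: 0.5 * x + rng.normal(0.0, 1.0, size=x.shape)
likelihood = lambda y, x: np.exp(-0.5 * (y - x**2 / 20.0) ** 2)

# Simulate a trajectory and observations.
x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.5 * x_true[n - 1] + rng.normal()
    y[n] = x_true[n] ** 2 / 20.0 + rng.normal()

particles = rng.normal(0.0, 1.0, size=N)       # x_0^(i) ~ p(x_0)
weights = np.full(N, 1.0 / N)

for n in range(1, T):
    particles = transition(particles)          # draw x_n^(i) ~ q = p(x_n | x_{n-1}^(i))
    weights *= likelihood(y[n], particles)     # W_n = W_{n-1} * p(y_n | x_n^(i))
    weights /= weights.sum()                   # normalize
    if 1.0 / np.sum(weights**2) < N / 2:       # N_eff below threshold P = N/2
        idx = rng.choice(N, size=N, p=weights) # multinomial resampling with replacement
        particles = particles[idx]
        weights.fill(1.0 / N)                  # reset weights to 1/N
    x_hat = np.sum(weights * particles)        # posterior-mean estimate at time n
```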
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further computation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\hat{x}_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $\hat{x}_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, \hat{x}_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|\hat{x}_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{\hat{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
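Relative to the SIS sketch above, the only change is that resampling is performed unconditionally at every step; reusing the toy model, `transition`, `likelihood`, and data from that sketch (all our own placeholders), the inner loop becomes:

```python
for n in range(1, T):
    particles = transition(particles)        # proposal = transition prior p(x_n | x_{n-1})
    weights = likelihood(y[n], particles)    # W_n^(i) = p(y_n | x_n^(i)); previous weights are uniform
    weights /= weights.sum()                 # normalize
    idx = rng.choice(N, size=N, p=weights)   # resample at every step
    particles = particles[idx]
    weights = np.full(N, 1.0 / N)            # reset to equal weights
```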
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.
To alleviate the sample degeneracy problem in the SIS filter, we can change the weight update to
$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{0:n}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
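As a concrete illustration of this self-normalized estimator, here is a minimal NumPy sketch; the model is purely illustrative (prior N(0,1), one observation y = 0.5 with likelihood N(y; x, 0.5^2), proposal q = N(0, 2^2)), chosen so that the exact posterior mean, 0.4, is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: prior x ~ N(0,1), observed y = 0.5 with y|x ~ N(x, 0.5^2),
# proposal q = N(0, 2^2). W(x) = p(y|x) p(x) / q(x) is known only up to p(y).
y = 0.5
prior = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
lik   = lambda x: np.exp(-0.5 * ((y - x) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))
q_pdf = lambda x: np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)      # i.i.d. draws from the proposal q
W = lik(x) * prior(x) / q_pdf(x)      # unnormalized weights W_n(x^(i))
W_tilde = W / W.sum()                 # normalized weights
print(np.sum(W_tilde * x))            # ~0.4, the exact posterior mean
```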
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1})\, \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}

where we used the Markov property p(x_n|x_{0:n-1}) = p(x_n|x_{n-1}). We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})}
\propto \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})\, p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\, q(x_{0:n-1}^{(i)}|y_{0:n-1})}
= W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).

Let's assume that q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
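In code this approximation is one line; a small sketch, assuming `W` holds the normalized weights:

```python
import numpy as np

def effective_sample_size(W):
    """N_eff approximation from normalized importance weights (W sums to 1)."""
    return 1.0 / np.sum(W**2)

print(effective_sample_size(np.full(5, 0.2)))                         # 5.0: balanced
print(effective_sample_size(np.array([0.7, 0.1, 0.1, 0.05, 0.05])))   # ~1.9: degenerate
```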
SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n|x_{0:n-1}^{(i)}, y_{0:n})
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}_n^{(i)})^2
  if N_{eff} < P then
    Generate new x_{0:n}^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}
    with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
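The following NumPy sketch implements this loop for a generic scalar model. The five model callables are placeholders the user must supply (their names and signatures are assumptions, not part of the original algorithm), and the threshold is P = resample_frac * N:

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_filter(y, x0_sample, q_sample, q_pdf, trans_pdf, lik_pdf,
               N=1000, resample_frac=0.5):
    """SIS particle filter with adaptive resampling (threshold P = resample_frac * N).

    All model functions are user-supplied and vectorized over particles:
      x0_sample(N)         -> N draws from p(x_0)
      q_sample(x_prev, y)  -> proposals x_n ~ q(x_n | x_{n-1}, y_n)
      q_pdf(x, x_prev, y)  -> proposal density q(x_n | x_{n-1}, y_n)
      trans_pdf(x, x_prev) -> transition density p(x_n | x_{n-1})
      lik_pdf(y, x)        -> likelihood p(y_n | x_n)
    Returns the filtered posterior means E[x_n | y_{0:n}].
    """
    x = x0_sample(N)
    W = np.full(N, 1.0 / N)
    means = []
    for yn in y:
        x_new = q_sample(x, yn)
        W = W * lik_pdf(yn, x_new) * trans_pdf(x_new, x) / q_pdf(x_new, x, yn)
        W /= W.sum()                                 # normalized weights
        x = x_new
        means.append(np.sum(W * x))
        if 1.0 / np.sum(W**2) < resample_frac * N:   # N_eff below threshold P
            x = x[rng.choice(N, size=N, p=W)]        # resample with replacement
            W = np.full(N, 1.0 / N)                  # reset weights to 1/N
    return np.array(means)
```

When the proposal equals the transition prior, the weight ratio collapses to the likelihood alone and this reduces to the bootstrap filter of the next slides.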
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
  end for
  Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
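To make this transition-prior special case concrete, here is a minimal run on a toy scalar model; all model choices are illustrative assumptions: x_{n+1} = 0.9 x_n + w_n with w_n ~ N(0,1), and y_n = x_n + v_n with v_n ~ N(0, 0.5^2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy model: x_{n+1} = 0.9 x_n + w_n,  y_n = x_n + v_n
T, N = 50, 2000
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal(0.0, 1.0)
y = x_true + rng.normal(0.0, 0.5, size=T)

x = rng.normal(0.0, 1.0, size=N)                 # particles from p(x_0)
est = np.zeros(T)
for n in range(T):
    x = 0.9 * x + rng.normal(0.0, 1.0, size=N)   # propose from the transition prior
    logW = -0.5 * ((y[n] - x) / 0.5) ** 2        # weight = likelihood p(y_n | x_n)
    W = np.exp(logW - logW.max()); W /= W.sum()  # normalize (log-space for stability)
    est[n] = np.sum(W * x)                       # filtered mean before resampling
    x = x[rng.choice(N, size=N, p=W)]            # resample at every step (SIR)
print(np.mean((est - x_true) ** 2))              # filtered MSE on this run
```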
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha}\, \frac{p(y_n|x_n^{(i)})\, p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular: CONDENSATION

Figure: CONDENSATION
Criteria for optimal filtering III

An optimal filter is "optimal" under a particular criterion.

Minimum conditional inaccuracy, defined as

E_{p(x,y)}[-\log \hat{p}(x|y)] = \int p(x, y) \log \frac{1}{\hat{p}(x|y)}\, dx\, dy

Minimum conditional KL divergence:

KL(p\|\hat{p}) = \int p(x, y) \log \frac{p(x, y)}{\hat{p}(x|y)\, p(x)}\, dx\, dy

where the KL divergence is a measure of discrepancy between distributions satisfying KL(p\|\hat{p}) \ge 0; it is 0 only when the two distributions are the same.
Criteria for optimal filtering IV

An optimal filter is "optimal" under a particular criterion.

Minimum free energy: minimizing the free energy maximizes a lower bound on the log-likelihood. It is defined as

F(Q; P) \equiv E_{Q(x)}[-\log P(x|y)]
= E_{Q(x)}\left[\log \frac{Q(x)}{P(x|y)}\right] - E_{Q(x)}[\log Q(x)]
= KL(Q\|P) + H(Q)

This minimization can be done using an (EM-type) algorithm, alternating

Q(x_{n+1}) \leftarrow \arg\min_Q F(Q, P)
\hat{x}_{n+1} \leftarrow \arg\min_x F(Q, P)
Which criterion to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood but not the denominator (integration), and is thereby computationally less expensive.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are closely related.
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

E[\|x_n - \hat{x}_n\|_2^2 \mid y_{0:n}] = \int \|x_n - \hat{x}_n\|_2^2\; p(x_n|y_{0:n})\, dx_n

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited

In practice we are interested in the discrete simplified case:

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

When the dynamic system is linear Gaussian, this reduces to

x_{n+1} = F_{n+1,n} x_n + w_n
y_n = G_n x_n + v_n

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
Kalman filter: Forward equations I

We start by defining the messages

\alpha(z_n) = N(z_n | \mu_n, V_n)

Using the HMM recursion formulas for continuous variables, we have

c_n \alpha(z_n) = p(x_n|z_n) \int \alpha(z_{n-1})\, p(z_n|z_{n-1})\, dz_{n-1}

Substituting the conditionals, we have

c_n N(z_n|\mu_n, V_n) = N(x_n|C z_n, \Sigma) \int N(z_{n-1}|\mu_{n-1}, V_{n-1})\, N(z_n|A z_{n-1}, \Gamma)\, dz_{n-1}
= N(x_n|C z_n, \Sigma)\, N(z_n|A \mu_{n-1}, P_{n-1})

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

P_{n-1} = A V_{n-1} A^T + \Gamma
Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = N(x_n | C A \mu_{n-1},\; C P_{n-1} C^T + \Sigma)

where the Kalman gain matrix is defined as

K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}

The initial conditions are given by

\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0), \quad V_1 = (I - K_1 C) V_0
c_1 = N(x_1 | C \mu_0,\; C V_0 C^T + \Sigma), \quad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}

The interpretation is making predictions and then correcting them with K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^{N} c_n.
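These recursions translate directly into code; a minimal NumPy sketch of the forward pass in this notation (z latent, x observed; the model matrices A, C, Γ, Σ and the initial μ0, V0 are inputs):

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward (filtering) pass; X has shape (T, d_obs).

    Returns the filtered means mu_n and covariances V_n for each step.
    """
    mus, Vs = [], []
    mu, V = mu0, V0
    for n, x in enumerate(X):
        if n == 0:
            P = V0                       # initial step uses V0 directly
        else:
            P = A @ V @ A.T + Gamma      # P_{n-1}
            mu = A @ mu                  # predicted mean A mu_{n-1}
        S = C @ P @ C.T + Sigma          # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)   # Kalman gain K_n
        mu = mu + K @ (x - C @ mu)       # correction step
        V = (np.eye(len(mu)) - K @ C) @ P
        mus.append(mu); Vs.append(V)
    return np.array(mus), np.array(Vs)
```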
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n|y_{0:n}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of the ODE of the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.
Extended Kalman filter (EKF)

Recall the equations of motion:

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}_n}, \quad
G_n = \left.\frac{\partial g(x)}{\partial x}\right|_{x=\hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
Numerical approximations

Monte Carlo sampling approximation (e.g. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Monte Carlo sampling

It's a brute-force technique which requires that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from a probability distribution P(x), so that

\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

for which E[\hat{f}_N] = E[f] and Var[\hat{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \hat{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N}\, (\hat{f}_N - E[f]) \sim N(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.
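For instance, a two-line estimate of E[x^2] = 1 under x ~ N(0,1) (an illustrative choice of f and P):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)   # i.i.d. samples x^(i) from P = N(0, 1)
print(np.mean(x**2))             # ~1.0; the error is O(N^{-1/2}), ~1e-3 here
```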
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x)\, dP(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) = \hat{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x)\, dP(x)?
Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important for high-dimensional spaces, where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^{N} W(x^{(i)})\, f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
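A minimal sketch with both densities known in closed form (illustrative choices: target p = N(3, 1), proposal q = N(0, 3^2)), estimating E_p[x] = 3:

```python
import numpy as np

rng = np.random.default_rng(0)

p = lambda x: np.exp(-0.5 * (x - 3.0)**2) / np.sqrt(2 * np.pi)          # target N(3,1)
q = lambda x: np.exp(-0.5 * (x / 3.0)**2) / (3.0 * np.sqrt(2 * np.pi))  # proposal N(0,9)

N = 200_000
x = rng.normal(0.0, 3.0, size=N)     # draws from the proposal q
W = p(x) / q(x)                      # importance weights W(x^(i))
print(np.mean(W * x))                # estimates E_p[x] = 3
```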
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure the importance weights are normalized, we use

\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})\, f(x^{(i)}), \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x)\, \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \left(\frac{f(x)\, p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)|\, p(x), so as to further reduce the true variance.

The (self-normalized) estimator is biased but consistent.
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X; the sampling proceeds as follows:

for n = 1 to N do
  Sample x ~ q(x) and u ~ U(0, 1)
  if u > p(x) / (C q(x)) then
    Reject x and repeat the sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
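A small sketch of this loop, drawing from the unnormalized target p̃(x) = x(1 − x) on [0, 1] (a Beta(2,2) up to scale) with a uniform proposal; the bound C = 0.25 satisfies p̃(x) ≤ C q(x), and all choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda x: x * (1 - x)      # unnormalized target on [0,1], max value 0.25
C = 0.25                             # p_tilde(x) <= C * q(x) with q = U(0,1)

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                # x ~ q
    u = rng.uniform()                # u ~ U(0,1)
    if u <= p_tilde(x) / (C * 1.0):  # accept with probability p_tilde / (C q)
        samples.append(x)

print(np.mean(samples))              # ~0.5, the mean of Beta(2,2)
```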
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get samples when the ratio p(x)/(C q(x)) is close to zero.
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{0:n}) = p(x_0)\, p(x_1|x_0) \cdots p(x_n|x_0, \ldots, x_{n-1})
q(x_{0:n}) = q_0(x_0)\, q_1(x_1|x_0) \cdots q_n(x_n|x_0, \ldots, x_{n-1})

The weights can be recursively calculated as

W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\, \frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}
Remarks on Sequential Importance Sampling

The advantage of SIS is that it does not rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, to cope with this situation, a resampling step is suggested after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights:

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
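The resampling step itself is just a categorical draw with the normalized weights; a minimal NumPy sketch of multinomial resampling (systematic resampling is a common lower-variance alternative not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def multinomial_resample(x, W):
    """Resample particles x with replacement according to normalized weights W."""
    idx = rng.choice(len(x), size=len(x), p=W)
    return x[idx]

x = np.array([0.1, 0.2, 0.3, 0.4])
W = np.array([0.05, 0.05, 0.2, 0.7])
print(multinomial_resample(x, W))    # mostly replicates of 0.4
```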
Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken every k time steps (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \ldots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
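A minimal sketch for a bivariate Gaussian with correlation ρ (an illustrative target, chosen because both conditionals are known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.9, 5000
x1, x2 = 0.0, 0.0
chain = np.empty((M, 2))
for n in range(M):
    # For a standard bivariate normal: p(x1 | x2) = N(rho * x2, 1 - rho^2),
    # and symmetrically for x2 given x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    chain[n] = (x1, x2)
print(np.corrcoef(chain[1000:].T)[0, 1])   # ~0.9 after burn-in
```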
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations

Monte Carlo sampling approximation (e.g. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, nonlinear filtering can be tackled with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\, dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x)\, dx \approx (2\pi)^{N_x/2}\, f(\hat{x})\, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
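A one-dimensional sketch of this formula, recovering the normalizing constant of an unnormalized Gaussian bump (the target, the grid search for the mode and the finite-difference Hessian are all illustrative choices):

```python
import numpy as np

# Illustrative target: unnormalized Gaussian bump; true integral = 0.3 * sqrt(2*pi)
f = lambda x: np.exp(-0.5 * (x - 1.0)**2 / 0.3**2)

xs = np.linspace(-4.0, 6.0, 1_000_001)
x_hat = xs[np.argmax(f(xs))]                 # mode of f via grid search

h = 1e-4                                     # central-difference Hessian of log f
d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h**2

laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5
print(laplace, 0.3 * np.sqrt(2 * np.pi))     # both ~0.752; exact here since f is Gaussian
```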
Iterative Quadrature

Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x)\, p(x)\, dx \approx \sum_{k=1}^{m} c_k\, f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
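For example, Gauss-Hermite quadrature handles a Gaussian weighting function; NumPy provides the nodes and weights, and the change of variables below maps the e^{-x^2} convention to a standard normal:

```python
import numpy as np

def gauss_hermite_expectation(f, m=10):
    """Approximate E[f(x)] for x ~ N(0,1) with m-point Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(m)        # nodes/weights for exp(-x^2)
    return np.sum(w * f(np.sqrt(2.0) * x)) / np.sqrt(np.pi)

print(gauss_hermite_expectation(lambda x: x**2))     # E[x^2] = 1, exact here
```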
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).

If the state space is continuous, we can always discretize it into N_z discrete cell states, and then a grid-based method can be used to approximate the posterior density.

The disadvantage of the grid-based method is that the state space cannot be partitioned unevenly, to give a greater resolution to the states with higher density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including mean, covariance and higher order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^{m} c_j\, N(x | \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Criteria for optimal filtering IV
An optimal filter is rdquooptimalrdquo under a particular criteria
Minimum free energy It is a lower bound of maximum log-likelihoodwhich is aimed to minimize
F(Q P) equiv EQ(x)[minus log P(x|y)]
= EQ(x)[logQ(x)
P(x|y)]minus EQ(x)[log Q(x)]
= KL(Q||P)minus H(Q)
This minimization can be done using (EM) algorithm
Q(xn+1) larr argmaxQ
F(Q P)
xn+1 larr argmaxx
F(Q P)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 18 69
Which criteria to choose
All these criteria are valid for state and parameter estimation
MMSE requires the computation of the prior likelihood and evidence
MAP requires the computation of the prior and likelihood but not thedenominator (integration) and thereby more computational inexpensive
MAP estimate has a drawback especially in a high-dimensional space Highprobability density does not imply high probability mass
A narrow spike with very small width (support) can have a very high densitybut the actual probability of estimated state belonging to it is small
Hence the width of the mode is more important than its height in thehigh-dimensional case
The last three criteria are all ML oriented They are very related
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 19 69
Bayesian filtering
The criterion of optimality used for Bayesian filtering is the Bayes risk ofMMSE
E [||xn minus xn||22|yn0] =
int||xn minus xn||22p(xn|yn0)dxn
Bayesian filtering is optimal in a sense that it seeks the posterior distributionwhich integrates and uses all of available information expressed byprobabilities
As time proceeds one needs infinite computing power and unlimitedmemory to calculate the optimal solution except in some special cases (eglinear Gaussian)
In general we can only seek a suboptimal or locally optimal solution
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 20 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)

Rejection sampling

Sequential importance sampling

Sampling-importance resampling

Stratified sampling

Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling

Hybrid Monte Carlo (HMC)

Quasi-Monte Carlo (QMC)
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\[ \int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx \]

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\[ \hat{f} = \frac{1}{N}\sum_{i=1}^N W\big(x^{(i)}\big)f\big(x^{(i)}\big) \]

where W(x^{(i)}) = p(x^{(i)})/q(x^{(i)}) are called the importance weights.
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

\[ \hat{f} = \sum_{i=1}^N \tilde{W}\big(x^{(i)}\big)f\big(x^{(i)}\big) \quad\text{with}\quad \tilde{W}\big(x^{(i)}\big) = \frac{W\big(x^{(i)}\big)}{\sum_{i=1}^N W\big(x^{(i)}\big)} \]

The variance of the estimate is given by

\[ \mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}\big[f(x)W(x)\big] = \frac{1}{N}\mathrm{Var}\Big[f(x)\frac{p(x)}{q(x)}\Big] = \frac{1}{N}\int \frac{\big(f(x)p(x)\big)^2}{q(x)}\,dx - \frac{\big(E[f(x)]\big)^2}{N} \]

The variance can be reduced when q(x) is chosen to

match the shape of p(x), so as to approximate the true variance;

match the shape of |f(x)|p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
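A small sketch of self-normalized importance sampling following these formulas; the target, proposal and test function below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def snis(f, log_p, log_q, sample_q, n):
    """Self-normalized importance sampling of E_p[f] using draws from q.
    log_p may be unnormalized; the normalizing constant cancels."""
    x = sample_q(n)
    log_w = log_p(x) - log_q(x)
    w = np.exp(log_w - log_w.max())   # stabilize before normalizing
    w /= w.sum()                      # normalized weights
    return np.sum(w * f(x))

# Target p = N(0,1) (unnormalized), proposal q = heavier-tailed N(0, 2^2).
log_p = lambda x: -0.5 * x**2
log_q = lambda x: -0.5 * (x / 2.0)**2 - np.log(2.0 * np.sqrt(2 * np.pi))
est = snis(lambda x: x**2, log_p, log_q,
           lambda n: 2.0 * rng.standard_normal(n), 100_000)
print(est)   # close to E_p[x^2] = 1; the estimator is biased but consistent
```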
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.

Assume there exists a known constant C < ∞ such that p(x) < Cq(x) for every x ∈ X. The sampling proceeds as:

for n = 1 to N do
    Sample u ~ U(0, 1)
    Sample x ~ q(x)
    if u > p(x) / (C q(x)) then
        Repeat the sampling
    end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:

if C is too small, the samples are not reliable because of the low rejection rate;

if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

\[ p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x) \]

and the acceptance rate for sample x is \frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}.
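A minimal sketch of the rejection loop above, assuming a Beta(2,2) target and a uniform proposal (both chosen here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(p, q_sample, q_pdf, C, n):
    """Draw n exact samples from p, given p(x) <= C q(x) for all x."""
    out = []
    while len(out) < n:
        x = q_sample()
        u = rng.uniform()
        if u <= p(x) / (C * q_pdf(x)):   # accept with probability p/(C q)
            out.append(x)
    return np.array(out)

# Target: Beta(2,2) density on [0,1]; proposal: Uniform(0,1).
p = lambda x: 6.0 * x * (1.0 - x)        # maximum value 1.5 at x = 0.5
C = 1.5                                   # so p(x) <= C * 1 everywhere
xs = rejection_sample(p, rng.uniform, lambda x: 1.0, C, 10_000)
print(xs.mean())                          # ~0.5; overall acceptance rate is 1/C
```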
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

\[ q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0}) \]

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

\[ p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1}) \]
\[ q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1}) \]

The weights can be recursively calculated as

\[ W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})} \]
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is used after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights:

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} ∝ p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
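The resampling step can be sketched in a few lines of numpy. Multinomial resampling is assumed here; other schemes (stratified, systematic) fit the same interface:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_multinomial(particles, weights):
    """Resample with replacement: each x^(i) is selected with probability
    proportional to its normalized weight; weights are then reset to 1/N."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)   # multinomial draw of indices
    return particles[idx], np.full(n, 1.0 / n)

# Toy usage: skewed weights lead to duplicated heavy particles.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
w = np.array([0.01, 0.04, 0.05, 0.30, 0.60])
print(resample_multinomial(x, w))
```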
Remarks on Sampling Importance Resampling

Resampling can be taken at every step, or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).

Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;

when the importance weights are skewed, resampling provides chances for selecting important samples and rejuvenates the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
    end for
end for
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
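A toy Gibbs sampler matching the two-dimensional illustration above. The bivariate Gaussian target (with correlation rho) is an assumed example, chosen because its conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho, n_iter):
    """Gibbs sampler for a 2-D N(0, [[1, rho], [rho, 1]]) target:
    alternately draw each coordinate from its conditional Gaussian."""
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    s = np.sqrt(1.0 - rho**2)   # conditional standard deviation
    for n in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()   # draw x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()   # draw x2 | x1
        samples[n] = (x1, x2)
    return samples

s = gibbs_bivariate_gaussian(rho=0.8, n_iter=50_000)
print(np.corrcoef(s.T)[0, 1])   # ~0.8, recovering the target correlation
```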
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be carried out with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x)\,dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\[ \int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,\big|{-\nabla\nabla \log f(\hat{x})}\big|^{-1/2} \]

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
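A one-dimensional sketch of the Laplace formula above, using a finite-difference second derivative in place of the Hessian (the test integrand is an illustrative choice):

```python
import numpy as np

def laplace_integral_1d(log_f, x_hat, eps=1e-4):
    """Laplace approximation of \\int f(x) dx in 1-D: fit a Gaussian at the
    mode x_hat of f, with curvature from a finite-difference second
    derivative of log f."""
    h = -(log_f(x_hat + eps) - 2 * log_f(x_hat) + log_f(x_hat - eps)) / eps**2
    return np.sqrt(2 * np.pi / h) * np.exp(log_f(x_hat))

# Sanity check on the unnormalized Gaussian f(x) = exp(-x^2/2): the exact
# integral is sqrt(2*pi), and the Laplace approximation is exact here.
print(laplace_integral_1d(lambda x: -0.5 * x**2, x_hat=0.0),
      np.sqrt(2 * np.pi))
```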
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\[ \int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^m c_k f(x_k) \]

where p(x) is treated as a weighting function and the x_k are the quadrature points.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
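As an example of such a quadrature rule, here is a Gauss-Hermite sketch for Gaussian weighting functions; the nonlinearity sin(x) is an assumed test case with a known closed form:

```python
import numpy as np

def gauss_hermite_expectation(f, mean, std, m=20):
    """Approximate E[f(x)] for x ~ N(mean, std^2) with an m-point
    Gauss-Hermite rule: int f(x) p(x) dx ~= sum_k c_k f(x_k)."""
    nodes, weights = np.polynomial.hermite.hermgauss(m)
    x_k = mean + np.sqrt(2.0) * std * nodes   # change of variables
    c_k = weights / np.sqrt(np.pi)
    return np.sum(c_k * f(x_k))

# Smooth nonlinearity: E[sin(x)] for x ~ N(1, 0.5^2) has the closed form
# sin(1) * exp(-0.5^2 / 2).
print(gauss_hermite_expectation(np.sin, 1.0, 0.5),
      np.sin(1.0) * np.exp(-0.125))
```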
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation

Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

\[ m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N \big|x^{(i)}\big|^k \]

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

\[ p(x) = \sum_{j=1}^m c_j\,\mathcal{N}(\hat{x}_j, \Sigma_j) \]

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
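A small sketch of the moment bookkeeping this relies on: collapsing a one-dimensional Gaussian mixture to its overall mean and variance (the mixture parameters are illustrative):

```python
import numpy as np

def mixture_moments(c, mu, var):
    """First two moments of a 1-D Gaussian mixture sum_j c_j N(mu_j, var_j):
    each component contributes through its own mean and variance."""
    c, mu, var = map(np.asarray, (c, mu, var))
    mean = np.sum(c * mu)
    # law of total variance: within-component + between-component spread
    second = np.sum(c * (var + mu**2))
    return mean, second - mean**2

# Bimodal (non-Gaussian) posterior approximated by two Gaussian components.
print(mixture_moments(c=[0.3, 0.7], mu=[-1.0, 2.0], var=[0.5, 1.0]))
```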
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations

Monte Carlo sampling

Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure: the higher the probability, the denser the particle concentration.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

\[ p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^N \delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0}) \]

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).

With this approximation we can estimate the mean of a nonlinear function:

\[ E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^N f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x) \]
Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

\[ W_n(x_n) = \frac{p(y_{n:0}|x_n)p(x_n)}{q(x_n|y_{n:0})} \]

We can then write

\[ E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}q(x_n|y_{n:0})\,dx_n = \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})}q(x_n|y_{n:0})\,dx_n \]
\[ = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)p(x_n)\,dx_n} = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]} \]
Sequential Monte Carlo estimation III

We have written

\[ E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]} \]

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

\[ E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^N W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^N \tilde{W}\big(x_n^{(i)}\big)f\big(x_n^{(i)}\big) \equiv \hat{f}(x) \]

where the normalized weights are defined as

\[ \tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{i=1}^N W_n\big(x_n^{(i)}\big)} \]
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

\[ q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^n q(x_t|x_{t-1:0}, y_{t:0}) \]

As before, the posterior can be written as

\[ p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})} \]

We can then create a recursive rule to update the weights:

\[ W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)}|y_{n:0}\big)}{q\big(x_{n:0}^{(i)}|y_{n:0}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)p\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)q\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)} \]
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential importance sampling (SIS) filter

SIR filter

Auxiliary particle filter (APF)

Rejection particle filter

MCMC particle filter

etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).

Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

\[ W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)} \]

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

\[ N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N \]

In practice this cannot be computed, so we approximate it with the normalized weights:

\[ N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^N \big(\tilde{W}(x_{n:0})\big)^2} \]

When N_eff is below a threshold P, resampling is performed.

N_eff can also be used to combine rejection and importance sampling.
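The approximate N_eff is one line of numpy; the weight vectors below are made-up examples showing the two extremes:

```python
import numpy as np

def effective_sample_size(normalized_weights):
    """N_eff ~= 1 / sum_i W_i^2: equals N for uniform weights and
    approaches 1 when a single particle dominates."""
    w = np.asarray(normalized_weights)
    return 1.0 / np.sum(w**2)

print(effective_sample_size(np.full(100, 0.01)))       # 100.0 (uniform)
print(effective_sample_size([0.97] + [0.001] * 30))    # ~1.06 (degenerate)
```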
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate the weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Compute N_eff = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}))^2
    if N_eff < P then
        Generate new particles x_n^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
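A compact Python sketch of this algorithm. The proposal, transition and likelihood densities are passed in as callables, and the threshold P triggers the resampling; all model-specific pieces are assumptions supplied by the caller (a scalar state is assumed for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_pf(y, x0_sample, proposal_sample, log_proposal,
           log_trans, log_lik, n_particles, threshold):
    """SIS particle filter with adaptive resampling (a sketch)."""
    x = x0_sample(n_particles)
    logw = np.zeros(n_particles)
    means = []
    for yn in y:
        x_new = proposal_sample(x, yn)
        # W_n = W_{n-1} * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
        logw += (log_lik(yn, x_new) + log_trans(x_new, x)
                 - log_proposal(x_new, x, yn))
        x = x_new
        w = np.exp(logw - logw.max())
        w /= w.sum()                          # normalized weights
        means.append(np.sum(w * x))           # MMSE estimate before resampling
        if 1.0 / np.sum(w**2) < threshold:    # N_eff below P: resample
            idx = rng.choice(n_particles, n_particles, p=w)
            x, logw = x[idx], np.zeros(n_particles)   # weights reset to 1/N
    return np.array(means)
```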
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using the transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
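A runnable bootstrap-filter sketch on a hypothetical linear-Gaussian toy model (the parameters 0.9, 1 and 0.5 are illustrative, not from the lecture). With the transition prior as proposal, the weight reduces to the likelihood, and resampling happens at every step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: x_n = 0.9 x_{n-1} + w_n, y_n = x_n + v_n,
# with w_n ~ N(0, 1) and v_n ~ N(0, 0.5^2).
T, N = 100, 1000
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.standard_normal()
y = x_true + 0.5 * rng.standard_normal(T)

x = rng.standard_normal(N)                      # x_0^(i) ~ p(x_0)
est = np.zeros(T)
for n in range(T):
    x = 0.9 * x + rng.standard_normal(N)        # propose from p(x_n | x_{n-1})
    w = np.exp(-0.5 * ((y[n] - x) / 0.5) ** 2)  # weight reduces to p(y_n | x_n)
    w /= w.sum()
    est[n] = np.sum(w * x)                      # posterior mean before resampling
    x = x[rng.choice(N, N, p=w)]                # resample at every step
print(np.sqrt(np.mean((est - x_true) ** 2)))    # tracking RMSE
```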
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

\[ W_n = W_{n-1}^{\alpha}\,\frac{p\big(y_n|x_n^{(i)}\big)p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)} \]

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION

Figure: CONDENSATION
Which criteria to choose?

All these criteria are valid for state and parameter estimation.

MMSE requires the computation of the prior, likelihood and evidence.

MAP requires the computation of the prior and likelihood, but not the denominator (integration), and is thereby computationally cheaper.

The MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass.

A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state belonging to it is small.

Hence the width of the mode is more important than its height in the high-dimensional case.

The last three criteria are all ML oriented; they are very related.
Bayesian filtering

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE:

\[ E\big[\|x_n - \hat{x}_n\|_2^2 \,\big|\, y_{n:0}\big] = \int \|x_n - \hat{x}_n\|_2^2\, p(x_n|y_{n:0})\,dx_n \]

Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all available information expressed by probabilities.

As time proceeds, one needs infinite computing power and unlimited memory to calculate the optimal solution, except in some special cases (e.g. linear Gaussian).

In general we can only seek a suboptimal or locally optimal solution.
Kalman filter revisited

In practice we are interested in the discrete simplified case:

\[ x_{n+1} = f(x_n, w_n) \]
\[ y_n = g(x_n, v_n) \]

When the dynamic system is linear Gaussian, this reduces to

\[ x_{n+1} = F_{n+1,n}\, x_n + w_n \]
\[ y_n = G_n x_n + v_n \]

with F_{n+1,n} the transition matrix and G_n the measurement matrix.

This is the Kalman filter, and we saw that by propagating the sufficient statistics (i.e. mean and covariance) we can solve the system analytically.

In the general case it is not tractable, and we will rely on approximations.
Kalman filter: Forward equations I

We start by defining the messages

\[ \alpha(z_n) = \mathcal{N}(z_n|\mu_n, V_n) \]

Using the HMM recursion formulas for continuous variables, we have

\[ c_n\,\alpha(z_n) = p(x_n|z_n)\int \alpha(z_{n-1})\,p(z_n|z_{n-1})\,dz_{n-1} \]

Substituting the conditionals, we have

\[ c_n\,\mathcal{N}(z_n|\mu_n, V_n) = \mathcal{N}(x_n|Cz_n, \Sigma)\int \mathcal{N}(z_{n-1}|\mu_{n-1}, V_{n-1})\,\mathcal{N}(z_n|Az_{n-1}, \Gamma)\,dz_{n-1} = \mathcal{N}(x_n|Cz_n, \Sigma)\,\mathcal{N}(z_n|A\mu_{n-1}, P_{n-1}) \]

Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined

\[ P_{n-1} = AV_{n-1}A^T + \Gamma \]
Kalman filter: Forward equations II

Given the values of \mu_{n-1}, V_{n-1} and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:

\[ \mu_n = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1}) \]
\[ V_n = (I - K_nC)P_{n-1} \]
\[ c_n = \mathcal{N}(x_n|CA\mu_{n-1},\, CP_{n-1}C^T + \Sigma) \]

where the Kalman gain matrix is defined as

\[ K_n = P_{n-1}C^T(CP_{n-1}C^T + \Sigma)^{-1} \]

The initial conditions are given by

\[ \mu_1 = \mu_0 + K_1(x_1 - C\mu_0), \qquad V_1 = (I - K_1C)V_0 \]
\[ c_1 = \mathcal{N}(x_1|C\mu_0,\, CV_0C^T + \Sigma), \qquad K_1 = V_0C^T(CV_0C^T + \Sigma)^{-1} \]

The interpretation is: make a prediction, then correct it using the gain K_n.

The likelihood can be computed as p(X) = \prod_{n=1}^N c_n.
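A numpy sketch of this forward recursion under the stated linear-Gaussian model; the variable names mirror the slide (A, C, Gamma, Sigma, mu, V, P, K):

```python
import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward (alpha) recursion for z_n = A z_{n-1} + w, x_n = C z_n + v;
    returns the filtered means and covariances."""
    mus, Vs = [], []
    mu, V = mu0, V0
    P = V0                               # the first step uses V0 directly
    for n, x in enumerate(X):
        if n > 0:
            P = A @ V @ A.T + Gamma      # P_{n-1} = A V_{n-1} A^T + Gamma
            pred = A @ mu                # predicted mean A mu_{n-1}
        else:
            pred = mu0
        S = C @ P @ C.T + Sigma
        K = P @ C.T @ np.linalg.inv(S)   # Kalman gain K_n
        mu = pred + K @ (x - C @ pred)   # predict, then correct
        V = (np.eye(len(mu0)) - K @ C) @ P
        mus.append(mu); Vs.append(V)
    return np.array(mus), np.array(Vs)
```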
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x_n|y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF), or numerical approximations (e.g. particle filter) are usually used.
Extended Kalman filter (EKF)

Recall the equations of motion:

\[ x_{n+1} = f(x_n, w_n) \]
\[ y_n = g(x_n, v_n) \]

These equations are linearized in the EKF:

\[ F_{n+1,n} = \frac{\partial f(x)}{\partial x}\bigg|_{x=\hat{x}_n} \qquad G_n = \frac{\partial g(x)}{\partial x}\bigg|_{x=\hat{x}_{n|n-1}} \]

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n|y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.
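A sketch of one EKF step with the Jacobians computed numerically (a modeling convenience assumed here; in practice they are often derived analytically):

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Numerical Jacobian of func at x (central differences)."""
    x = np.asarray(x, float)
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x); dx[i] = eps
        J[:, i] = (np.asarray(func(x + dx)) - np.asarray(func(x - dx))) / (2 * eps)
    return J

def ekf_step(f, g, Q, R, mu, P, y):
    """One EKF predict/correct step: linearize f and g at the current
    estimate, then apply the standard Kalman update."""
    F = jacobian(f, mu)                  # F_{n+1,n} = df/dx at the estimate
    mu_pred = f(mu)
    P_pred = F @ P @ F.T + Q
    G = jacobian(g, mu_pred)             # G_n = dg/dx at the predicted state
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ (y - g(mu_pred))
    P_new = (np.eye(len(mu)) - K @ G) @ P_pred
    return mu_new, P_new
```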
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)

Gaussian/Laplace approximation

Iterative quadrature

Multi-grid method and point-mass approximation

Moment approximation

Gaussian sum approximation

Deterministic sampling approximation
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x)p(x)dx \approx \sum_{k=1}^m c_k f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ are the quadrature points.
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
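As a hedged illustration (not from the slides), the sketch below uses Gauss-Hermite quadrature from NumPy to approximate $E[f(x)]$ for a standard Gaussian weighting function; the integrand $f = \cos$ and the number of quadrature points are assumptions.

import numpy as np
from numpy.polynomial.hermite import hermgauss

# Gauss-Hermite quadrature: nodes/weights target the weight exp(-x^2), so a
# change of variables x -> sqrt(2)*x and a 1/sqrt(pi) factor adapt it to the
# standard Gaussian density N(0, 1).
nodes, weights = hermgauss(20)                 # m = 20 quadrature points
approx = np.sum(weights * np.cos(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

print(approx)                # E[cos(x)] = exp(-1/2) ~ 0.6065 for x ~ N(0,1)
print(np.exp(-0.5))

Because cos is smooth, twenty points already match the exact value to many digits, in line with the remark above about smooth nonlinear functions.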
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so regions of high density cannot be given a finer resolution.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_X x^k p(x)dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
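A minimal sketch of the sample-moment approximation (the Gamma target is an assumed example, chosen because its moments are known exactly):

import numpy as np

# Sample moments of a Gamma(3, 1) distribution; the exact k-th moments are
# E[x] = 3, E[x^2] = 12, E[x^3] = 60.
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, size=50_000)
for k in (1, 2, 3):
    print(k, (x**k).mean())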
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
$$p(x) = \sum_{j=1}^m c_j \mathcal{N}(x; \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$, and $\mu_j$, $\Sigma_j$ are the component means and covariances.
Any non-Gaussian density can be approximated to arbitrary accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
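A small sketch of a Gaussian-sum density (the weights, means, and variances below are illustrative assumptions, not values from the slides):

import numpy as np
from scipy.stats import norm

# Two-component Gaussian mixture approximating a bimodal density.
c = np.array([0.4, 0.6])             # mixture weights, c_j > 0, sum to 1
mu = np.array([-2.0, 1.5])           # component means
sigma = np.array([0.7, 1.0])         # component standard deviations

def p(x):
    # weighted sum of Gaussian densities evaluated at x
    return sum(cj * norm.pdf(x, mj, sj) for cj, mj, sj in zip(c, mu, sigma))

xs = np.linspace(-5.0, 5.0, 11)
print(p(xs))                         # density values on a coarse grid

A Gaussian sum filter would propagate each component with its own EKF and combine the results with these weights.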
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, which the particles fill according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves along time according to the state equation, with its evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
$$p(x_n|y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. draws from $p(x_n|y_{0:n})$.
By this approximation we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n)\hat{p}(x_n|y_{0:n})dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n)\delta(x_n - x_n^{(i)})dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define

$$W_n(x_n) = \frac{p(y_{0:n}|x_n)p(x_n)}{q(x_n|y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})} q(x_n|y_{0:n})dx_n = \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n|y_{0:n})dx_n$$
$$= \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})dx_n}{\int p(y_{0:n}|x_n)p(x_n)dx_n} = \frac{\int f(x_n)W_n(x_n)q(x_n|y_{0:n})dx_n}{\int W_n(x_n)q(x_n|y_{0:n})dx_n} = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)})f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
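A hedged Python sketch of this self-normalized importance sampling estimate for a static expectation (the target, proposal, and integrand are illustrative assumptions):

import numpy as np
from scipy.stats import norm, t

# Estimate E[x^2] = 1 under the target p = N(0, 1), sampling from a
# heavier-tailed proposal q = Student-t with 3 degrees of freedom.
rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)          # i.i.d. draws from the proposal q

W = norm.pdf(x) / t.pdf(x, df=3)          # unnormalized weights p(x)/q(x)
W_tilde = W / W.sum()                     # normalized weights, sum to 1
f_hat = np.sum(W_tilde * x**2)            # self-normalized estimate
print(f_hat)                              # close to 1

The heavy-tailed proposal keeps the weights well behaved, echoing the earlier remark that q should have heavier tails than p.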
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$$q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t|x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1}) \frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.
Let's assume that $q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure for degeneracy is the so-called effective sample size:

$$N_{eff} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$$

In practice this cannot be computed exactly, so we approximate it by

$$\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}$$

When $\hat{N}_{eff}$ is below a threshold $P$, resampling is performed.
$N_{eff}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
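A small sketch of the approximate effective sample size computed from normalized weights (the weight vector and threshold below are assumed values):

import numpy as np

def effective_sample_size(W_tilde):
    # N_eff ~= 1 / sum_i W_tilde_i^2; equals N for uniform weights and
    # approaches 1 as the weights concentrate on a single particle.
    return 1.0 / np.sum(W_tilde**2)

W = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # skewed normalized weights
P = 0.5 * len(W)                            # resampling threshold
print(effective_sample_size(W))             # ~1.94, far below N = 5
print(effective_sample_size(W) < P)         # True -> trigger resampling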
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
        Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate weights $W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$
    end for
    for i = 1, ..., N do
        Normalize the weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$
    end for
    Compute $\hat{N}_{eff} = \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}))^2}$
    if $\hat{N}_{eff} < P$ then
        Generate new $x_n^{(j)}$ by resampling with replacement $N$ times from $\{x_{0:n}^{(i)}\}$ with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_{0:n}^{(i)}$
        Reset the weights $W_n^{(i)} = \frac{1}{N}$
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
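The resampling step of this algorithm can be sketched as follows (multinomial resampling is one option among several; the example particles and weights are assumptions):

import numpy as np

# Multinomial resampling: draw N indices with replacement in proportion to
# the normalized weights, then reset all weights to 1/N.
def resample(particles, W_tilde, rng):
    N = len(particles)
    idx = rng.choice(N, size=N, p=W_tilde)
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(0)
particles = np.array([-1.0, 0.0, 0.5, 2.0])
W_tilde = np.array([0.05, 0.05, 0.8, 0.1])
particles, W_tilde = resample(particles, W_tilde, rng)
print(particles)     # dominated by replicates of the high-weight particle
print(W_tilde)       # uniform 1/N after the reset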
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
    end for
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
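Putting the pieces together, here is a hedged end-to-end sketch of this bootstrap/SIR filter on an assumed one-dimensional toy model (the coefficients and noise levels are illustrative, not from the slides):

import numpy as np

# Toy state-space model (assumed):
#   x_n = 0.9 x_{n-1} + w_n,  w_n ~ N(0, 1)     -> transition prior = proposal
#   y_n = x_n + v_n,          v_n ~ N(0, 0.5^2) -> likelihood
rng = np.random.default_rng(0)
T, N = 50, 1000
sigma_w, sigma_v = 1.0, 0.5

# simulate a ground-truth trajectory and its observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + sigma_w * rng.standard_normal()
y = x_true + sigma_v * rng.standard_normal(T)

particles = rng.standard_normal(N)            # x_0^(i) ~ p(x_0) = N(0, 1)
estimates = np.zeros(T)
for n in range(T):
    # importance sampling from the transition prior p(x_n | x_{n-1}^(i))
    particles = 0.9 * particles + sigma_w * rng.standard_normal(N)
    # weight update W_n^(i) = p(y_n | x_n^(i)), then normalization
    W = np.exp(-0.5 * ((y[n] - particles) / sigma_v) ** 2)
    W /= W.sum()
    estimates[n] = np.sum(W * particles)      # posterior-mean estimate
    # resampling with replacement according to the normalized weights
    particles = particles[rng.choice(N, size=N, p=W)]

print(np.mean((estimates - x_true) ** 2))     # filtering MSE on this run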
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: The CONDENSATION algorithm.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Kalman filter revisited
In practice we are interested in the discrete simplified case
xn+1 = f(xnwn)
yn = g(xn vn)
When the dynamic system is linear Gaussian this reduces to
xn+1 = Fn+1nxn + wn
yn = Gnxn + vn
with Fn+1n the transition matrix and Gn the measurement matrix
This is the Kalman filter and we saw that by propagating sufficientstatistics (ie mean and covariance) we can solve the system analytically
In the general case it is not tractable and we will rely on approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 21 69
Kalman filter Forward equations I
We start by defining the messages
\alpha(z_n) = \mathcal{N}(z_n | \mu_n, V_n)
Using the HMM recursion formulas for continuous variables, we have
c_n \alpha(z_n) = p(x_n | z_n) \int \alpha(z_{n-1}) p(z_n | z_{n-1}) \, dz_{n-1}
Substituting the conditionals, we have
c_n \mathcal{N}(z_n | \mu_n, V_n) = \mathcal{N}(x_n | C z_n, \Sigma) \int \mathcal{N}(z_{n-1} | \mu_{n-1}, V_{n-1}) \, \mathcal{N}(z_n | A z_{n-1}, \Gamma) \, dz_{n-1}
= \mathcal{N}(x_n | C z_n, \Sigma) \, \mathcal{N}(z_n | A \mu_{n-1}, P_{n-1})
Here we assume that \mu_{n-1} and V_{n-1} are known, and we have defined
P_{n-1} = A V_{n-1} A^T + \Gamma
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of \mu_{n-1} and V_{n-1}, and the new observation x_n, we can evaluate the Gaussian marginal for z_n, having mean \mu_n and covariance V_n, as well as the normalization coefficient c_n:
\mu_n = A \mu_{n-1} + K_n (x_n - C A \mu_{n-1})
V_n = (I - K_n C) P_{n-1}
c_n = \mathcal{N}(x_n | C A \mu_{n-1}, \, C P_{n-1} C^T + \Sigma)
where the Kalman gain matrix is defined as
K_n = P_{n-1} C^T (C P_{n-1} C^T + \Sigma)^{-1}
The initial conditions are given by
\mu_1 = \mu_0 + K_1 (x_1 - C \mu_0), \quad V_1 = (I - K_1 C) V_0
c_1 = \mathcal{N}(x_1 | C \mu_0, \, C V_0 C^T + \Sigma), \quad K_1 = V_0 C^T (C V_0 C^T + \Sigma)^{-1}
The interpretation is making a prediction and then a correction with the gain K_n.
The likelihood can be computed as p(X) = \prod_{n=1}^N c_n
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
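A minimal numpy sketch of this forward recursion, in the notation above (z latent, x observed); the model matrices A, C, Γ, Σ and the prior μ0, V0 are assumed given.

import numpy as np

def kalman_forward(X, A, C, Gamma, Sigma, mu0, V0):
    """Forward (filtering) pass: returns the lists of mu_n and V_n."""
    I = np.eye(len(mu0))
    K = V0 @ C.T @ np.linalg.inv(C @ V0 @ C.T + Sigma)    # K_1
    mu = mu0 + K @ (X[0] - C @ mu0)                       # mu_1
    V = (I - K @ C) @ V0                                  # V_1
    mus, Vs = [mu], [V]
    for x in X[1:]:
        P = A @ V @ A.T + Gamma                           # P_{n-1}
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma)  # Kalman gain K_n
        mu = A @ mu + K @ (x - C @ A @ mu)                # predict, then correct
        V = (I - K @ C) @ P
        mus.append(mu); Vs.append(V)
    return mus, Vs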
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.
The nonlinear filtering problem consists in finding p(x_n | y_{0:n}).
The number of variables is infinite, but not all of them are of equal importance.
Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.
Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF), or numerical approximations (e.g. particle filter) are usually used.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion:
x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)
These equations are linearized in the EKF:
F_{n+1,n} = \frac{d f(x)}{d x}\Big|_{x = \hat{x}_n}, \quad G_{n+1,n} = \frac{d g(x)}{d x}\Big|_{x = \hat{x}_{n|n-1}}
Then the conventional Kalman filter can be employed.
Because the EKF always approximates the posterior p(x_n | y_{0:n}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
It's a brute-force technique: provided that one can draw i.i.d. samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), then
\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N
for which E[\bar{f}_N] = E[f] and Var[\bar{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \bar{f}_N(x) converges to E[f(x)] almost surely.
The convergence rate is assessed by the Central Limit Theorem:
\sqrt{N} \, (\bar{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)
where \sigma^2 is the variance of f(x). The error rate is of order O(N^{-1/2}).
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
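For example, a toy Monte Carlo estimate of E[f(x)] for f(x) = x² under P = N(0, 1), whose exact value is 1:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # i.i.d. samples from P = N(0,1)
f_bar = np.mean(x**2)                 # Monte Carlo estimate of E[x^2] = 1
se = np.std(x**2) / np.sqrt(len(x))   # standard error, shrinking as O(N^{-1/2})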
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate
\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N
There are two fundamental problems:
How to draw samples from a probability distribution P(x)?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x) \, dP(x)?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to the uncertainties of the model.
Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.
The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:
\int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate
\bar{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})
where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
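A small sketch with an assumed toy pair: target p = N(0, 1), heavier-tailed proposal q = N(0, 2²), estimating E_p[x²] = 1:

import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=50_000)         # samples from the proposal q
W = gauss_pdf(x, 0, 1) / gauss_pdf(x, 0, 2)   # importance weights p/q
f_bar = np.mean(W * x**2)                     # importance sampling estimate of E_p[x^2]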
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure that the importance weights are normalized, we use
\bar{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)}), \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}
The variance of the estimate is given by
Var[\bar{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \left(\frac{f(x) p(x)}{q(x)}\right)^2 dx - \frac{(E[f(x)])^2}{N}
The variance can be reduced when q(x) is chosen to
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have heavy tails, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling procedure is:
for n = 1 to N do
  Sample x \sim q(x)
  Sample u \sim U(0, 1)
  if u > \frac{p(x)}{C q(x)} then
    Reject x and repeat the sampling
  end if
end for
Figure: Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), the likelihood p(y|x) \le C, and C is known, then
p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)
and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
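A sketch with an assumed bimodal target and proposal q = N(0, 3²); here the bound C is found numerically on a grid purely for illustration (in practice C must be known beforehand):

import numpy as np

rng = np.random.default_rng(0)

def p(x):   # toy bimodal target (unnormalized is fine for rejection sampling)
    return 0.5 * np.exp(-0.5 * (x - 2)**2) + 0.5 * np.exp(-0.5 * (x + 2)**2)

def q(x, s=3.0):   # proposal density N(0, s^2)
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2 * np.pi))

grid = np.linspace(-10, 10, 10_001)
C = 1.05 * np.max(p(grid) / q(grid))         # ensures p(x) < C q(x) on the grid

samples = []
while len(samples) < 1000:
    x = rng.normal(0.0, 3.0)                 # sample x ~ q
    if rng.uniform() <= p(x) / (C * q(x)):   # accept with probability p / (C q)
        samples.append(x)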
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
p(x_{0:n}) = p(x_0) p(x_1 | x_0) \cdots p(x_n | x_0, ..., x_{n-1})
q(x_{0:n}) = q_0(x_0) q_1(x_1 | x_0) \cdots q_n(x_n | x_0, ..., x_{n-1})
The weights can be recursively calculated as
W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1}) \frac{p(x_n | x_{0:n-1})}{q_n(x_n | x_{0:n-1})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.
We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.
The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with big weights.
Sample N random samples \{x^{(i)}\}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set \{x^{(i)}\}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
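The four steps above, as a small function; the toy target/proposal pair in the usage lines is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)

def sir(p_pdf, q_pdf, q_sample, N):
    x = q_sample(N)                        # 1. sample N draws from q
    W = p_pdf(x) / q_pdf(x)                # 2. importance weights
    W /= W.sum()                           # 3. normalize
    return x[rng.choice(N, size=N, p=W)]   # 4. resample with replacement

# Usage: equally weighted draws approximately from N(1,1), via proposal N(0,2^2)
gauss = lambda x, m, s: np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
draws = sir(lambda x: gauss(x, 1, 1), lambda x: gauss(x, 0, 2),
            lambda n: rng.normal(0, 2, size=n), N=10_000)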
Remarks on Sampling Importance Resampling
Resampling can be performed at every step, or only when regarded as necessary.
Deterministic resampling is performed at every k-th time step (usually k = 1). Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights \{x^{(i)}, W^{(i)}\} are replaced by new samples with equal importance weights (i.e. W^{(i)} = 1/N).
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenating the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since each conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} \sim p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
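A sketch for the standard toy case of a bivariate Gaussian with correlation ρ, where both conditionals are 1-D Gaussians with known parameters:

import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho=0.8, M=5000):
    """Samples from N(0, [[1, rho], [rho, 1]]) by alternating conditionals."""
    x1 = x2 = 0.0                        # arbitrary starting state
    s = np.sqrt(1.0 - rho**2)            # std dev of each conditional
    out = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, s)     # x1 ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, s)     # x2 ~ p(x2 | x1)
        out[n] = (x1, x2)
    return out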
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good, since the gradient direction might indicate the way to find the state with a higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:
\int f(x) \, dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
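A 1-D numeric sketch of the formula above (N_x = 1), with a toy integrand, a grid search for the mode x̂, and a finite-difference second derivative of log f; all of these numerical shortcuts are assumptions for illustration:

import numpy as np

f = lambda x: np.exp(-0.5 * x**2) / (1 + x**2)      # toy integrand
grid = np.linspace(-5, 5, 20_001)
x_hat = grid[np.argmax(f(grid))]                    # mode of f
h = 1e-4                                            # finite-difference step
d2 = (np.log(f(x_hat - h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat + h))) / h**2
laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5   # (2pi)^{1/2} f(x_hat) |-d2|^{-1/2}
exact = np.trapz(f(grid), grid)                     # numeric reference value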
Iterative Quadrature
A numerical approximation method which has been widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\int_a^b f(x) p(x) \, dx \approx \sum_{k=1}^m c_k f(x_k)
where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
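For instance, taking p(x) = N(0, 1) as the weighting function, Gauss-Hermite quadrature supplies the points x_k and weights c_k (after the change of variables x = √2 t); a toy check with f(x) = cos(x), whose exact Gaussian expectation is e^{-1/2}:

import numpy as np

t, c = np.polynomial.hermite.hermgauss(10)    # nodes/weights for ∫ e^{-t^2} g(t) dt
approx = np.sum(c * np.cos(np.sqrt(2.0) * t)) / np.sqrt(np.pi)
exact = np.exp(-0.5)                          # E[cos(x)] under N(0,1)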
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density.
The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = \int_X x^k p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k
where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(x | \mu_j, \Sigma_j)
where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
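For example, the overall mean and variance of a 1-D mixture follow directly from the component moments; the component values below are assumed for illustration:

import numpy as np

c  = np.array([0.3, 0.7])     # weights c_j > 0, summing to 1
mu = np.array([-1.0, 2.0])    # component means
s2 = np.array([0.5, 1.5])     # component variances
mean = np.dot(c, mu)                       # E[x] = sum_j c_j mu_j
var = np.dot(c, s2 + mu**2) - mean**2      # second moment minus mean squared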
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations; let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure; the higher the probability, the denser the concentration of particles.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})
where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n}).
With this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{0:n}) \, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define
W_n(x_n) = \frac{p(y_{0:n} | x_n) \, p(x_n)}{q(x_n | y_{0:n})}
We can then write
E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} q(x_n | y_{0:n}) \, dx_n
= \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n | y_{0:n}) \, dx_n
= \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) p(x_n) \, dx_n}
= \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) q(x_n | y_{0:n}) \, dx_n}
= \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}
By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)
where the normalized weights are defined as
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})
As before, the posterior can be written as
p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}
We can then create a recursive rule to update the weights:
W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})} \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})}
= W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).
Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write
W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N
In practice this cannot be computed, so we approximate it as
N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}
When N_{eff} falls below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection sampling and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
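The approximation above is a one-liner; a small helper, matching the trigger used in the SIS algorithm with resampling shown earlier:

import numpy as np

def effective_sample_size(W):
    """N_eff ≈ 1 / sum_i W_i^2 for normalized weights W."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W**2)

# Uniform weights give N_eff = N; one dominant weight drives N_eff toward 1
print(effective_sample_size(np.full(100, 0.01)))         # 100.0
print(effective_sample_size([0.99] + [0.01 / 99] * 99))  # close to 1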
Kalman filter Forward equations I
We start by defining the messages
α(zn) = N (zn|micronVn)
Using the HMM recursion formulas for continuous variables we have
cnα(zn) = p(xn|zn)
intα(znminus1)p(zn|znminus1)dznminus1
Substituting the conditionals we have
cnN (zn|micron Vn) = N (xn|Czn Σ)
ZN (znminus1|micronminus1 Vnminus1)N (zn|Axnminus1 Γ)dznminus1
= N (xn|Czn Σ)N (zn|Amicronminus1 Pnminus1)
Here we assume that micronminus1 and Vnminus1 are known and we have defined
Pnminus1 = AVnminus1AT + Γ
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 22 69
Kalman filter Forward equations II
Given the values of micronminus1 Vnminus1 and the new observation xn we canevaluate the Gaussian marginal for zn having mean micron and covariance Vn aswell as the normalization coefficient cn
micron = Amicronminus1 + Kn(xn minus CAmicronminus1)
Vn = (IminusKnC)Pnminus1
cn = N (xn|CAmicronminus1CPnminus1CT + Σ)
where the Kalman gain matrix is defined as
Kn = Pnminus1CT (CPnminus1CT + Σ)minus1
The initial conditions are given by
micro1 = micro0 + K1(x1 minus Cmicro0) V1 = (IminusK1C)V0
c1 = N (x1|Cmicro0CV0CT + Σ) K1 = V0CT (CV0CT + Σ)minus1
Interpretation is making prediction and doing corrections with Kn
The likelihood can be computed as p(X) =prodN
n=1 cn
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 23 69
Optimum non-linear filters
The use of Kalman filtering is limited by the ubiquitous nonlinearityand non-Gaussianity of physical world
The nonlinear filtering consists in finding p(x|yn0)
The number of variables is infinite but not all of them are of equalimportance
Global approach one attempts to solve a PDE instead of an ODEin linear case Numerical approximation techniques are needed tosolve the equation
Local approach finite sum approximation (eg Gaussian sum filter)linearization techniques (ie EKF) or numerical approximations (egparticle filter) are usually used
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing the samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional densities to be sampled are low dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.

Draw an initial sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
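As a concrete sketch (an assumed example, not from the lecture), here is a Gibbs sampler for a standard bivariate Gaussian with correlation rho, where both conditionals are available in closed form:

import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho ** 2)          # std dev of each conditional
    samples = np.empty((n_iter, 2))
    for n in range(n_iter):
        x1 = rng.normal(rho * x2, s)     # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, s)     # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        samples[n] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples[1000:].T)[0, 1])   # ≈ 0.8 after discarding burn-in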
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions that divide the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method that uses gradient information to reduce random-walk behavior.
This is useful since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be handled with the EKF method.
The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and then computes the volume:
∫ f(x) dx ≈ (2π)^{N_x/2} f(x̂) |−∇∇ log f(x̂)|^{−1/2}
The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x̂.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
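As a small illustration (an assumed 1-D example, not from the lecture), the Laplace approximation can be computed numerically by locating the mode and differentiating log f twice:

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) * (1.0 + 0.1 * np.sin(x))  # unnormalized, unimodal
res = minimize_scalar(lambda x: -np.log(f(x)), bounds=(-5.0, 5.0), method="bounded")
x_hat, h = res.x, 1e-4
log_f = lambda x: np.log(f(x))
hess = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2  # (d^2/dx^2) log f at the mode
laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * (-hess) ** -0.5                 # (2*pi)^{1/2} f(x_hat) |-H|^{-1/2}
xs = np.linspace(-5.0, 5.0, 100001)
print(laplace, np.sum(f(xs)) * (xs[1] - xs[0]))   # Laplace estimate vs. numerical integration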
Iterative Quadrature
A numerical approximation method that was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)
where p(x) is treated as a weighting function and x_k is the k-th quadrature point.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
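For instance (an illustrative example, not from the lecture), Gauss-Hermite quadrature approximates ∫ f(x) e^{-x²} dx by Σ_k c_k f(x_k); with a change of variables it computes Gaussian expectations:

import numpy as np

xk, ck = np.polynomial.hermite.hermgauss(20)             # quadrature points and weights
f = lambda x: np.cos(x)
E = np.sum(ck * f(np.sqrt(2.0) * xk)) / np.sqrt(np.pi)   # E[cos X] for X ~ N(0, 1)
print(E, np.exp(-0.5))                                   # exact value is e^{-1/2}

With only 20 quadrature points the estimate matches the exact value to many digits, which is the appeal of quadrature when the integrand is smooth.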
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{0:n}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so a greater resolution cannot be given to regions where the state has high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N (x^(i))^k
where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
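A one-line numeric check of the sample-moment approximation (illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)           # samples from p = N(0, 1)
print([np.mean(x ** k) for k in (1, 2, 3, 4)])   # ≈ [0, 1, 0, 3], the Gaussian moments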
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
p(x) = Σ_{j=1}^m c_j N(x; μ_j, Σ_j)
where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution, obtained by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
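For example, the first two moments of a 1-D mixture follow directly from the component parameters (a small sketch with assumed values):

import numpy as np

c = np.array([0.3, 0.7])               # mixture weights, sum to 1
mu = np.array([-1.0, 2.0])             # component means
var = np.array([0.5, 1.5])             # component variances
mean = np.sum(c * mu)                  # E[x] = sum_j c_j mu_j
ex2 = np.sum(c * (var + mu ** 2))      # E[x^2] = sum_j c_j (var_j + mu_j^2)
print(mean, ex2 - mean ** 2)           # mixture mean and variance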
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next:
Particle filters
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the more densely the particles are concentrated.
The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK equation.
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:
p(x_n|y_{0:n}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n|y_{0:n})
where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n|y_{0:n}).
By this approximation, we can estimate the mean of a nonlinear function:
E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n|y_{0:n}) dx_n
          = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
          = (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̂_N(x)
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{0:n}). Let's define
W_n(x_n) = p(y_{0:n}|x_n) p(x_n) / q(x_n|y_{0:n})
We can then write
E[f(x_n)] = ∫ f(x_n) [p(x_n|y_{0:n}) / q(x_n|y_{0:n})] q(x_n|y_{0:n}) dx_n
          = ∫ f(x_n) [W_n(x_n) / p(y_{0:n})] q(x_n|y_{0:n}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / ∫ p(y_{0:n}|x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n|y_{0:n}) dx_n / ∫ W_n(x_n) q(x_n|y_{0:n}) dx_n
          = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]
Sequential Monte Carlo estimation III
We have written
E[f(x_n)] = E_{q(x_n|y_{0:n})}[W_n(x_n) f(x_n)] / E_{q(x_n|y_{0:n})}[W_n(x_n)]
By drawing i.i.d. samples x_n^(i) from q(x_n|y_{0:n}), we can approximate
E[f(x_n)] ≈ [(1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i))] / [(1/N) Σ_{i=1}^N W_n(x_n^(i))]
          = Σ_{i=1}^N W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)
where the normalized weights are defined as
W̃(x_n^(i)) = W_n(x_n^(i)) / Σ_{j=1}^N W_n(x_n^(j))
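A quick numeric check of this self-normalized estimator (an assumed example): estimate E[x²] under p = N(1, 1) using samples from q = N(0, 2²), with weights known only up to a constant:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=50_000)                   # x ~ q
log_w = -0.5 * (x - 1.0) ** 2 + 0.5 * (x / 2.0) ** 2    # log p - log q, up to a constant
w = np.exp(log_w - log_w.max())
w_tilde = w / w.sum()                                   # normalized weights
print(np.sum(w_tilde * x ** 2))                         # ≈ 2, since E[x^2] = 1^2 + 1

Because the unknown normalizing constants cancel in the ratio, the estimator only needs the weights up to proportionality, at the price of a small bias (it remains consistent).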
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
q(x_{0:n}|y_{0:n}) = q(x_0) Π_{t=1}^n q(x_t|x_{0:t−1}, y_{0:t})
As before, the posterior can be written as
p(x_{0:n}|y_{0:n}) = p(x_{0:n−1}|y_{0:n−1}) p(y_n|x_n) p(x_n|x_{n−1}) / p(y_n|y_{0:n−1})
We can then create a recursive rule to update the weights:
W_n^(i) = p(x_{0:n}^(i)|y_{0:n}) / q(x_{0:n}^(i)|y_{0:n})
        ∝ p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) p(x_{0:n−1}^(i)|y_{0:n−1}) / [q(x_n^(i)|x_{0:n−1}^(i), y_{0:n}) q(x_{0:n−1}^(i)|y_{0:n−1})]
        = W_{n−1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{0:n−1}^(i), y_{0:n})
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{0:n}) than in p(x_{0:n}|y_{0:n}).
Let's assume that q(x_n^(i)|x_{0:n−1}^(i), y_{0:n}) = q(x_n^(i)|x_{0:n−1}^(i), y_n); then we can write
W_n^(i) = W_{n−1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{0:n−1}^(i), y_n)
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
N_eff = N / E_{q(·|y_{0:n})}[(W(x_{0:n}))²] ≤ N
In practice this cannot be computed, so we approximate it as
N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{0:n}^(i)))²
When N_eff is below a threshold P, resampling is performed.
N_eff can also be used to combine rejection and importance sampling.
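A tiny helper showing the approximation on normalized weights (illustrative):

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ≈ 1 / sum_i w_i^2 for normalized importance weights w
    w = np.asarray(w_tilde, dtype=float)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))          # uniform weights: N_eff = 100
print(effective_sample_size([0.97] + [0.03 / 99] * 99))   # skewed weights: N_eff ≈ 1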
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^(i) ~ q(x_n|x_{0:n−1}^(i), y_{0:n})
        Set x_{0:n}^(i) = {x_{0:n−1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^(i) = W_{n−1}^(i) p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{0:n−1}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃(x_{0:n}^(i)))²
    if N_eff < P then
        Generate new x_{0:n}^(j) by resampling with replacement N times from {x_{0:n}^(i)}, with probability P(x_{0:n}^(j) = x_{0:n}^(i)) = W̃(x_{0:n}^(i))
        Reset the weights: W_n^(i) = 1/N
    end if
end for
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^(i) ~ p(x_n|x_{n−1}^(i))
    end for
    Set x̂_{0:n}^(i) = {x_{0:n−1}^(i), x̂_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n|x̂_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to the W̃_n^(i)
end for
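To see the whole algorithm run end to end, here is a compact bootstrap/SIR filter for an assumed 1-D linear-Gaussian model x_n = a x_{n−1} + w_n, y_n = x_n + v_n (a sketch; all parameter values are choices for the example, and the linear-Gaussian case is used only so the behavior is easy to check):

import numpy as np

rng = np.random.default_rng(0)
a, q_std, r_std, T, N = 0.9, 1.0, 0.5, 50, 1000   # assumed model parameters

# Simulate a trajectory and observations from the model
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + rng.normal(0.0, q_std)
y = x_true + rng.normal(0.0, r_std, size=T)

# Bootstrap filter: proposal = transition prior p(x_n | x_{n-1})
particles = rng.normal(0.0, 1.0, size=N)          # samples from an assumed p(x_0)
estimates = np.zeros(T)
for n in range(T):
    particles = a * particles + rng.normal(0.0, q_std, size=N)   # propagate through the dynamics
    log_w = -0.5 * ((y[n] - particles) / r_std) ** 2             # weight update: W ∝ p(y_n|x_n)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                                 # normalize
    estimates[n] = np.sum(w * particles)                         # posterior mean estimate
    particles = particles[rng.choice(N, size=N, replace=True, p=w)]   # resample every step

print(np.mean((estimates - x_true) ** 2))   # filter MSE on this run

Note that, as in the listing above, resampling is performed at every step, and the state estimate is computed from the weighted particles before resampling.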
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to
W_n = W_{n−1}^α p(y_n|x_n^(i)) p(x_n^(i)|x_{n−1}^(i)) / q(x_n^(i)|x_{0:n−1}^(i), y_n)
where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: CONDENSATION.
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
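As a quick sketch, the approximate N_eff is a one-liner on normalized weights; the example weight vectors below are made up to show the two extremes.

```python
import numpy as np

def effective_sample_size(w_normalized):
    # N_eff ~= 1 / sum_i w_i^2: equals N for uniform weights,
    # and approaches 1 when a single particle carries all the mass.
    return 1.0 / np.sum(w_normalized ** 2)

print(effective_sample_size(np.array([0.25, 0.25, 0.25, 0.25])))  # 4.0, no degeneracy
print(effective_sample_size(np.array([0.97, 0.01, 0.01, 0.01])))  # ~1.06, resample
```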
SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
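A runnable sketch of this loop for a toy one-dimensional model follows; the random-walk transition, the Gaussian likelihood, and the choice of the transition prior as proposal are assumptions for illustration only (with that choice, the transition/proposal ratio in the weight update cancels).

```python
import numpy as np

rng = np.random.default_rng(1)

def sis_filter(ys, N=500, threshold=0.5):
    """SIS with resampling; proposal = transition prior, so the
    p(x_n | x_{n-1}) / q(...) factor in the weight update cancels."""
    x = rng.normal(0.0, 1.0, size=N)                # initial particles
    W = np.full(N, 1.0 / N)
    estimates = []
    for y in ys:
        x = x + rng.normal(0.0, 1.0, size=N)        # x_n^(i) ~ q = p(x_n | x_{n-1}^(i))
        W = W * np.exp(-0.5 * (y - x) ** 2)         # W_n = W_{n-1} p(y_n | x_n^(i))
        W = W / W.sum()                             # normalize the weights
        estimates.append(np.sum(W * x))             # posterior-mean estimate, pre-resampling
        if 1.0 / np.sum(W ** 2) < threshold * N:    # N_eff below the threshold P
            x = x[rng.choice(N, size=N, p=W)]       # resample with replacement
            W = np.full(N, 1.0 / N)                 # reset weights to 1/N
    return np.array(estimates)

ys = np.cumsum(rng.normal(0.0, 1.0, size=50))       # made-up observation sequence
print(sis_filter(ys)[:5])
```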
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the same particles, thereby introducing high correlation between particles.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
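For comparison, the bootstrap/SIR variant with resampling at every step collapses to a few lines; the toy random-walk model and unit noise scales are the same illustrative assumptions as before.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_step(x, y):
    x = x + rng.normal(0.0, 1.0, size=x.size)       # x_n^(i) ~ p(x_n | x_{n-1}^(i))
    W = np.exp(-0.5 * (y - x) ** 2)                 # W_n^(i) = p(y_n | x_n^(i))
    W = W / W.sum()
    return x[rng.choice(x.size, size=x.size, p=W)]  # resample at every step

x = rng.normal(0.0, 1.0, size=1000)
for y in (0.5, 1.2, 0.8):                           # made-up observations
    x = bootstrap_step(x, y)
print(x.mean())                                     # filtered state estimate
```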
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters

In the SIR filter resampling is always performed.

In the SIS filter importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate sample degeneracy in the SIS filter we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Optimum non-linear filters

The use of Kalman filtering is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world.

Nonlinear filtering consists in finding p(x | y_{n:0}).

The number of variables is infinite, but not all of them are of equal importance.

Global approach: one attempts to solve a PDE, instead of an ODE as in the linear case. Numerical approximation techniques are needed to solve the equation.

Local approach: finite sum approximations (e.g. Gaussian sum filter), linearization techniques (i.e. EKF) or numerical approximations (e.g. particle filter) are usually used.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 24 69
Extended Kalman filter (EKF)

Recall the equations of motion

x_{n+1} = f(x_n, w_n)
y_n = g(x_n, v_n)

These equations are linearized in the EKF:

F_{n+1,n} = \frac{d f(x)}{d x}\Big|_{x = \hat{x}_n},
G_n = \frac{d g(x)}{d x}\Big|_{x = \hat{x}_{n|n-1}}

Then the conventional Kalman filter can be employed.

Because the EKF always approximates the posterior p(x_n | y_{n:0}) as a Gaussian, it provides poor performance when the true posterior is non-Gaussian (e.g. heavily skewed or multimodal).

A more general solution is to rely on numerical approximations.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
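A sketch of one EKF predict/update cycle built on these linearizations; the scalar additive-noise model, the noise covariances, and the numerical (central-difference) Jacobians are illustrative assumptions, not the lecture's model.

```python
import numpy as np

def jacobian(fun, x, eps=1e-6):
    # central-difference Jacobian of fun at x
    J = np.zeros((fun(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros(x.size); dx[j] = eps
        J[:, j] = (fun(x + dx) - fun(x - dx)) / (2.0 * eps)
    return J

def ekf_step(x, P, y, f, g, Q, R):
    F = jacobian(f, x)                        # F_{n+1,n} = df/dx at x_n
    x_pred, P_pred = f(x), F @ P @ F.T + Q    # predict
    G = jacobian(g, x_pred)                   # G_n = dg/dx at x_{n|n-1}
    S = G @ P_pred @ G.T + R                  # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - g(x_pred))      # update with the innovation
    P_new = (np.eye(x.size) - K @ G) @ P_pred
    return x_new, P_new

# toy scalar model with mildly nonlinear dynamics and measurement (assumed)
f = lambda x: np.array([np.sin(x[0])])
g = lambda x: np.array([x[0] ** 2])
x, P = ekf_step(np.array([0.1]), np.eye(1), np.array([0.02]),
                f, g, Q=0.01 * np.eye(1), R=0.1 * np.eye(1))
print(x, P)
```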
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling

It's a brute force technique which requires that one can draw iid samples x^{(1)}, ..., x^{(N)} from the probability distribution P(x), so that

\int_X f(x) dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N

for which E[\bar{f}_N] = E[f] and Var[\bar{f}_N] = \frac{1}{N} Var[f] = \frac{\sigma^2}{N}

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), \bar{f}_N(x) converges to E[f(x)] almost surely.

The convergence rate is assessed by the Central Limit Theorem:

\sqrt{N} (\bar{f}_N - E[f]) \sim \mathcal{N}(0, \sigma^2)

where \sigma^2 is the variance of f(x). The error rate is thus of order O(N^{-1/2}).

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
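The O(N^{-1/2}) behavior is easy to check numerically; this sketch estimates E[x^2] = 1 for x ~ N(0,1) at increasing sample sizes (the integrand is an arbitrary choice for the example).

```python
import numpy as np

rng = np.random.default_rng(3)
for N in (10**2, 10**4, 10**6):
    x = rng.normal(size=N)
    err = abs(np.mean(x ** 2) - 1.0)   # |f_bar_N - E[f]|, with truth E[x^2] = 1
    print(N, err)                      # error shrinks roughly like N^(-1/2)
```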
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

\int_X f(x) dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \bar{f}_N

There are two fundamental problems:

How to draw samples from a probability distribution P(x)?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. E[f(x)] = \int f(x) dP(x)?

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator

Consistency An estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.

Unbiasedness An estimator is unbiased if its expected value is equal to the true value.

Efficiency An estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness An estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.

Minimal variance

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

\int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx

Figure Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
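A classic use of this identity is rare-event estimation, sketched below: the tail probability P(x > 4) under p = N(0,1) is estimated with a proposal q = N(4,1) centered on the region of importance. The target, proposal, and event are illustrative choices for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
x = rng.normal(4.0, 1.0, size=N)      # x^(i) ~ q = N(4, 1)
W = np.exp(8.0 - 4.0 * x)             # W = p(x)/q(x) = exp((x-4)^2/2 - x^2/2)
f = (x > 4.0).astype(float)           # f = indicator of the rare event
print(np.mean(W * f))                 # ~3.17e-5, matching 1 - Phi(4)
```

Plain Monte Carlo would need on the order of 30 000 samples per single hit of the event, while the shifted proposal puts roughly half the samples in the region of interest.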
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized, we use

\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)}) with \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[ f(x) \frac{p(x)}{q(x)} \right] = \frac{1}{N} \int \frac{(f(x) p(x))^2}{q(x)} dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to

match the shape of p(x) so as to approximate the true variance

match the shape of |f(x)| p(x) so as to further reduce the true variance

The estimator is biased but consistent.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space the importance sampling estimate is likely to be dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X; the sampling proceeds as follows:

for n = 1 to N do
  Sample x ~ q(x)
  Sample u ~ U(0, 1)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for

Figure Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical: if C is too small, the samples are not reliable because of the low rejection rate; if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
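A minimal sketch of this loop, with an assumed target p(x) proportional to x(1-x) on [0,1] (a Beta(2,2) shape), a uniform proposal q, and envelope constant C = 1/4 so that p(x) <= C q(x):

```python
import numpy as np

rng = np.random.default_rng(5)
p_unnorm = lambda x: x * (1.0 - x)   # assumed target, max 1/4 at x = 1/2
C = 0.25                             # envelope: p(x) <= C * q(x), with q(x) = 1 on [0,1]

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                # x ~ q
    u = rng.uniform()                # u ~ U(0, 1)
    if u <= p_unnorm(x) / (C * 1.0): # accept with probability p(x) / (C q(x))
        samples.append(x)            # accepted draws are exact samples from p
print(np.mean(samples))              # ~0.5, the Beta(2,2) mean
```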
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II

According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1 | x_0) \cdots p(x_n | x_0, ..., x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1 | x_0) \cdots q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n | x_{n-1:0})}{q_n(x_n | x_{n-1:0})}

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many iid replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, to cope with this situation, a resampling step is suggested after weight normalization.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights:

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by new samples with equal importance weights (i.e. W^{(i)} = 1/N).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power; when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods; one common scheme is sketched below.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
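One common choice is systematic resampling, shown here as an assumed example of such a scheme; it tends to introduce less extra Monte Carlo variation than plain multinomial resampling because it uses a single uniform jitter shared across N evenly spaced strata.

```python
import numpy as np

def systematic_resample(w_normalized, rng):
    # one uniform draw, spread over N evenly spaced positions in [0, 1)
    N = w_normalized.size
    positions = (rng.uniform() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(w_normalized), positions)

rng = np.random.default_rng(6)
idx = systematic_resample(np.array([0.1, 0.2, 0.3, 0.4]), rng)
print(idx)   # indices of the particles that survive, e.g. [1 2 3 3]
```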
Gibbs sampling

It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
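A sketch of this sweep for an assumed toy target, a bivariate Gaussian with correlation rho, where each conditional is itself a one-dimensional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(7)
rho, M = 0.8, 5_000
x1 = x2 = 0.0                        # starting state x_0
samples = np.empty((M, 2))
for n in range(M):
    # alternate the conditionals: x1 | x2 ~ N(rho*x2, 1-rho^2), then x2 | x1
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
    samples[n] = (x1, x2)
print(np.corrcoef(samples.T)[0, 1])  # ~0.8, recovering the target correlation
```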
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contours).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e. particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x) and further computing the volume:

\int f(x) dx \approx (2\pi)^{N_x / 2} f(\hat{x}) | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces poor approximation results for multimodal distributions, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
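A one-dimensional numerical sketch of this formula, with an assumed Gaussian-bump integrand (for which the approximation happens to be exact); the grid mode search and finite-difference curvature are illustrative shortcuts.

```python
import numpy as np

f = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)   # assumed integrand; true integral sqrt(2*pi)
log_f = lambda x: np.log(f(x))

xs = np.linspace(-5.0, 10.0, 20_001)
x_hat = xs[np.argmax(f(xs))]                  # mode of f
h = 1e-4                                      # finite-difference Hessian of log f at the mode
curv = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2
approx = np.sqrt(2.0 * np.pi) * f(x_hat) * abs(curv) ** -0.5
print(approx, np.sqrt(2.0 * np.pi))           # the two agree for a Gaussian bump
```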
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points.

The values x_k are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
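For a Gaussian weighting function this is Gauss-Hermite quadrature; the sketch below computes E[cos(x)] for x ~ N(0,1), whose exact value is e^{-1/2} (the test function is an arbitrary choice for the example).

```python
import numpy as np

# nodes x_k and coefficients c_k for integrals of the form  int e^{-x^2} f(x) dx
nodes, weights = np.polynomial.hermite.hermgauss(10)

f = np.cos
# the change of variables x -> sqrt(2)*x folds in the N(0,1) density p(x)
estimate = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
print(estimate, np.exp(-0.5))   # ~0.6065 in both cases
```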
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the states with higher density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j \mathcal{N}(\hat{x}^{(j)}, \Sigma^{(j)})

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.

Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be iid samples drawn from p(x_n | y_{n:0}).

By this approximation we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n
            = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
            = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \bar{f}_N(x)

Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Extended Kalman filter (EKF)
Recall the equations of motion
xn+1 = f(xnwn)
yn = g(xn vn)
These equations are linearized in the EKF
Fn+1n =df(x)
dx
∣∣∣∣x=xn
Gn+1n =dg(x)
dx
∣∣∣∣x=xn|nminus1
Then the conventional Kalman filter can be employed
Because EKF always approximates the posterior p(xn|yn0) as a Gaussianprovides poor performance when the true posterior is non-Gaussian (egheavily skewed or multimodal)
A more general solution is to rely on numerical approximations
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 25 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 26 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling
Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in cases where the joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models
Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
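As a concrete sketch, consider a zero-mean bivariate Gaussian with correlation rho, where both conditionals are one-dimensional Gaussians; the model is an assumption chosen so that each conditional draw is trivial.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, M, rng):
    """Gibbs sampler sketch for an assumed zero-mean bivariate Gaussian with
    correlation rho: alternately sample each coordinate from its conditional."""
    x1, x2 = 0.0, 0.0                    # initial state x_0
    s = np.sqrt(1.0 - rho**2)            # std dev of each conditional
    chain = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, s)     # x_{1,n} ~ p(x1 | x_{2,n-1})
        x2 = rng.normal(rho * x1, s)     # x_{2,n} ~ p(x2 | x_{1,n}), using the fresh x1
        chain[n] = (x1, x2)
    return chain

rng = np.random.default_rng(0)
chain = gibbs_bivariate_gaussian(rho=0.9, M=10000, rng=rng)
print(np.corrcoef(chain[1000:].T)[0, 1])  # ~0.9 after discarding burn-in
```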
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions that divide the whole space
Stratified sampling works very well and is efficient in a not-too-high-dimensional space
Hybrid Monte Carlo: a Metropolis method that uses gradient information to reduce random-walk behavior
This is useful because the gradient direction may indicate the way to states with higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability
By assuming the posterior is Gaussian, the nonlinear filtering can be handled with the EKF method
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume

    \int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla\nabla \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
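A rough one-dimensional sketch of this idea, assuming a smooth unimodal f: locate the mode numerically and estimate the curvature of log f by a finite difference (the test function below is a toy choice).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_integral_1d(f, h=1e-4):
    """Laplace approximation sketch in 1-D: fit a Gaussian at the mode of f,
    using the curvature of log f for the variance."""
    x_hat = minimize_scalar(lambda x: -np.log(f(x))).x   # mode of f
    # second derivative of log f at the mode, by central finite differences
    d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h**2
    return np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** (-0.5)

# Toy check: for f(x) = exp(-x^2/2) the exact integral is sqrt(2*pi)
f = lambda x: np.exp(-0.5 * x**2)
print(laplace_integral_1d(f), np.sqrt(2 * np.pi))  # the two values agree
```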
Iterative Quadrature
A numerical approximation method widely used in computer graphics and physics
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

    \int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points
The values x_k are determined by the weighting function p(x) on the interval [a, b]
This method can produce a good approximation if the nonlinear function is smooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
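For instance, with p(x) a standard normal weighting function, a Gauss-Hermite rule (one common quadrature formula; the slides do not fix a particular one) gives:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, m):
    """Approximate \int f(x) p(x) dx for p = N(0,1) with an m-point Gauss-Hermite
    rule. hermgauss targets the weight exp(-t^2); the change of variables
    x = sqrt(2) t maps it to the standard normal weighting function."""
    t, w = hermgauss(m)                   # quadrature points t_k and coefficients w_k
    return np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

print(gauss_hermite_expectation(lambda x: x**2, m=10))  # E[x^2] = 1 for N(0,1)
```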
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n})
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density
The disadvantage of grid-based methods is that the grid is fixed: the state space cannot be partitioned unevenly to give greater resolution where the density is high
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets approximating the moments of the density, including the mean, covariance, and higher-order moments
We can empirically use the sample moments to approximate the true moments, namely

    m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

    p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(\hat{x}_j, \Sigma_j)

where the weighting coefficients c_j > 0 and \sum_{j=1}^m c_j = 1
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
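A small sketch of evaluating such a Gaussian-sum density; the two-component bimodal target below is an assumed example.

```python
import numpy as np

def gaussian_sum_pdf(x, c, mu, sigma):
    """Evaluate p(x) = sum_j c_j N(x; mu_j, sigma_j^2) with c_j > 0, sum_j c_j = 1."""
    x = np.asarray(x)[..., None]          # broadcast x against the m components
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ c                       # weighted sum of the component densities

# Assumed bimodal example with m = 2 components
c, mu, sigma = np.array([0.3, 0.7]), np.array([-2.0, 1.5]), np.array([0.5, 1.0])
print(gaussian_sum_pdf(np.linspace(-5, 5, 11), c, mu, sigma))
```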
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the more densely the particles are concentrated
The particle system evolves over time according to the state equation, with the evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf
However, since the posterior density is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

    p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n})
By this approximation, we can estimate the mean of a nonlinear function

    E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{0:n}) dx_n
             = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
             = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \bar{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

    W_n(x_n) = \frac{p(y_{0:n} | x_n) p(x_n)}{q(x_n | y_{0:n})}

We can then write

    E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} q(x_n | y_{0:n}) dx_n
              = \int f(x_n) \frac{W_n(x_n)}{p(y_{0:n})} q(x_n | y_{0:n}) dx_n
              = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n}{\int p(y_{0:n} | x_n) p(x_n) dx_n}
              = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n}{\int W_n(x_n) q(x_n | y_{0:n}) dx_n}
              = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

    E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{0:n}), we can approximate

    E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})}
              = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

    \tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
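A compact numpy sketch of this self-normalized estimator, computed in log space for numerical stability; the Gaussian target/proposal pair is a toy assumption (note that unknown normalizing constants cancel in the ratio).

```python
import numpy as np

def snis_expectation(f, log_p, log_q, x):
    """Self-normalized importance sampling: E_p[f] ~= sum_i W_i f(x_i) with
    normalized weights W_i, where x_i ~ q and p may be unnormalized."""
    log_w = log_p(x) - log_q(x)
    w = np.exp(log_w - log_w.max())       # subtract the max for stability
    w = w / w.sum()                       # normalized weights
    return np.sum(w * f(x))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=20000)               # proposal q = N(0, 2^2)
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0)
log_p = lambda x: -0.5 * (x - 1.0) ** 2            # unnormalized target N(1, 1)
print(snis_expectation(lambda x: x, log_p, log_q, x))  # close to 1, the mean of p
```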
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as

    q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^n q(x_t | x_{0:t-1}, y_{0:t})

As before, the posterior can be written as

    p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}

We can then create a recursive rule to update the weights

    W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})}
              \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})}
              = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters
Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n})
Let's assume that q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n); then we can write

    W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure of degeneracy is the so-called effective sample size

    N_{eff} = \frac{N}{E_{q(\cdot | y_{0:n})}[(W(x_{0:n}))^2]} \le N

In practice this cannot be computed, so we approximate it as

    N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed
N_{eff} can also be used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
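The approximate N_eff is one line of numpy; the sketch below contrasts an assumed uniform weight vector with a degenerate one.

```python
import numpy as np

def effective_sample_size(w):
    """N_eff ~= 1 / sum_i w_i^2 for normalized importance weights w_i; it equals
    N for uniform weights and approaches 1 as the weights degenerate."""
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))                   # 100.0
print(effective_sample_size(np.array([0.99] + [0.01 / 99] * 99)))  # close to 1
```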
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} \sim q(x_n | x_{0:n-1}^{(i)}, y_{0:n})
        Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_n^{(j)} by resampling with replacement N times from {x_{0:n}^{(i)}}, with probability P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
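Putting the pieces together, the following is a sketch of the algorithm above on an assumed toy linear-Gaussian model, using the transition prior as the proposal q so that the weight update reduces to multiplying by the likelihood p(y_n | x_n).

```python
import numpy as np

def sis_filter_with_resampling(y, N, P, rng):
    """SIS particle filter sketch with adaptive resampling for the assumed model
    x_n = 0.9 x_{n-1} + v_n, y_n = x_n + e_n, v_n ~ N(0,1), e_n ~ N(0, 0.5^2).
    The transition prior is the proposal, so W_n = W_{n-1} p(y_n | x_n)."""
    x = rng.normal(0.0, 1.0, size=N)      # particles from the initial prior p(x_0)
    w = np.full(N, 1.0 / N)
    means = []
    for yn in y:
        x = 0.9 * x + rng.normal(size=N)              # x_n ~ p(x_n | x_{n-1})
        w = w * np.exp(-0.5 * ((yn - x) / 0.5) ** 2)  # weight by the likelihood
        w = w / w.sum()                               # normalize
        means.append(np.sum(w * x))                   # estimate before resampling
        if 1.0 / np.sum(w ** 2) < P:                  # resample only if N_eff < P
            idx = rng.choice(N, size=N, p=w)
            x, w = x[idx], np.full(N, 1.0 / N)        # reset weights to 1/N
    return np.array(means)

# Simulate the assumed model, then filter
rng = np.random.default_rng(0)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal()
    ys.append(x_true + 0.5 * rng.normal())
print(sis_filter_with_resampling(np.array(ys), N=1000, P=500, rng=rng)[-5:])
```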
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} \sim p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} \sim p(x_n | x_{n-1}^{(i)})
    end for
    Set x_{0:n}^{(i)} = {x_{0:n-1}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to W_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive
The choice of proposal distribution plays a crucial role in the final performance of both SIS and SIR filters
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist
To alleviate sample degeneracy in the SIS filter, we can change the update to

    W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can also be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.

Popular: CONDENSATION

Figure: CONDENSATION.

Monte Carlo sampling

It's a brute-force technique which, provided one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, approximates

$$\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

for which $E[\hat{f}_N] = E[f]$ and $\text{Var}[\hat{f}_N] = \frac{1}{N}\text{Var}[f] = \frac{\sigma^2}{N}$.

By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ almost surely.

The convergence rate is assessed by the Central Limit Theorem:

$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.

An important property is that the estimation accuracy is independent of the dimensionality of the state space.

The variance of the estimate is inversely proportional to the number of samples.

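A small sketch of the $O(N^{-1/2})$ behaviour, estimating $E[x^2] = 1$ for $x \sim \mathcal{N}(0,1)$ (the integrand and the sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[x^2] = 1 for x ~ N(0, 1); the error shrinks roughly like N^{-1/2}.
for N in (10**2, 10**4, 10**6):
    x = rng.standard_normal(N)
    f_hat = np.mean(x ** 2)
    print(N, f_hat, abs(f_hat - 1.0))
```
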
Fundamental problems of Monte Carlo estimation

Monte Carlo methods approximate

$$\int_X f(x) \, dP(x) \approx \frac{1}{N} \sum_{i=1}^N f(x^{(i)}) = \hat{f}_N$$

There are two fundamental problems:

How to draw samples from a probability distribution $P(x)$?

How to estimate the expectation of a function w.r.t. the distribution or density, i.e. $E[f(x)] = \int f(x) \, dP(x)$?

Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties in the model.

Minimal variance.

Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)

Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This matters in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x) \, p(x) \, dx = \int f(x) \, \frac{p(x)}{q(x)} \, q(x) \, dx$$

Figure: Importance sampling (Chen 03).

Monte Carlo importance sampling uses N independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N} \sum_{i=1}^N W(x^{(i)}) f(x^{(i)})$$

where $W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)})$ are called the importance weights.

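A minimal sketch of this estimator, assuming a $\mathcal{N}(0,1)$ target with a wider $\mathcal{N}(0,2)$ proposal (both choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Target p = N(0, 1), proposal q = N(0, 2); estimate E_p[x^2] = 1 with W = p/q.
N = 100_000
x = rng.normal(0.0, 2.0, N)
W = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weights
print(np.mean(W * x ** 2))              # ~1.0, plain IS estimate
print(np.sum((W / W.sum()) * x ** 2))   # ~1.0, self-normalized variant
```
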
Importance Sampling II

If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure normalization, we use the normalized importance weights:

$$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)}) f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^N W(x^{(i)})}$$

The variance of the estimate is given by

$$\text{Var}[\hat{f}] = \frac{1}{N}\text{Var}[f(x) W(x)] = \frac{1}{N}\text{Var}\!\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \frac{(f(x) \, p(x))^2}{q(x)} \, dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)| \, p(x)$, so as to further reduce the true variance.

The estimator is biased but consistent.

Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution $q(x)$ should have heavy tails, so as to be insensitive to outliers.

If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.

Rejection sampling

Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < C q(x)$ for every $x \in X$. The sampling procedure is:

for n = 1 to N do
    Sample $u \sim U(0, 1)$
    Sample $x \sim q(x)$
    if $u > \frac{p(x)}{C q(x)}$ then
        Repeat sampling
    end if
end for

Figure: Importance (left) and rejection (right) sampling (Chen 03).

The acceptance probability for a random variable is inversely proportional to the constant $C$.

The choice of $C$ is critical:
if $C$ is too small, the bound $p(x) < C q(x)$ may fail and the samples are not reliable because of the low rejection rate;
if $C$ is too large, sampling is inefficient since the acceptance rate will be low.

If the prior $p(x)$ is used as $q(x)$, the likelihood satisfies $p(y|x) \le C$, and $C$ is known, then

$$p(x|y) = \frac{p(y|x) \, p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)$$

and the acceptance rate for a sample $x$ is $\frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}$.

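A sketch under assumed densities: target $p(x) = \frac{3}{4}(1 - x^2)$ on $[-1, 1]$, uniform proposal $q = 1/2$, and $C = 3/2$, so the acceptance probability $p(x)/(C q(x))$ is exactly $1 - x^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(n):
    """Draw n exact samples from p(x) = 3/4 (1 - x^2) on [-1, 1].

    Proposal q = Uniform(-1, 1) with density 1/2; C = 3/2 guarantees
    p(x) <= C q(x) everywhere.
    """
    out = []
    while len(out) < n:
        x = rng.uniform(-1.0, 1.0)
        u = rng.uniform(0.0, 1.0)
        if u <= 1.0 - x ** 2:      # accept; otherwise repeat sampling
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)
print(samples.mean(), samples.var())   # ~0 and ~0.2 (the true variance of p)
```
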
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is prior knowledge of the constant $C$, which is sometimes unavailable.

It usually takes a long time to get samples when the ratio $p(x) / (C q(x))$ is close to zero.

Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})$$

then the importance sampling can be performed recursively.

Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(x_{0:n}) = p(x_0) \, p(x_1 | x_0) \cdots p(x_n | x_0, \cdots, x_{n-1})$$
$$q(x_{0:n}) = q_0(x_0) \, q_1(x_1 | x_0) \cdots q_n(x_n | x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1}) \, \frac{p(x_n | x_{0:n-1})}{q_n(x_n | x_{0:n-1})}$$

Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only few or one of the $W(x^{(i)})$ will be nonzero.

We will see that, to cope with this situation, a resampling step is suggested after weight normalization.

Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights.

Draw N random samples $\{x^{(i)}\}_{i=1}^N$ from $q(x)$
for i = 1, ..., N do
    $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$

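The resampling step itself is a few lines. The sketch below shows resampling with replacement as in the box above, plus systematic resampling, a common lower-variance alternative that the slides do not cover explicitly (both function names are mine):

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Resample with replacement; particle i is picked with probability W(x^(i))."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def systematic_resample(particles, weights, rng):
    """Lower-variance scheme: one uniform draw, then evenly spaced positions."""
    N = len(particles)
    positions = (rng.uniform() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[idx]

rng = np.random.default_rng(0)
x = np.array([-1.0, 0.0, 1.0, 2.0])
w = np.array([0.7, 0.1, 0.1, 0.1])
print(multinomial_resample(x, w, rng))  # mostly replicates of -1.0
print(systematic_resample(x, w, rng))
```
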
Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by new samples with equal importance weights (i.e. $W^{(i)} = 1/N$).

Resampling is important because:

if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.

Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i | x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for

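A classic sketch of this scheme: Gibbs sampling for an assumed zero-mean bivariate Gaussian with correlation $\rho$, where both full conditionals are one-dimensional Gaussians (the example model is mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho=0.8, M=5000):
    """Alternate the two 1-D conditionals x1 | x2 ~ N(rho*x2, 1 - rho^2)
    and x2 | x1 ~ N(rho*x1, 1 - rho^2)."""
    x1, x2 = 0.0, 0.0                  # start at the mean
    s = np.sqrt(1.0 - rho ** 2)
    chain = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, s)   # draw x1 ~ p(x1 | x2)
        x2 = rng.normal(rho * x1, s)   # draw x2 ~ p(x2 | x1)
        chain[n] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal()
print(np.corrcoef(chain[1000:].T))     # off-diagonal ~0.8 after burn-in
```
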
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from the state at step $n-1$, $x_1$ is sampled from the conditional pdf $p(x_1 | x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2 | x_{1,n})$. Right: four-step iterations in the probability space (contour).

Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among subregions dividing the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.

Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation

Gauss/Laplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be done with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x) \, dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:

$$\int f(x) \, dx \approx (2\pi)^{N_x/2} \, f(\hat{x}) \, \left| -\nabla\nabla \log f(\hat{x}) \right|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.

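A 1-D sketch of the formula above, applied to an assumed unimodal integrand $f(x) = \exp(-\cosh x)$ with mode at 0 (true integral $2K_0(1) \approx 0.842$):

```python
import numpy as np

def laplace_integral_1d(log_f, x_hat, d2_log_f):
    """1-D Laplace approximation: sqrt(2*pi) * f(x_hat) * |d^2 log f(x_hat)|^{-1/2}."""
    return np.sqrt(2.0 * np.pi) * np.exp(log_f(x_hat)) / np.sqrt(abs(d2_log_f(x_hat)))

log_f = lambda x: -np.cosh(x)
d2_log_f = lambda x: -np.cosh(x)   # second derivative of log f
print(laplace_integral_1d(log_f, 0.0, d2_log_f))   # ~0.922

# Brute-force Riemann sum for comparison.
x = np.linspace(-10.0, 10.0, 200_001)
print(np.sum(np.exp(-np.cosh(x))) * (x[1] - x[0]))  # ~0.842
```
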
Iterative Quadrature

A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x) \, p(x) \, dx \approx \sum_{k=1}^m c_k \, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.

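A sketch using Gauss-Hermite quadrature as one concrete rule of this form, with a standard-normal weighting function $p(x)$ (the choice of rule and of integrand are illustrative):

```python
import numpy as np

# hermgauss gives nodes/weights for the weight exp(-x^2); rescaling by sqrt(2)
# and 1/sqrt(pi) turns it into a standard-normal weighting function.
nodes, weights = np.polynomial.hermite.hermgauss(20)
x_k = np.sqrt(2.0) * nodes
c_k = weights / np.sqrt(np.pi)

f = np.cos                          # a smooth integrand
print(np.sum(c_k * f(x_k)))         # ~exp(-1/2) = 0.6065..., the exact E[cos x]
```
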
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n | y_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states, and a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.

Moment Approximation

Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_X x^k \, p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^N (x^{(i)})^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.

Gaussian Sum Approximation

The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^m c_j \, \mathcal{N}(x; \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.

Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.

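A small sketch of such a representation, with made-up mixture parameters, showing how the overall first two moments follow in closed form from the component moments:

```python
import numpy as np

# A Gaussian sum: p(x) = sum_j c_j N(x; mu_j, sigma_j^2), c_j > 0, sum c_j = 1.
c = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
sigma = np.array([0.5, 1.0])

def p(x):
    comps = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ c            # weighted sum of the component densities

print(p(np.array([0.0, 1.0])))

# The first two moments follow from the individual component moments.
mean = np.sum(c * mu)
var = np.sum(c * (sigma ** 2 + mu ** 2)) - mean ** 2
print(mean, var)                # 0.45, 3.3475
```
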
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.

What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters

Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations; let's use them.

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, and the particles are placed according to some probability measure: the higher the probability, the denser the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n | y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n | y_{0:n})$.

With this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n) \, \hat{p}(x_n | y_{0:n}) \, dx_n = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \, \delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$

Sequential Monte Carlo estimation II

Since it is usually impossible to sample from the true posterior, it is common to sample from the so-called proposal distribution $q(x_n | y_{0:n})$. Let's define

$$W_n(x_n) = \frac{p(y_{0:n} | x_n) \, p(x_n)}{q(x_n | y_{0:n})}$$

We can then write

$$E[f(x_n)] = \int f(x_n) \, \frac{p(x_n | y_{0:n})}{q(x_n | y_{0:n})} \, q(x_n | y_{0:n}) \, dx_n = \int f(x_n) \, \frac{W_n(x_n)}{p(y_{0:n})} \, q(x_n | y_{0:n}) \, dx_n$$
$$= \frac{\int f(x_n) \, W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n}{\int p(y_{0:n} | x_n) \, p(x_n) \, dx_n} = \frac{\int f(x_n) \, W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n}{\int W_n(x_n) \, q(x_n | y_{0:n}) \, dx_n} = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}$$

Sequential Monte Carlo estimation III

We have written

$$E[f(x_n)] = \frac{E_{q(x_n | y_{0:n})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{0:n})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n | y_{0:n})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}$$

Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

$$q(x_{0:n} | y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{0:t-1}, y_{0:t})$$

As before, the posterior can be written as

$$p(x_{0:n} | y_{0:n}) = p(x_{0:n-1} | y_{0:n-1}) \, \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{0:n}^{(i)} | y_{0:n})}{q(x_{0:n}^{(i)} | y_{0:n})} \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{0:n-1}^{(i)} | y_{0:n-1})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n}) \, q(x_{0:n-1}^{(i)} | y_{0:n-1})} = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_{0:n})}$$

Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.

Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method that is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the k-th quadrature point.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
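As a small illustration of quadrature with a Gaussian weighting function, the sketch below uses Gauss-Hermite nodes from NumPy (the test functions are arbitrary choices); a change of variables maps the $e^{-t^2}$ weight onto the N(0,1) density.

import numpy as np

# Gauss-Hermite quadrature: int e^{-t^2} g(t) dt ~ sum_k w_k g(t_k).
# With x = sqrt(2) t this yields E[f(x)] for x ~ N(0, 1).
def gauss_hermite_expectation(f, m=20):
    t, w = np.polynomial.hermite.hermgauss(m)     # m quadrature points
    return np.sum(w * f(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

print(gauss_hermite_expectation(lambda x: x**2))  # E[x^2] = 1
print(gauss_hermite_expectation(np.cos))          # E[cos x] = exp(-1/2)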
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so as to give greater resolution to regions of high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
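A quick sample-moment check, with a standard Gaussian as the (illustrative) true distribution so the estimates can be compared against known values:

import numpy as np

# Approximate the k-th moment of p by averaging over N i.i.d. draws.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                 # samples from p = N(0, 1)

for k in (1, 2, 3, 4):
    print(f"m_{k} ~ {np.mean(x**k):+.3f}")   # true values: 0, 1, 0, 3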
Gaussian Sum Approximation
The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.
Any non-Gaussian density can be approximated, to any desired degree of accuracy, by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
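A minimal sketch of evaluating such a Gaussian sum in 1-D (the component weights, means, and variances below are invented for illustration):

import numpy as np

# Weighted sum of 1-D Gaussian densities: p(x) = sum_j c_j N(x; mu_j, sigma_j^2).
def gmm_pdf(x, c, mu, sigma):
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comp @ c                          # mix the component densities

c = np.array([0.3, 0.5, 0.2])                # c_j > 0, summing to one
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])
print(gmm_pdf(np.linspace(-5.0, 6.0, 5), c, mu, sigma))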
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen?
We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many regions, which are filled with particles according to some probability measure: the higher the probability, the denser the concentration of particles.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta\!\left(x_n - x_n^{(i)}\right) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.
With this approximation, we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\!\left(x_n - x_n^{(i)}\right)dx_n = \frac{1}{N}\sum_{i=1}^{N} f\!\left(x_n^{(i)}\right) \equiv \hat{f}_N(x)$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n = \int f(x_n)\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$$
$$= \frac{\int f(x_n)W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)p(x_n)\,dx_n} = \frac{\int f(x_n)W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
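The estimator above can be coded in a few lines. In this sketch the target and proposal are simple Gaussians of my choosing, and both densities are used only up to normalizing constants, exactly as the self-normalized form allows:

import numpy as np

# Self-normalized importance sampling: draw from q, weight by W = p/q
# (unnormalized), normalize the weights, and average.
rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(0.0, 2.0, size=N)              # x^(i) ~ q = N(0, 2^2)

log_p = -0.5 * ((x - 1.0) / 0.5) ** 2         # target p = N(1, 0.5^2), unnormalized
log_q = -0.5 * (x / 2.0) ** 2                 # proposal density, unnormalized
W = np.exp(log_p - log_q)
W_tilde = W / W.sum()                         # normalized weights

print(np.sum(W_tilde * x))                    # ~ E_p[x] = 1.0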
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance Sampling (SIS) filter
Sampling Importance Resampling (SIR) filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance Sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.
Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance Sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.
A suggested measure of degeneracy is the so-called effective sample size:

$$N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}\!\left[(W(x_{n:0}))^2\right]} \le N$$

In practice this cannot be computed, so we approximate it as

$$N_{eff} \approx \frac{1}{\sum_{i=1}^{N}\left(\tilde{W}(x_{n:0}^{(i)})\right)^2}$$

When $N_{eff}$ falls below a threshold P, resampling is performed.
$N_{eff}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
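The approximation of N_eff is one line of code; the sketch below (with invented weight vectors) shows how it separates a healthy particle set from a degenerate one:

import numpy as np

def effective_sample_size(w_tilde):
    # Neff ~ 1 / sum_i W_i^2 for normalized weights W_i
    return 1.0 / np.sum(w_tilde**2)

uniform = np.full(100, 0.01)                      # evenly weighted particles
skewed = np.r_[0.99, np.full(99, 0.01 / 99)]      # one dominant particle
print(effective_sample_size(uniform))             # 100.0
print(effective_sample_size(skewed))              # ~ 1.02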
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$
  end for
  Compute $N_{eff} = \frac{1}{\sum_{i=1}^{N}(\tilde{W}(x_{n:0}^{(i)}))^2}$
  if $N_{eff} < P$ then
    Generate new $x_n^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
    Reset the weights $W_n^{(i)} = \frac{1}{N}$
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
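Below is a runnable sketch of this loop on a scalar linear-Gaussian model (the model, the number of particles, and the threshold are all illustrative choices). The transition prior is used as the proposal q, so the weight update reduces to $W_n = W_{n-1}\,p(y_n|x_n)$:

import numpy as np

# SIS particle filter with resampling on x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n,
# with v_n, w_n ~ N(0, 1). Proposal q = transition prior.
rng = np.random.default_rng(2)
N, T, P = 500, 50, 250                    # particles, time steps, Neff threshold

x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + rng.normal()
y = x_true + rng.normal(size=T)           # noisy observations

particles = rng.normal(size=N)            # x_0^(i)
W = np.full(N, 1.0 / N)

for n in range(1, T):
    particles = 0.9 * particles + rng.normal(size=N)   # draw from q
    W *= np.exp(-0.5 * (y[n] - particles) ** 2)        # W_n ~ W_{n-1} p(y_n|x_n)
    W /= W.sum()
    x_hat = np.sum(W * particles)         # estimate before any resampling
    if 1.0 / np.sum(W**2) < P:            # resample only when Neff < P
        particles = particles[rng.choice(N, size=N, p=W)]
        W = np.full(N, 1.0 / N)
    if n % 10 == 0:
        print(f"n={n:2d}  x={x_true[n]:+.2f}  xhat={x_hat:+.2f}")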
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the corresponding particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
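Here is a sketch of this bootstrap filter on the classic 1-D nonlinear benchmark of Gordon et al. (the noise scales and particle count are illustrative). Note that the posterior mean is computed before resampling, in line with the remarks that follow:

import numpy as np

# Bootstrap/SIR filter on x_n = 0.5 x + 25 x/(1+x^2) + 8 cos(1.2 n) + v_n,
# y_n = x_n^2/20 + w_n, with v ~ N(0, 10), w ~ N(0, 1).
rng = np.random.default_rng(3)
N, T = 1000, 50

def transition(x, n):
    return 0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * n)

x_true, y = np.zeros(T), np.zeros(T)
for n in range(1, T):
    x_true[n] = transition(x_true[n - 1], n) + rng.normal(scale=np.sqrt(10.0))
    y[n] = x_true[n] ** 2 / 20.0 + rng.normal()

particles = rng.normal(size=N)
for n in range(1, T):
    # importance sampling from the transition prior
    particles = transition(particles, n) + rng.normal(scale=np.sqrt(10.0), size=N)
    W = np.exp(-0.5 * (y[n] - particles**2 / 20.0) ** 2)   # likelihood weights
    W /= W.sum()
    x_hat = np.sum(W * particles)          # posterior mean before resampling
    particles = particles[rng.choice(N, size=N, p=W)]      # resample every step
    if n % 10 == 0:
        print(f"n={n:2d}  x={x_true[n]:+6.2f}  xhat={x_hat:+6.2f}")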
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution plays a crucial role in the final performance of both SIS and SIR filters.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies exist.
To alleviate sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Monte Carlo sampling
It's a brute-force technique that applies provided one can draw i.i.d. samples $x^{(1)}, \cdots, x^{(N)}$ from the probability distribution $P(x)$, so that

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\!\left(x^{(i)}\right) = \hat{f}_N$$

for which $E[\hat{f}_N] = E[f]$ and $\mathrm{Var}[\hat{f}_N] = \frac{1}{N}\mathrm{Var}[f] = \frac{\sigma^2}{N}$.
By the Kolmogorov Strong Law of Large Numbers (under some mild regularity conditions), $\hat{f}_N(x)$ converges to $E[f(x)]$ almost surely.
The convergence rate is assessed by the Central Limit Theorem:

$$\sqrt{N}\left(\hat{f}_N - E[f]\right) \sim \mathcal{N}(0, \sigma^2)$$

where $\sigma^2$ is the variance of $f(x)$. The error rate is of order $O(N^{-1/2})$.
An important property is that the estimation accuracy is independent of the dimensionality of the state space.
The variance of the estimate is inversely proportional to the number of samples.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
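A tiny experiment illustrating the $O(N^{-1/2})$ rate (f and the sampling distribution are arbitrary choices): each 100-fold increase in N shrinks the error by roughly a factor of 10, regardless of the dimension of x.

import numpy as np

# Plain Monte Carlo estimate of E[f(x)] for f(x) = x^2, x ~ N(0, 1); truth = 1.
rng = np.random.default_rng(4)
for N in (10**2, 10**4, 10**6):
    f_N = np.mean(rng.normal(size=N) ** 2)
    print(f"N={N:>7}  estimate={f_N:.4f}  |error|={abs(f_N - 1.0):.1e}")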
Fundamental problems of Monte Carlo estimation
Monte Carlo methods approximate

$$\int_X f(x)\,dP(x) \approx \frac{1}{N}\sum_{i=1}^{N} f\!\left(x^{(i)}\right) = \hat{f}_N$$

There are two fundamental problems:
How to draw samples from a probability distribution $P(x)$?
How to estimate the expectation of a function w.r.t. the distribution or density, i.e., $E[f(x)] = \int f(x)\,dP(x)$?
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency: an estimator is consistent if it converges to the true value, with high probability, as the number of observations approaches infinity.
Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.
Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.
Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties of the model.
Minimal variance.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achieve computational efficiency.
This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.
The idea is to choose a proposal distribution $q(x)$ in place of the true probability distribution $p(x)$, which is hard to sample:

$$\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx$$
Figure: Importance sampling (Chen 03)
Monte Carlo importance sampling uses N independent samples drawn from $q(x)$ to approximate

$$\hat{f} = \frac{1}{N}\sum_{i=1}^{N} W(x^{(i)})\,f(x^{(i)})$$

where $W(x^{(i)}) = \frac{p(x^{(i)})}{q(x^{(i)})}$ are called the importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
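A minimal sketch with fully normalized densities (the target, proposal, and f are invented for the example), so the plain, unnormalized-weight estimator applies:

import numpy as np

# Importance sampling: E_p[f] ~ (1/N) sum_i W(x^(i)) f(x^(i)), W = p/q.
rng = np.random.default_rng(5)
N = 100_000
x = rng.normal(0.0, 3.0, size=N)                    # x^(i) ~ q = N(0, 3^2)

p = np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2.0 * np.pi)        # target N(3, 1)
q = np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))
W = p / q                                           # importance weights

print(np.mean(W * x))                               # ~ E_p[x] = 3.0
print(np.mean(W))                                   # ~ 1 (weight sanity check)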
Importance Sampling II
If the normalizing factor of $p(x)$ is not known, the importance weights can only be evaluated up to a normalizing constant.
To ensure the importance weights sum to one, we normalize them:

$$\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)})\,f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$$

The variance of the estimate is given by

$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\!\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int \frac{f^2(x)\,p^2(x)}{q(x)}\,dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when $q(x)$ is chosen to:
match the shape of $p(x)$, so as to approximate the true variance;
match the shape of $|f(x)|p(x)$, so as to further reduce the true variance.
The estimator is biased but consistent.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution $q(x)$ should have heavy tails, so as to be insensitive to outliers.
If $q(\cdot)$ is not close to $p(\cdot)$, the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely to be dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.
Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in X$. The sampling proceeds as:
for n = 1 to N do
  Sample $x \sim q(x)$
  Sample $u \sim U(0, 1)$
  if $u > \frac{p(x)}{Cq(x)}$ then
    Repeat sampling
  end if
end for
Figure: Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability of a random variable is inversely proportional to the constant C.
The choice of C is critical: if C is too small, the samples are not reliable because of the low rejection rate; if C is too large, sampling is inefficient since the acceptance rate will be low.
If the prior $p(x)$ is used as $q(x)$, the likelihood satisfies $p(y|x) \le C$, and C is known, then

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \le \frac{C\,q(x)}{p(y)} \equiv C'q(x)$$

and the acceptance rate for a sample x is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
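A short sketch of the procedure (the Beta(2,2) target and uniform proposal are illustrative choices; here the bound C = max p = 1.5 is known exactly):

import numpy as np

# Rejection sampling from p(x) = 6x(1-x) on [0, 1] with q = U(0, 1), C = 1.5.
rng = np.random.default_rng(6)
C = 1.5
samples = []
while len(samples) < 10_000:
    x = rng.uniform()                     # x ~ q
    u = rng.uniform()                     # u ~ U(0, 1)
    if u <= 6.0 * x * (1.0 - x) / C:      # accept with probability p(x)/(C q(x))
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())      # Beta(2, 2): mean 0.5, variance 0.05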
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get samples when the ratio $\frac{p(x)}{Cq(x)}$ is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

$$p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}$$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem; after a few iterations of the algorithm, only few or one of the $W(x^{(i)})$ will be nonzero.
We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples, instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights:
Sample N random samples $\{x^{(i)}\}_{i=1}^{N}$ from $q(x)$
for i = 1, ..., N do
  Compute $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
  Normalize weights $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{i=1}^{N} W(x^{(i)})}$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^{N}$, where the probability of resampling each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
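The resampling step itself is a one-liner with NumPy's weighted choice; the sketch below wraps it as a function (the particle values are random placeholders):

import numpy as np

# Multinomial resampling with replacement; weights reset to 1/N afterwards.
def multinomial_resample(particles, w_tilde, rng):
    N = len(particles)
    idx = rng.choice(N, size=N, p=w_tilde)    # resample indices by weight
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(7)
particles = rng.normal(size=10)
w = rng.uniform(size=10)
w /= w.sum()                                   # normalized importance weights
new_particles, new_w = multinomial_resample(particles, w, rng)
print(new_particles, new_w)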
Remarks on Sampling Importance Resampling
Resampling can be taken at every step, or only when regarded as necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1). Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power; when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $x = [x^1, x^2, \cdots, x^{N_x}]^T$, we are interested in drawing samples from the marginal density when the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.
Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_n^i \sim p(x^i|x_n^1, \cdots, x_n^{i-1}, x_{n-1}^{i+1}, \cdots, x_{n-1}^{N_x})$
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
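A classic sketch: Gibbs sampling for a bivariate Gaussian with correlation rho, where both full conditionals are Gaussian and known in closed form (rho and the chain length are illustrative choices):

import numpy as np

# For (x1, x2) ~ N(0, [[1, rho], [rho, 1]]):
#   x1 | x2 ~ N(rho * x2, 1 - rho^2) and x2 | x1 ~ N(rho * x1, 1 - rho^2).
rng = np.random.default_rng(8)
rho, M = 0.8, 20_000
s = np.sqrt(1.0 - rho**2)

x1, x2 = 0.0, 0.0
chain = np.empty((M, 2))
for n in range(M):
    x1 = rng.normal(rho * x2, s)          # sample from p(x1 | x2)
    x2 = rng.normal(rho * x1, s)          # sample from p(x2 | x1)
    chain[n] = x1, x2

burned = chain[1000:]                     # drop burn-in
print(burned.mean(axis=0))                # ~ (0, 0)
print(np.corrcoef(burned.T)[0, 1])        # ~ rho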
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x^1$ is sampled from the conditional pdf $p(x^1|x_{n-1}^2)$. Middle: a sample is drawn from the conditional pdf $p(x^2|x_n^1)$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Monte Carlo sampling
Itrsquos brute force technique that provided that one can drawn iid samplesx(1) middot middot middot xN from probability distribution P(x) so thatint
X
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
for which E [fN ] = E [f ] and Var[fN ] = 1N Var[f ] = σ2
N
By the Kolmogorov Strong Law of Large Numbers (under some mildregularity conditions) fN (x) converges to E [f (x)] with high probability
The convergence rate is assessed by the Central Limit Theoremradic
N(
fN minus E [f ])sim N (0 σ2)
where σ2 is the variance of f (x) The error rate is of order O(Nminus12)
An important property is that the estimation accuracy is independent of thedimensionality of the state space
The variance of estimate is inversely proportional to the number of samples
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 27 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional to the constant $C$.
The choice of $C$ is critical:
if $C$ is too small, the samples are not reliable because the bound $p(x) < Cq(x)$ may fail (the rejection rate is low);
if $C$ is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior $p(x)$ is used as $q(x)$, the likelihood satisfies $p(y|x) \le C$, and $C$ is known, then
\[ p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x) \]
and the acceptance rate for sample $x$ is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
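As a sketch (with assumed toy densities, not from the slides): take the target $p$ to be the Beta(2,2) density $6x(1-x)$ on $[0,1]$, the proposal $q$ to be Uniform(0,1), and $C = 1.5$, which satisfies $p(x) \le Cq(x)$ since the Beta(2,2) density peaks at 1.5.

import numpy as np

rng = np.random.default_rng(0)
C = 1.5                                   # known constant with p(x) <= C q(x)

def p_pdf(x):                             # target: Beta(2, 2) density 6x(1 - x)
    return 6.0 * x * (1.0 - x)

samples = []
while len(samples) < 10_000:
    x = rng.uniform()                     # sample x ~ q(x) = Uniform(0, 1)
    u = rng.uniform()                     # sample u ~ U(0, 1)
    if u <= p_pdf(x) / C:                 # accept unless u > p(x) / (C q(x)); here q(x) = 1
        samples.append(x)                 # accepted draws are exact samples from p
print(np.mean(samples), np.var(samples))  # Beta(2,2): mean 0.5, variance 0.05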
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is prior knowledge of the constant $C$, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio $p(x)/(Cq(x))$ is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form
\[ q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0}) \]
then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have
\[ p(x_{n:0}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0,\cdots,x_{n-1}) \]
\[ q(x_{n:0}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0,\cdots,x_{n-1}) \]
The weights can be recursively calculated as
\[ W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0})\,\frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})} \]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the $W(x^{(i)})$ will be nonzero.
We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights:
Sample $N$ random samples $\{x^{(i)}\}_{i=1}^{N}$ from $q(x)$
for i = 1, ..., N do
    $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
    Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
end for
Resample with replacement $N$ times from the discrete set $\{x^{(i)}\}_{i=1}^{N}$, where the probability of resampling each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
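A minimal numerical sketch of the SIR recipe above, under assumed toy densities (target $p = \mathcal{N}(0,1)$, proposal $q = \mathcal{N}(0,2^2)$); after resampling, the equally weighted particles are approximately distributed as $p$.

import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(0.0, 2.0, size=N)               # N samples from the proposal q
log_w = -0.5 * x ** 2 + 0.5 * (x / 2.0) ** 2   # log p(x) - log q(x), up to an additive constant
w = np.exp(log_w - log_w.max())                # exponentiate stably
w /= w.sum()                                   # normalized importance weights
idx = rng.choice(N, size=N, p=w)               # resample with replacement, prob. proportional to w
resampled = x[idx]                             # equal-weight particle set (weights 1/N)
print(resampled.mean(), resampled.std())       # close to 0 and 1, the moments of p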
Remarks on Sampling Importance Resampling
Resampling can be performed at every step or only when regarded as necessary:
Deterministic resampling is performed every $k$ time steps (usually $k = 1$);
Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).
Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling can provide chances for selecting important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $\mathbf{x} = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.
Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
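A minimal sketch of the sampler above for a hypothetical two-dimensional target, a zero-mean Gaussian with unit variances and correlation $\rho$, for which both full conditionals are Gaussian and can be drawn exactly.

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5000                         # assumed correlation; number of Gibbs sweeps
x1, x2 = 0.0, 0.0                          # initialize the chain (stands in for x_0 ~ p(x_0))
samples = np.empty((M, 2))
for n in range(M):
    # alternate the conditional draws, always conditioning on the most recent value
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # x2 ~ p(x2 | x1)
    samples[n] = (x1, x2)
print(np.corrcoef(samples.T)[0, 1])        # sample correlation, close to rho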
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_{n-1}$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good since the gradient direction might indicate the way to find states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.
The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$, and further computes the volume:
\[ \int f(x)\,dx \approx (2\pi)^{N_x/2} f(\hat{x})\,\big|{-\nabla\nabla \log f(\hat{x})}\big|^{-1/2} \]
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
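A minimal 1-D sketch of the Laplace formula above, for an assumed integrand $f(x) = \exp(-(x-1)^4 - x^2)$, with the mode found on a grid and the Hessian of $\log f$ taken by finite differences (here $N_x = 1$).

import numpy as np

def log_f(x):
    return -(x - 1.0) ** 4 - x ** 2          # assumed unimodal integrand, on the log scale

grid = np.linspace(-4.0, 4.0, 80001)
x_hat = grid[np.argmax(log_f(grid))]         # crude mode finder (a stand-in for an optimizer)

h = 1e-4                                     # finite-difference step for the second derivative
hess = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2

laplace = np.sqrt(2.0 * np.pi) * np.exp(log_f(x_hat)) * (-hess) ** -0.5
exact = np.trapz(np.exp(log_f(grid)), grid)  # brute-force reference value
print(laplace, exact)                        # the two should roughly agree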
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
\[ \int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k) \]
where $p(x)$ is treated as a weighting function and $x_k$ are the quadrature points.
The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
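As a sketch: when the weighting function is the standard Gaussian $p(x) = \mathcal{N}(0,1)$, numpy's Gauss-Hermite rule (for the weight $e^{-x^2}$) gives quadrature points and coefficients, and a change of variables yields $E_p[f(x)] \approx \frac{1}{\sqrt{\pi}}\sum_k c_k f(\sqrt{2}\,x_k)$.

import numpy as np

def gauss_expectation(f, m=20):
    # quadrature points x_k and coefficients c_k for the weight exp(-x^2)
    xk, ck = np.polynomial.hermite.hermgauss(m)
    return np.sum(ck * f(np.sqrt(2.0) * xk)) / np.sqrt(np.pi)

print(gauss_expectation(lambda x: x ** 2))   # E[x^2] = 1 under N(0, 1)
print(gauss_expectation(np.cos))             # E[cos x] = exp(-0.5), about 0.6065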
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the regions of high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
\[ \hat{m}_k = E[x^k] = \int_{\mathcal{X}} x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k \]
where $\hat{m}_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
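A small sketch of the sample-moment approximation for an assumed $p = \mathcal{N}(0,1)$, whose true moments are known ($m_1 = 0$, $m_2 = 1$, $m_3 = 0$, $m_4 = 3$).

import numpy as np

x = np.random.default_rng(0).normal(size=100_000)   # samples from the "true" distribution
for k in (1, 2, 3, 4):
    print(k, (x ** k).mean())                       # sample k-th moment, close to the true moment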
Gaussian Sum Approximation
The Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):
\[ p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\,\mu_j, \Sigma_j) \]
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to any desired degree of accuracy by a sufficiently large number of Gaussian mixture densities.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
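A minimal 1-D sketch with hypothetical components $(\mu_j, \sigma_j, c_j)$ (all values assumed, not from the slides); the mixture is a valid density, and its moments follow from the component moments.

import numpy as np

mus = np.array([-2.0, 0.5, 3.0])           # hypothetical component means
sigmas = np.array([0.5, 1.0, 0.7])         # hypothetical component standard deviations
cs = np.array([0.3, 0.5, 0.2])             # mixture weights, c_j > 0 and summing to 1

def mixture_pdf(x):
    # stack the component densities N(x; mu_j, sigma_j^2) and mix with weights c_j
    comps = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ cs

xs = np.linspace(-6.0, 7.0, 2001)
print(np.trapz(mixture_pdf(xs), xs))       # integrates to about 1
print(cs @ mus)                            # mixture mean from the component first moments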
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, which are filled with particles according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
\[ p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0}) \]
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.
By this approximation, we can estimate the mean of a nonlinear function:
\[ E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x) \]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{n:0})$. Let's define
\[ W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})} \]
so that, by Bayes' rule, $p(x_n|y_{n:0}) = W_n(x_n)\,q(x_n|y_{n:0})/p(y_{n:0})$. We can then write
\begin{align*}
E[f(x_n)] &= \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n \\
&= \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n \\
&= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n} \\
&= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} \\
&= \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
\end{align*}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
\[ E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]} \]
By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$, we can approximate
\[ E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x) \]
where the normalized weights are defined as
\[ \tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})} \]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
\[ q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0}) \]
As before, the posterior can be written as
\[ p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})} \]
We can then create a recursive rule to update the weights:
\[ W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})\,q(x_{n-1:0}^{(i)}|y_{n-1:0})} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})} \]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.
Let's assume that $q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)$; then we can write
\[ W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)} \]
The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
\[ N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N \]
In practice this cannot be computed, so we approximate it as
\[ \hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \big(\tilde{W}(x_{n:0}^{(i)})\big)^2} \]
When $\hat{N}_{\text{eff}}$ falls below a threshold $P$, resampling is performed.
$\hat{N}_{\text{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
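The approximation is a one-liner in practice; a minimal sketch:

import numpy as np

def effective_sample_size(weights):
    # N_eff approx. 1 / sum_i w_i^2 for normalized weights w
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # ensure the weights are normalized
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.ones(100)))         # uniform weights: N_eff = 100
print(effective_sample_size([1.0] + [1e-6] * 99))  # one dominant weight: N_eff close to 1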
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample $x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate weight $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
    end for
    Compute $\hat{N}_{\text{eff}} = 1/\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2$
    if $\hat{N}_{\text{eff}} < P$ then
        Generate new $x_n^{(j)}$ by resampling with replacement $N$ times from $\{x_{n:0}^{(i)}\}$, with probability $P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}$
        Reset the weights $W_n^{(i)} = \frac{1}{N}$
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
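A minimal runnable sketch of this algorithm for a hypothetical linear-Gaussian model $x_n = 0.9\,x_{n-1} + v_n$, $y_n = x_n + w_n$ (the model, noise levels, and all names here are assumptions, not from the slides). The transition prior is used as proposal, so the incremental weight reduces to the likelihood; setting the threshold $P = N$ forces resampling at every step and recovers the bootstrap/SIR filter of the next slides.

import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def sis_filter(ys, N=500, a=0.9, q_var=1.0, r_var=0.5, P=None):
    P = N / 2 if P is None else P              # resampling threshold on N_eff
    x = rng.normal(0.0, 1.0, size=N)           # particles drawn from the prior p(x_0)
    W = np.full(N, 1.0 / N)                    # initial weights 1/N
    means = []
    for y in ys:
        x = a * x + rng.normal(0.0, np.sqrt(q_var), size=N)  # draw x_n ~ p(x_n | x_{n-1})
        W *= norm_pdf(y, x, r_var)             # weight update: W_n = W_{n-1} p(y_n | x_n)
        W /= W.sum()                           # normalize the weights
        means.append(np.sum(W * x))            # posterior-mean estimate, computed before resampling
        if 1.0 / np.sum(W ** 2) < P:           # N_eff below the threshold?
            x = x[rng.choice(N, size=N, p=W)]  # resample with replacement, prob. proportional to W
            W = np.full(N, 1.0 / N)            # reset the weights to 1/N
    return np.array(means)

# toy usage: simulate the model, then filter the observations
xs, ys = [0.0], []
for _ in range(50):
    xs.append(0.9 * xs[-1] + rng.normal(0.0, 1.0))
    ys.append(xs[-1] + rng.normal(0.0, np.sqrt(0.5)))
print(sis_filter(np.array(ys))[:5])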
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
    end for
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate sample degeneracy in the SIS filter, we can change the weight update to
\[ W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)} \]
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Fundamental problems of Monte Carlo estimation
Monte carlo methods approximateintX
f (x)dP(x) asymp 1
N
Nsumi=1
f(
x(i))
= fN
There are two fundamental problems
How to drawn samples from a probability distribution P(x)
How to estimate the expectation of a function wrt the distributionor density ie E [f (x)] =
intf (x)dP(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 28 69
Important properties of an estimator
Consistency An estimator is consistent if the estimator converges to thetrue value with high probability as the number of observations approachesinfinity
Unbiasedness An estimator is unbiased if its expected value is equal to thetrue value
Efficiency An estimator is efficient if it produces the smallest errorcovariance matrix among all unbiased estimators
Robustness An estimator is robust if it is insensitive to the grossmeasurement errors and the uncertainties of the model
Minimal variance
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 29 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$$

In practice this cannot be computed, so we approximate it as

$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2}$$

When $N_{\text{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
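A minimal sketch (ours, with made-up example weights) of the approximate effective sample size computed from normalized weights:

    import numpy as np

    def effective_sample_size(w_tilde):
        """N_eff ~ 1 / sum_i w_i^2 for normalized weights (they sum to 1)."""
        return 1.0 / np.sum(w_tilde ** 2)

    w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # skewed weights
    print(effective_sample_size(w))             # ~ 1.9, far below N = 5, so resample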
SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
  end for
  Compute $N_{\text{eff}} = 1 / \sum_{i=1}^N (\tilde{W}(x_{0:n}^{(i)}))^2$
  if $N_{\text{eff}} < P$ then
    Generate new $x_{0:n}^{(j)}$ by resampling with replacement N times from $\{x_{0:n}^{(i)}\}$ with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for
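The following Python sketch (our own illustration; every model function name is a hypothetical placeholder the user supplies, not part of the lecture) implements this loop with a generic proposal and N_eff-triggered multinomial resampling, followed by a usage example on a 1D linear-Gaussian model where the transition prior is used as the proposal.

    import numpy as np

    def sis_filter(y, x0_sampler, q_sampler, q_logpdf, trans_logpdf, lik_logpdf,
                   N=500, threshold=None, rng=None):
        """Generic SIS particle filter with occasional resampling (sketch)."""
        rng = rng or np.random.default_rng(0)
        threshold = threshold or N / 2                  # resample when N_eff < N/2
        x = x0_sampler(rng, N)
        logW = np.zeros(N)
        means = []
        for y_n in y:
            x_new = q_sampler(rng, x, y_n)
            logW += (lik_logpdf(y_n, x_new) + trans_logpdf(x_new, x)
                     - q_logpdf(x_new, x, y_n))         # recursive weight update
            x = x_new
            w = np.exp(logW - logW.max())               # stabilized normalization
            w /= w.sum()
            means.append(np.sum(w * x))                 # filtered mean estimate
            if 1.0 / np.sum(w ** 2) < threshold:        # N_eff test
                x = x[rng.choice(N, size=N, p=w)]       # multinomial resampling
                logW = np.zeros(N)                      # reset weights to uniform
        return np.array(means)

    # Usage sketch: 1D random-walk state, Gaussian observations, prior as proposal.
    log_gauss = lambda d, s: -0.5 * (d / s) ** 2
    y = np.random.default_rng(1).normal(0, 1, 50)       # fake observations
    m = sis_filter(y,
                   x0_sampler=lambda rng, N: rng.normal(0, 1, N),
                   q_sampler=lambda rng, x, y: x + rng.normal(0, 1, x.shape),
                   q_logpdf=lambda xn, xp, yn: log_gauss(xn - xp, 1.0),
                   trans_logpdf=lambda xn, xp: log_gauss(xn - xp, 1.0),
                   lik_logpdf=lambda yn, xn: log_gauss(yn - xn, 0.5))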
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
  end for
  Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
  end for
  Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
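To make this concrete, here is a runnable demo (our choice of example, not from the lecture) of the bootstrap filter on the classic 1D nonlinear benchmark model; since resampling happens at every step, the weight at step n is just the likelihood.

    import numpy as np

    rng = np.random.default_rng(0)
    T, N = 50, 1000

    # Benchmark model (our assumption): x_n = 0.5 x + 25 x/(1+x^2) + 8 cos(1.2 n) + v,
    # y_n = x_n^2 / 20 + w, with v ~ N(0, 10) and w ~ N(0, 1).
    def transition(x, n):
        return 0.5 * x + 25.0 * x / (1.0 + x ** 2) + 8.0 * np.cos(1.2 * n)

    # Simulate a ground-truth trajectory and observations.
    x_true, y = np.zeros(T), np.zeros(T)
    x = rng.normal(0, 1)
    for n in range(T):
        x = transition(x, n) + rng.normal(0, np.sqrt(10.0))
        x_true[n] = x
        y[n] = x ** 2 / 20.0 + rng.normal(0, 1.0)

    # Bootstrap filter: propose from the transition prior, weight by the
    # likelihood, resample every step (weights never need to be carried over).
    particles = rng.normal(0, 1, N)
    x_est = np.zeros(T)
    for n in range(T):
        particles = transition(particles, n) + rng.normal(0, np.sqrt(10.0), N)
        w = np.exp(-0.5 * (y[n] - particles ** 2 / 20.0) ** 2)  # p(y_n | x_n^(i))
        w /= w.sum()
        x_est[n] = np.sum(w * particles)                        # posterior mean
        particles = particles[rng.choice(N, size=N, p=w)]       # resampling

    print(np.sqrt(np.mean((x_est - x_true) ** 2)))              # filter RMSE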
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter we can change the update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
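In the log domain the annealed update is a one-line change; a tiny sketch (our own, with made-up numbers and alpha as a tuning constant):

    import numpy as np

    def annealed_log_weight(logW_prev, log_lik, log_trans, log_q, alpha=0.5):
        """Annealed SIS update W_n = W_{n-1}^alpha * p(y|x) p(x|x') / q(x|x', y), in logs."""
        return alpha * logW_prev + log_lik + log_trans - log_q

    print(annealed_log_weight(np.log(0.1), -1.0, -0.5, -0.7))  # example numbers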
Popular CONDENSATION

Figure: CONDENSATION
Important properties of an estimator

Consistency: an estimator is consistent if it converges to the true value with high probability as the number of observations approaches infinity.

Unbiasedness: an estimator is unbiased if its expected value is equal to the true value.

Efficiency: an estimator is efficient if it produces the smallest error covariance matrix among all unbiased estimators.

Robustness: an estimator is robust if it is insensitive to gross measurement errors and to uncertainties in the model.

Minimal variance.
Types of Monte Carlo sampling

Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs sampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces where the data is sparse and the region of interest where the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample:

$$\int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx$$

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

$$\hat{f} = \frac{1}{N}\sum_{i=1}^N W(x^{(i)})\,f(x^{(i)})$$

where $W(x^{(i)}) = p(x^{(i)})/q(x^{(i)})$ are called the importance weights.
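A minimal sketch (ours) of this unnormalized estimator, with p and q both fully known densities so the weights need no normalization:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000

    # Target p = N(0, 1), proposal q = N(0, 2^2); estimate E_p[x^4] = 3.
    x = rng.normal(0.0, 2.0, N)
    p = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    q = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
    W = p / q                                  # importance weights
    print(np.mean(W * x ** 4))                 # ~ 3.0, the 4th moment of N(0,1)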
Importance Sampling II

If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure that the importance weights are normalized:

$$\hat{f} = \sum_{i=1}^N \tilde{W}(x^{(i)})\,f(x^{(i)}) \quad \text{with} \quad \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^N W(x^{(j)})}$$

The variance of the estimate is given by

$$\mathrm{Var}[\hat{f}] = \frac{1}{N}\mathrm{Var}[f(x)W(x)] = \frac{1}{N}\mathrm{Var}\!\left[f(x)\frac{p(x)}{q(x)}\right] = \frac{1}{N}\int\left(\frac{f(x)p(x)}{q(x)}\right)^2 q(x)\,dx - \frac{(E[f(x)])^2}{N}$$

The variance can be reduced when q(x) is chosen to
  match the shape of p(x), so as to approximate the true variance;
  match the shape of |f(x)| p(x), so as to further reduce the true variance.

The (normalized) estimator is biased but consistent.
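The effect of the proposal on the estimator variance is easy to see empirically; a small experiment (our own construction) comparing a badly-matched and a well-matched q for the same target:

    import numpy as np

    rng = np.random.default_rng(0)

    def is_estimate(scale, N=2000):
        """One unnormalized-IS estimate of E_p[x] for p = N(3, 1), q = N(0, scale^2)."""
        x = rng.normal(0.0, scale, N)
        p = np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)
        q = np.exp(-0.5 * (x / scale) ** 2) / (scale * np.sqrt(2 * np.pi))
        return np.mean((p / q) * x)

    for scale in (1.0, 4.0):                  # a narrow q misses the target's mass
        est = [is_estimate(scale) for _ in range(200)]
        print(scale, np.var(est))             # variance is far larger for scale = 1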
Remarks on importance sampling

It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have heavy tails so as to be insensitive to outliers.

If q(.) is not close to p(.), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.

In a high-dimensional space the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant $C < \infty$ such that $p(x) < Cq(x)$ for every $x \in \mathcal{X}$. The sampling proceeds as:

for n = 1 to N do
  Sample $u \sim U(0, 1)$
  Sample $x \sim q(x)$
  if $u > \frac{p(x)}{Cq(x)}$ then
    Repeat sampling
  end if
end for

Figure: Importance (left) and Rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical:
  if C is too small, the samples are not reliable because of the low rejection rate;
  if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies $p(y|x) \le C$, and C is known, then

$$p(x|y) = \frac{p(y|x)p(x)}{p(y)} \le \frac{Cq(x)}{p(y)} \equiv C'q(x)$$

and the acceptance rate for a sample x is $\frac{p(x|y)}{C'q(x)} = \frac{p(y|x)}{C}$.
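A runnable sketch (ours) of the accept/reject loop for a toy target, with q = U(0, 1) and a known bound C:

    import numpy as np

    rng = np.random.default_rng(0)

    # Target p(x) = 2x on [0, 1]; proposal q = U(0, 1); bound p(x) <= C q(x) with C = 2.
    C = 2.0
    p = lambda x: 2.0 * x

    samples = []
    while len(samples) < 10_000:
        x = rng.uniform()                 # x ~ q
        u = rng.uniform()                 # u ~ U(0, 1)
        if u <= p(x) / (C * 1.0):         # accept with probability p(x) / (C q(x))
            samples.append(x)

    print(np.mean(samples))               # ~ 2/3, the mean of p(x) = 2x on [0, 1]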
Remarks on rejection sampling

The draws obtained from rejection sampling are exact.

The prerequisite of rejection sampling is the prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x)/(Cq(x)) is close to zero.
Sequential Importance Sampling I

A good proposal distribution is essential to the efficiency of importance sampling,

but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

$$q(x_{0:n}|y_{0:n}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$$

then the importance sampling can be performed recursively.
Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(x_{0:n}) = p(x_0)\,p(x_1|x_0)\cdots p(x_n|x_0, \cdots, x_{n-1})$$
$$q(x_{0:n}) = q_0(x_0)\,q_1(x_1|x_0)\cdots q_n(x_n|x_0, \cdots, x_{n-1})$$

The weights can be recursively calculated as

$$W_n(x_{0:n}) = \frac{p(x_{0:n})}{q(x_{0:n})} = W_{n-1}(x_{0:n-1})\,\frac{p(x_n|x_{0:n-1})}{q_n(x_n|x_{0:n-1})}$$
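A tiny numerical check (our construction, with a Gaussian random walk as p and a drifted random walk as q, both hypothetical) that the recursive weight equals the batch ratio by telescoping:

    import numpy as np

    rng = np.random.default_rng(0)
    log_norm = lambda d: -0.5 * d ** 2 - 0.5 * np.log(2 * np.pi)  # log N(d; 0, 1)

    # p: x_t = x_{t-1} + N(0,1);  q: x_t = x_{t-1} + 0.5 + N(0,1);  x_0 = 0 fixed.
    x = np.zeros(6)
    for t in range(1, 6):
        x[t] = x[t - 1] + rng.normal(0.5, 1.0)        # trajectory drawn from q

    logW = 0.0
    for t in range(1, 6):
        logW += log_norm(x[t] - x[t - 1]) - log_norm(x[t] - x[t - 1] - 0.5)

    inc = np.diff(x)                                  # batch ratio, same by telescoping
    print(np.isclose(logW, np.sum(log_norm(inc)) - np.sum(log_norm(inc - 0.5))))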
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many iid replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in an inaccurate estimate.

The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the $\tilde{W}(x^{(i)})$ will be nonzero.

We will see now that, in order to cope with this situation, a resampling step is suggested after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights.

Sample N random samples $\{x^{(i)}\}_{i=1}^N$ from q(x)
for i = 1, ..., N do
  $W^{(i)} \propto \frac{p(x^{(i)})}{q(x^{(i)})}$
end for
for i = 1, ..., N do
  Normalize weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^N W(x^{(j)})$
end for
Resample with replacement N times from the discrete set $\{x^{(i)}\}_{i=1}^N$, where the probability of resampling from each $x^{(i)}$ is proportional to $\tilde{W}(x^{(i)})$
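A compact sketch (ours, on an assumed Gaussian target and proposal) of SIR: importance weighting followed by multinomial resampling, after which the particles are approximately distributed as p:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50_000

    # Target p = N(2, 1), proposal q = N(0, 3^2); constants may be dropped.
    x = rng.normal(0.0, 3.0, N)
    log_p = -0.5 * (x - 2.0) ** 2
    log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0)
    w = np.exp(log_p - log_q)
    w /= w.sum()                                    # normalized importance weights

    resampled = x[rng.choice(N, size=N, p=w)]       # multinomial resampling
    print(resampled.mean(), resampled.std())        # ~ 2.0 and ~ 1.0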
Remarks on Sampling Importance Resampling

Resampling can be taken at every step or only when regarded as necessary:

Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights is over a threshold.

The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by new samples with equal importance weights (i.e. $W^{(i)} = 1/N$).

Resampling is important because
  if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
  when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Gibbs sampling

It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an $N_x$-dimensional state vector $\mathbf{x} = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.

Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for the estimation of hierarchical or structured probabilistic models.

Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
  for i = 1 to $N_x$ do
    Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
  end for
end for
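A sketch (our example) of Gibbs sampling for a bivariate Gaussian with correlation rho, where both full conditionals are 1D Gaussians:

    import numpy as np

    rng = np.random.default_rng(0)
    rho, M = 0.8, 20_000

    # Target: (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    # Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1, x2 = 0.0, 0.0
    samples = np.empty((M, 2))
    for n in range(M):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample p(x1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample p(x2 | x1)
        samples[n] = (x1, x2)

    print(np.corrcoef(samples[M // 10:].T)[0, 1])          # ~ 0.8 after burn-in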
Illustration of Gibbs sampling

Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Other sampling strategies

Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find the state with a higher probability.
Numerical approximations

Monte Carlo sampling approximation (i.e. particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be undertaken with the EKF method.

The Laplace approximation method approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$ and further computing the volume:

$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,\left|-\nabla\nabla \log f(\hat{x})\right|^{-1/2}$$

The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
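A 1D numerical sketch (ours, with an arbitrary unimodal f as the assumption) of the Laplace formula against a brute-force reference, showing it is cheap but only roughly right away from Gaussianity:

    import numpy as np

    # 1D Laplace: int f(x) dx ~ sqrt(2*pi) * f(x_hat) * |(log f)''(x_hat)|^(-1/2).
    f = lambda x: np.exp(-0.5 * x ** 2) / (1.0 + x ** 2)   # unimodal, mode at 0
    log_f = lambda x: np.log(f(x))

    x_hat, h = 0.0, 1e-4
    curv = (log_f(x_hat + h) - 2 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2
    laplace = np.sqrt(2 * np.pi) * f(x_hat) * np.abs(curv) ** -0.5

    xs = np.linspace(-10.0, 10.0, 200_001)                 # brute-force reference
    print(laplace, np.sum(f(xs)) * (xs[1] - xs[0]))        # ~1.45 vs ~1.64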
Iterative Quadrature

A numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^m c_k\,f(x_k)$$

where p(x) is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function p(x) in the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
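For a Gaussian weighting function the nodes and weights are the Gauss-Hermite ones; a sketch (ours) using NumPy's built-in rule, with the change of variables x = sqrt(2) t that maps the Hermite weight exp(-t^2) to N(0, 1):

    import numpy as np

    nodes, weights = np.polynomial.hermite.hermgauss(10)   # 10-point rule
    f = lambda x: np.cos(x)

    approx = np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)
    print(approx, np.exp(-0.5))    # E[cos(x)] under N(0,1) is exp(-1/2)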
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states, and then a grid-based method can be further used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly so as to give greater resolution to the states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_{\mathcal{X}} x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N |x^{(i)}|^k$$

where $m_k$ denotes the k-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(x) = \sum_{j=1}^m c_j\,\mathcal{N}(\hat{x}_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^m c_j = 1$.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate.
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$$

where the $x_n^{(i)}$ are assumed to be iid samples drawn from $p(x_n|y_{0:n})$.

With this approximation we can estimate the mean of a nonlinear function:

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N}\sum_{i=1}^N \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N}\sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)$$
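The sifting property of the delta function makes this estimate just a sample average; a minimal sketch (ours, pretending some Gaussian draws are posterior samples):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 2.0, 10_000)     # stand-ins for iid posterior samples
    f = lambda x: np.sin(x)
    print(np.mean(f(x)))                 # Monte Carlo estimate of E[f(x_n)]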
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Types of Monte Carlo sampling
Importance sampling (IS)
Rejection sampling
Sequential importance sampling
Sampling-importance resampling
Stratified sampling
Markov chain Monte Carlo (MCMC) Metropolis-Hastings and Gibbssampling
Hybrid Monte Carlo (HMC)
Quasi-Monte Carlo (QMC)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 30 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n
         = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
         = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
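A minimal sketch of this plug-in estimator, assuming for illustration that the posterior is a standard normal we can sample from and f is a toy nonlinearity.

import numpy as np

# Minimal sketch: approximate E[f(x_n)] by an average over i.i.d. samples
# assumed to come from the posterior p(x_n | y_{n:0}). A standard normal
# stands in for the posterior here.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10_000)   # x^(i) ~ p(x_n | y_{n:0})

f = lambda x: np.sin(x) ** 2            # nonlinear function of interest
f_hat = np.mean(f(x))                   # (1/N) sum f(x^(i))
print(f_hat)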
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
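A minimal sketch of the resulting self-normalized importance-sampling estimator; the unnormalized target p and the Gaussian proposal q are illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Minimal sketch: self-normalized importance sampling. The target p is a
# toy non-Gaussian density known only up to a constant; q is a wide
# Gaussian proposal we can sample from.
rng = np.random.default_rng(2)
N = 50_000

p_unnorm = lambda x: np.exp(-0.5 * x**2) * (1.0 + np.sin(3 * x) ** 2)
q = norm(loc=0.0, scale=2.0)

x = q.rvs(size=N, random_state=rng)
W = p_unnorm(x) / q.pdf(x)               # unnormalized importance weights
W_tilde = W / W.sum()                     # normalized weights

f = lambda x: x**2
print(np.sum(W_tilde * f(x)))             # estimate of E_p[f]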
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in a resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
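A minimal sketch of the N_eff approximation; the weight vector is a toy example.

import numpy as np

# Minimal sketch: effective sample size from normalized importance weights.
def effective_sample_size(W_tilde):
    """N_eff ~ 1 / sum_i W_tilde_i^2 for normalized weights W_tilde."""
    return 1.0 / np.sum(W_tilde ** 2)

W = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # skewed weights
print(effective_sample_size(W))              # ~1.9, far below N = 5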
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw a sample x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate the weight W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
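To make the algorithm concrete, here is a minimal Python sketch for a 1-D linear-Gaussian toy model; all model parameters are illustrative assumptions. It uses the transition prior as the proposal, so the incremental weight reduces to the likelihood p(y_n | x_n^{(i)}).

import numpy as np

# Minimal sketch of SIS with resampling for the toy model
# x_n = 0.9 x_{n-1} + v_n,  y_n = x_n + w_n (illustrative parameters).
rng = np.random.default_rng(3)
T, N, P_thresh = 50, 500, 250
q_std, r_std = 1.0, 0.5

# Simulate a ground-truth trajectory and observations.
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + q_std * rng.normal()
y = x_true + r_std * rng.normal(size=T)

particles = rng.normal(0.0, 1.0, size=N)    # x_0^(i) ~ p(x_0)
W = np.full(N, 1.0 / N)
estimates = []

for n in range(T):
    # Propagate through the proposal (here: the transition prior).
    particles = 0.9 * particles + q_std * rng.normal(size=N)
    # Weight update: W_n ∝ W_{n-1} * p(y_n | x_n^(i)).
    W = W * np.exp(-0.5 * ((y[n] - particles) / r_std) ** 2)
    W = W / W.sum()
    estimates.append(np.sum(W * particles))  # posterior-mean estimate
    # Resample only when the effective sample size drops below P.
    if 1.0 / np.sum(W ** 2) < P_thresh:
        idx = rng.choice(N, size=N, p=W)
        particles = particles[idx]
        W = np.full(N, 1.0 / N)

print(np.mean((np.array(estimates) - x_true) ** 2))  # rough tracking MSE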
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: CONDENSATION
Importance Sampling I

Sample the distribution in the region of importance in order to achieve computational efficiency.

This is important in high-dimensional spaces, where the data is sparse and the region of interest in which the target lies is relatively small.

The idea is to choose a proposal distribution q(x) in place of the true probability distribution p(x), which is hard to sample from:

\int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx

Figure: Importance sampling (Chen 03)

Monte Carlo importance sampling uses N independent samples drawn from q(x) to approximate

\hat{f} = \frac{1}{N} \sum_{i=1}^{N} W(x^{(i)}) f(x^{(i)})

where W(x^{(i)}) = p(x^{(i)}) / q(x^{(i)}) are called the importance weights.
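A minimal sketch of plain Monte Carlo importance sampling with a fully normalized target; the standard-normal target, wide Gaussian proposal, and test function are illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Minimal sketch: importance sampling with a normalized target p and a
# heavier-tailed Gaussian proposal q.
rng = np.random.default_rng(4)
N = 20_000

p = norm(0.0, 1.0)
q = norm(0.0, 3.0)                        # heavy-tailed proposal

x = q.rvs(size=N, random_state=rng)       # x^(i) ~ q
W = p.pdf(x) / q.pdf(x)                   # importance weights
f_hat = np.mean(W * x**2)                 # estimates E_p[x^2] = 1
print(f_hat)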
Importance Sampling II
If the normalizing factor of p(x) is not known, the importance weights can only be evaluated up to a normalizing constant.

To ensure this, the importance weights are normalized:

\hat{f} = \sum_{i=1}^{N} \tilde{W}(x^{(i)}) f(x^{(i)}), with \tilde{W}(x^{(i)}) = \frac{W(x^{(i)})}{\sum_{j=1}^{N} W(x^{(j)})}

The variance of the estimate is given by

Var[\hat{f}] = \frac{1}{N} Var[f(x) W(x)] = \frac{1}{N} Var\left[f(x) \frac{p(x)}{q(x)}\right] = \frac{1}{N} \int \frac{(f(x) p(x))^2}{q(x)} dx - \frac{(E[f(x)])^2}{N}

The variance can be reduced when q(x) is chosen to:
match the shape of p(x), so as to approximate the true variance;
match the shape of |f(x)| p(x), so as to further reduce the true variance.

The estimator is biased but consistent.
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).

It can be used when it is difficult to sample from the true probability distribution directly.

The proposal distribution q(x) should have a heavy tail so as to be insensitive to outliers.

If q(·) is not close to p(·), the weights are very uneven; many samples are then almost useless because of their negligible contributions.

In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.

The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Rejection sampling

Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.

Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X. The sampling proceeds as follows:

for n = 1 to N do
  Sample u ~ U(0, 1)
  Sample x ~ q(x)
  if u > p(x) / (C q(x)) then
    Reject x and repeat the sampling
  end if
end for

Figure: Importance (left) and rejection (right) sampling (Chen 03)

The acceptance probability for a random variable is inversely proportional to the constant C.

The choice of C is critical: if C is too small, the samples are not reliable because the bound p(x) < C q(x) may fail; if C is too large, the sampling is inefficient since the acceptance rate will be low.

If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
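A minimal sketch of this accept/reject loop, assuming an illustrative bimodal target p, a Gaussian proposal q, and a constant C chosen (by assumption) to satisfy p(x) < C q(x).

import numpy as np
from scipy.stats import norm

# Minimal sketch: rejection sampling from a bimodal target p using a
# wide Gaussian proposal q and a known bound C with p(x) < C q(x).
rng = np.random.default_rng(5)
M = 50_000

p = lambda x: 0.5 * norm.pdf(x, -1.5, 0.6) + 0.5 * norm.pdf(x, 1.5, 0.6)
q = norm(0.0, 2.0)
C = 3.0                                   # assumed to satisfy p(x) < C q(x)

x = q.rvs(size=M, random_state=rng)
u = rng.random(M)
accepted = x[u <= p(x) / (C * q.pdf(x))]  # accept with probability p/(Cq)
print(len(accepted), accepted.mean(), accepted.std())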
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.

It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.

A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.

If the proposal distribution is chosen in a factorized form

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1 | x_0) ··· p(x_n | x_0, ..., x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1 | x_0) ··· q_n(x_n | x_0, ..., x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n | x_{n-1:0})}{q_n(x_n | x_{n-1:0})}
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: this is the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.

We will now see that, to cope with this situation, a resampling step is used after weight normalization.
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights:

Sample N random samples {x^{(i)}}_{i=1}^{N} from q(x)
for i = 1, ..., N do
  Compute W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^{N}, where the probability of resampling each x^{(i)} is proportional to \tilde{W}(x^{(i)})
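A minimal sketch of this SIR procedure, reusing an illustrative bimodal target and Gaussian proposal; after resampling, the particle set is approximately distributed according to p.

import numpy as np
from scipy.stats import norm

# Minimal sketch: sampling-importance-resampling (SIR). Draw from q,
# weight by p/q, then resample with replacement by the normalized weights.
rng = np.random.default_rng(6)
N = 20_000

p = lambda x: 0.5 * norm.pdf(x, -1.5, 0.6) + 0.5 * norm.pdf(x, 1.5, 0.6)
q = norm(0.0, 2.0)

x = q.rvs(size=N, random_state=rng)
W = p(x) / q.pdf(x)
W_tilde = W / W.sum()
resampled = rng.choice(x, size=N, replace=True, p=W_tilde)
print(resampled.mean(), resampled.std())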
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only when regarded as necessary:
Deterministic resampling is taken at every k-th time step (usually k = 1).
Dynamic resampling is taken only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).

Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
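One common choice, shown as a minimal sketch, is systematic resampling; it is an illustrative example of such a method, not the one prescribed in the lecture.

import numpy as np

# Minimal sketch: systematic resampling, a common alternative to plain
# multinomial resampling (lower Monte Carlo variance, O(N) time).
def systematic_resample(weights, rng):
    """Return indices of resampled particles given normalized weights."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N   # one shared uniform offset
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(7)
W = np.array([0.5, 0.3, 0.1, 0.1])
print(systematic_resample(W, rng))   # e.g. [0 0 1 3]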
Gibbs sampling
It's a particular type of Markov chain Monte Carlo (MCMC) sampling.

The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.

Given an N_x-dimensional state vector x = [x_1, x_2, ..., x_{N_x}]^T, we are interested in drawing samples from the marginal density when the joint density is inaccessible or hard to sample from.

Since each conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
  end for
end for
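A minimal sketch of a Gibbs sampler for a bivariate Gaussian with correlation rho, where both conditionals are available in closed form (an illustrative toy case).

import numpy as np

# Minimal sketch: Gibbs sampling for a bivariate Gaussian, alternating
# draws from the exact conditionals p(x1 | x2) and p(x2 | x1).
rng = np.random.default_rng(8)
rho, M = 0.8, 5_000
x1, x2 = 0.0, 0.0
chain = []

for n in range(M):
    # p(x1 | x2) = N(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal()
    chain.append((x1, x2))

chain = np.array(chain)
print(np.corrcoef(chain.T)[0, 1])   # should approach rho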
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four iteration steps in the probability space (contour).
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variances) across the subregions that divide the whole space.

Stratified sampling works very well and is efficient in spaces of not-too-high dimension.

Hybrid Monte Carlo: a Metropolis method that uses gradient information to reduce random-walk behavior.

This is good since the gradient direction might indicate the way to find states with higher probability.
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.

By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.

The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x) and computing the volume

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}.

It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.

It works for unimodal distributions, but produces poor approximations for multimodal distributions, especially in high-dimensional spaces.
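A minimal 1-D sketch of the Laplace approximation, using a toy f whose true integral is known for comparison; the optimizer call and the closed-form Hessian are illustrative conveniences.

import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch: Laplace approximation of Z = \int f(x) dx in 1-D,
# fitting a Gaussian at the mode of f. Here f is an unnormalized
# Gaussian bump, so the exact answer 0.3 * sqrt(2*pi) is known.
f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3**2)
neg_log_f = lambda x: 0.5 * (x - 1.0) ** 2 / 0.3**2

x_hat = minimize_scalar(neg_log_f).x       # mode of f
h = 1.0 / 0.3**2                           # second derivative of -log f at the mode
Z_laplace = np.sqrt(2 * np.pi) * f(x_hat) / np.sqrt(h)
print(Z_laplace, 0.3 * np.sqrt(2 * np.pi))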
Iterative Quadrature
A numerical approximation method which is widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point.

The values x_k are determined by the weighting function p(x) on the interval [a, b].

This method can produce a good approximation if the nonlinear function is smooth.
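A minimal sketch using Gauss-Hermite quadrature, one standard instance of such a formula, to approximate a Gaussian expectation; the test function is an illustrative choice.

import numpy as np

# Minimal sketch: Gauss-Hermite quadrature to approximate the Gaussian
# expectation E[f(x)] = \int f(x) N(x; 0, 1) dx by a weighted sum.
nodes, weights = np.polynomial.hermite_e.hermegauss(10)  # probabilists' variant
c = weights / np.sqrt(2 * np.pi)          # normalize for the N(0, 1) weight

f = lambda x: np.cos(x)
approx = np.sum(c * f(nodes))
print(approx, np.exp(-0.5))               # exact value is E[cos(x)] = e^{-1/2}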
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n | y_{n:0}).

If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so no extra resolution can be given to the regions of high density.

The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses that carry the information about the data.
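A minimal sketch of a single point-mass (grid) Bayes update in 1-D; the Gaussian prior and likelihood are illustrative assumptions with a known closed-form answer for checking.

import numpy as np

# Minimal sketch: one Bayes update on a fixed rectangular grid
# (point-mass approximation) with toy Gaussian prior and likelihood.
x = np.linspace(-5, 5, 201)               # grid cell centers
dx = x[1] - x[0]

prior = np.exp(-0.5 * x**2)               # unnormalized N(0, 1) prior
prior /= prior.sum() * dx

y_obs, r = 1.2, 0.5                        # observation and noise std
lik = np.exp(-0.5 * ((y_obs - x) / r) ** 2)  # p(y | x) on the grid

posterior = prior * lik                    # pointwise Bayes rule
posterior /= posterior.sum() * dx
print((x * posterior).sum() * dx)          # ~0.96 = y/(1 + r^2) in this Gaussian case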
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Figure Importance sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling I
Sample the distribution in the region of importance in order to achievecomputational efficiency
This is important for the high-dimensional space where the data is sparseand the region of interest where the target lies in is relatively small
The idea is to choose a proposal distribution q(x) in place of the trueprobability distribution p(x) which is hard-to-sampleint
f (x)p(x)dx =
intf (x)
p(x)
q(x)q(x)dx
Monte Carlo importance sampling uses N independent samples drawn fromq(x) to approximate
f =1
N
Nsumi=1
W (x(i))f (x(i))
where W (x(i)) = p(x(i))q(x(i)) are called the importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 31 69
Importance Sampling II
If the normalizing factor of p(x) is not known the importance weights canbe only evaluated up to a normalizing constant
To ensure that we importance weights are normalized
f =Nsum
i=1
W (x(i))f (x(i)) with W (x(i)) =W (x(i))sumN
i=1 W (x(i))
The variance of the estimate is given by
Var[f ] =1
NVar[f (x)W (x)] =
1
NVar[f (x)
p(x)
q(x)]
=1
N
int (f (x)p(x)
q(x)
)2
dxminus (E [f (x)])2
N
The variance can be reduced when q(x) is chosen to
match the shape of p(x) so as to approximate the true variancematch the shape of |f (x)|p(x) so as to further reduce the true variance
The estimator is biased but consistent
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 32 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possiblyeven less than the true variance)
it can be used when encountering the difficulty to sample from the trueprobability distribution directly
The proposal distribution q(x) should have a heavy tail so as to beinsensitive to the outliers
If q(middot) is not close to p(middot) the weights are very uneven thus many samplesare almost useless because of their negligible contributions
In a high-dimensional space the importance sampling estimate is likelydominated by a few samples with large importance weights
Importance sampler can be mixed with Gibbs sampling orMetropolis-Hastings algorithm to produce more efficient techniques
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method which has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\, f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the $k$-th quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
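One standard instance of such a rule is Gauss-Hermite quadrature; the sketch below uses it to estimate $E[f(x)]$ under a standard normal weighting function, with a hypothetical nonlinearity `f`.

import numpy as np

# Gauss-Hermite nodes/weights: integral of g(x) exp(-x^2) dx ~ sum c_k g(x_k).
x_k, c_k = np.polynomial.hermite.hermgauss(10)

# The change of variables x -> sqrt(2) x turns the weight into the N(0,1) pdf.
f = lambda x: np.tanh(x)**2                       # hypothetical smooth nonlinearity
estimate = np.sum(c_k * f(np.sqrt(2.0) * x_k)) / np.sqrt(np.pi)

# Monte Carlo check of E[f(x)] under N(0,1).
mc = f(np.random.default_rng(0).standard_normal(200000)).mean()
print(estimate, mc)                               # should agree to ~3 decimals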
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{n:0})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so no extra resolution is given to states with high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
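A toy sketch of one predict/update cycle of a point-mass (grid) filter on a one-dimensional state, assuming a hypothetical Gaussian random-walk transition and Gaussian likelihood; masses are renormalized after each step.

import numpy as np

gauss = lambda z, s: np.exp(-z**2 / (2 * s**2))

# Rectangular grid of cell states and a point-mass prior over it.
grid = np.linspace(-5.0, 5.0, 201)
p = gauss(grid, 1.0); p /= p.sum()                 # prior masses at time n-1

# Predict: push the masses through the transition kernel p(x_n | x_{n-1}).
K = gauss(grid[:, None] - grid[None, :], 0.5)      # kernel evaluated cell-to-cell
p_pred = K @ p; p_pred /= p_pred.sum()

# Update: reweight each cell by the likelihood of the new observation.
y_n = 1.2                                          # hypothetical measurement
p_post = p_pred * gauss(y_n - grid, 1.0)
p_post /= p_post.sum()

print(grid @ p_post)                               # posterior mean on the grid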
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

$$m_k = E[x^k] = \int_X x^k\, p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
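A quick check of the sample-moment estimator, assuming a standard normal target whose true moments are known.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200000)      # samples x^(i) from the true distribution

for k in (1, 2, 3, 4):
    m_k = np.mean(x**k)              # sample k-th moment
    print(k, round(m_k, 3))          # N(0,1) moments are 0, 1, 0, 3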
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

$$p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(\bar{x}_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
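A short sketch of the "tractable moments" point: for a mixture with hypothetical, hand-picked parameters, the mean and variance follow in closed form from the component moments, which we verify by sampling.

import numpy as np

# Gaussian sum p(x) = sum_j c_j N(x; mu_j, s_j^2), with c_j > 0 summing to 1.
c  = np.array([0.3, 0.7])            # hypothetical mixture weights
mu = np.array([-2.0, 1.5])           # component means
s  = np.array([0.7, 1.0])            # component standard deviations

# Closed-form first and second moments of the mixture.
mean = np.sum(c * mu)
var  = np.sum(c * (s**2 + mu**2)) - mean**2

# Monte Carlo check: pick a component, then sample from it.
rng = np.random.default_rng(0)
j = rng.choice(len(c), size=200000, p=c)
x = rng.normal(mu[j], s[j])
print((mean, var))
print((x.mean(), x.var()))           # should agree closely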
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, and the particles are placed in them according to some probability measure; the higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution

$$p(x_n|y_{n:0}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta\big(x_n - x_n^{(i)}\big) \equiv \hat{p}(x_n|y_{n:0})$$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{n:0})$.

By this approximation, we can estimate the mean of a nonlinear function

$$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{n:0})\,dx_n = \frac{1}{N}\sum_{i=1}^{N}\int f(x_n)\,\delta\big(x_n - x_n^{(i)}\big)\,dx_n = \frac{1}{N}\sum_{i=1}^{N} f\big(x_n^{(i)}\big) \equiv \hat{f}_N(x)$$
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior; it is common to sample from the so-called proposal distribution $q(x_n|y_{n:0})$. Let's define

$$W_n(x_n) = \frac{p(y_{n:0}|x_n)\,p(x_n)}{q(x_n|y_{n:0})}$$

We can then write

$$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})}\,q(x_n|y_{n:0})\,dx_n = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{n:0})}\,q(x_n|y_{n:0})\,dx_n$$

$$= \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int p(y_{n:0}|x_n)\,p(x_n)\,dx_n} = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{n:0})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{n:0})\,dx_n} = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$
Sequential Monte Carlo estimation III
We have written

$$E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}$$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{n:0})$ we can approximate

$$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big)}{\frac{1}{N}\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)} = \sum_{i=1}^{N} \tilde{W}\big(x_n^{(i)}\big)\,f\big(x_n^{(i)}\big) \equiv \hat{f}(x)$$

where the normalized weights are defined as

$$\tilde{W}\big(x_n^{(i)}\big) = \frac{W_n\big(x_n^{(i)}\big)}{\sum_{i=1}^{N} W_n\big(x_n^{(i)}\big)}$$
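A compact numerical sketch of this self-normalized estimate for a single time step, assuming a hypothetical scalar model: prior $x \sim \mathcal{N}(0,1)$, likelihood $y = x + \mathcal{N}(0,1)$ noise, and a broad Gaussian proposal. The exact posterior mean here is $y/2$, which the estimate should recover.

import numpy as np

rng = np.random.default_rng(0)
N, y = 100000, 1.0

# Unnormalized target p(y|x) p(x): N(0,1) prior times N(y; x, 1) likelihood.
unnorm_post = lambda x: np.exp(-x**2 / 2) * np.exp(-(y - x)**2 / 2)

# Proposal q(x): a broad Gaussian we can sample from directly.
q_sigma = 2.0
x = rng.normal(0.0, q_sigma, N)
q_pdf = np.exp(-x**2 / (2 * q_sigma**2)) / (q_sigma * np.sqrt(2 * np.pi))

W = unnorm_post(x) / q_pdf           # unnormalized importance weights W_n
W_tilde = W / W.sum()                # normalized weights
print(np.sum(W_tilde * x))           # close to 0.5, the exact posterior mean y/2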
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as

$$q(x_{n:0}|y_{n:0}) = q(x_0)\prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})$$

As before, the posterior can be written as

$$p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0})\,\frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}$$

We can then create a recursive rule to update the weights

$$W_n^{(i)} = \frac{p\big(x_{n:0}^{(i)}|y_{n:0}\big)}{q\big(x_{n:0}^{(i)}|y_{n:0}\big)} \propto \frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)\,p\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)\,q\big(x_{n-1:0}^{(i)}|y_{n-1:0}\big)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big)}$$
Types of filters
Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter

Sampling Importance Resampling (SIR) filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate $p(x_n|y_{n:0})$ than in $p(x_{n:0}|y_{n:0})$.

Let's assume that $q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}\big) = q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size

$$N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot|y_{n:0})}\big[(W(x_{n:0}))^2\big]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2}$$

When $\hat{N}_{\mathrm{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
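The effective-sample-size diagnostic is two lines of code; the sketch below assumes `W` holds unnormalized importance weights.

import numpy as np

def effective_sample_size(W):
    """N_eff ~ 1 / sum_i of squared normalized importance weights."""
    W_tilde = W / W.sum()
    return 1.0 / np.sum(W_tilde**2)

print(effective_sample_size(np.ones(100)))                    # even weights: 100
print(effective_sample_size(np.array([1.0] + [1e-6] * 99)))   # skewed: close to 1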
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples $x_n^{(i)} \sim q\big(x_n|x_{n-1:0}^{(i)}, y_{n:0}\big)$
        Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate weights $W_n^{(i)} = W_{n-1}^{(i)}\, p\big(y_n|x_n^{(i)}\big)\, p\big(x_n^{(i)}|x_{n-1}^{(i)}\big) \,/\, q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)$
    end for
    for i = 1, ..., N do
        Normalize the weights $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Compute $\hat{N}_{\mathrm{eff}} = 1 / \sum_{i=1}^{N}\big(\tilde{W}(x_{n:0}^{(i)})\big)^2$
    if $\hat{N}_{\mathrm{eff}} < P$ then
        Generate new $x_n^{(j)}$ by resampling with replacement N times from $\{x_{n:0}^{(i)}\}$ with probability $P\big(x_{n:0}^{(j)} = x_{n:0}^{(i)}\big) = \tilde{W}_{n:0}^{(i)}$
        Reset the weights $W_n^{(i)} = 1/N$
    end if
end for
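A runnable sketch of this loop in Python for a hypothetical one-dimensional model (random-walk state, Gaussian observations), using the transition prior as proposal so that the weight update reduces to $W_{n-1}^{(i)}\,p(y_n|x_n^{(i)})$. All model parameters are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
T, N, P = 50, 500, 250                      # time steps, particles, N_eff threshold
sig_x, sig_y = 0.3, 0.5                     # hypothetical process/observation noise

# Simulate a ground-truth trajectory and its observations.
x_true = np.cumsum(sig_x * rng.standard_normal(T))
y = x_true + sig_y * rng.standard_normal(T)

x = rng.standard_normal(N)                  # particles x_0^(i)
W = np.full(N, 1.0 / N)                     # initial weights
for n in range(T):
    x = x + sig_x * rng.standard_normal(N)  # proposal = transition prior
    W = W * np.exp(-(y[n] - x)**2 / (2 * sig_y**2))   # weight update p(y_n|x_n)
    W /= W.sum()                            # normalize
    if 1.0 / np.sum(W**2) < P:              # N_eff below threshold: resample
        x = x[rng.choice(N, size=N, p=W)]   # with replacement, prob. prop. to weights
        W = np.full(N, 1.0 / N)             # reset weights to 1/N
print("estimate:", np.sum(W * x), "  truth:", x_true[-1])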
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p\big(x_n|x_{n-1}^{(i)}\big)$
    end for
    Set $x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p\big(y_n|x_n^{(i)}\big)$
    end for
    for i = 1, ..., N do
        Normalize weights: $\tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^{N} W(x^{(i)})$
    end for
    Resampling: generate N new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
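The same toy model as in the SIS sketch above, but run as a bootstrap/SIR filter: the weights are just the normalized likelihood and resampling happens at every step. Again, the model parameters are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
T, N = 50, 500
sig_x, sig_y = 0.3, 0.5                      # hypothetical noise levels
x_true = np.cumsum(sig_x * rng.standard_normal(T))
y = x_true + sig_y * rng.standard_normal(T)

x = rng.standard_normal(N)                   # x_0^(i) ~ p(x_0), W_0^(i) = 1/N
for n in range(T):
    x = x + sig_x * rng.standard_normal(N)   # importance sampling from p(x_n|x_{n-1})
    W = np.exp(-(y[n] - x)**2 / (2 * sig_y**2))   # weight update: p(y_n|x_n)
    W /= W.sum()                             # normalize
    x = x[rng.choice(N, size=N, p=W)]        # resample N new particles every step
print("bootstrap estimate:", x.mean(), "  truth:", x_true[-1])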
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially, and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies exist.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p\big(y_n|x_n^{(i)}\big)\,p\big(x_n^{(i)}|x_{n-1}^{(i)}\big)}{q\big(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n\big)}$$

where $0 < \alpha < 1$ is the annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: CONDENSATION.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on importance sampling
It provides an elegant way to reduce the variance of the estimator (possibly even below the true variance).
It can be used when it is difficult to sample from the true probability distribution directly.
The proposal distribution q(x) should have a heavy tail, so as to be insensitive to outliers.
If q(·) is not close to p(·), the weights are very uneven; thus many samples are almost useless because of their negligible contributions.
In a high-dimensional space, the importance sampling estimate is likely dominated by a few samples with large importance weights.
The importance sampler can be mixed with Gibbs sampling or the Metropolis-Hastings algorithm to produce more efficient techniques.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 33 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound of the underlying distribution or density.
Assume there exists a known constant C < \infty such that p(x) < C q(x) for every x \in X; the sampling proceeds as follows:
for n = 1 to N do
  Sample u \sim U(0, 1)
  Sample x \sim q(x)
  if u > p(x) / (C q(x)) then
    Repeat sampling
  end if
end for
Figure Importance (left) and Rejection (right) sampling (Chen 03)
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical: if C is too small, the samples are not reliable because of the low rejection rate; if C is too large, sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), the likelihood p(y|x) \le C, and C is known, then

p(x|y) = \frac{p(y|x) p(x)}{p(y)} \le \frac{C q(x)}{p(y)} \equiv C' q(x)

and the acceptance rate for a sample x is \frac{p(x|y)}{C' q(x)} = \frac{p(y|x)}{C}.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
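To make the procedure concrete, here is a minimal Python sketch of rejection sampling; the target p (a bimodal Gaussian mixture), the proposal q (a wide Gaussian), and the bound C = 3 are illustrative assumptions, not part of the original slides.

import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Assumed target density: bimodal mixture of N(-2, 1) and N(2, 1).
    return 0.5 * np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x + 2.0)**2) / np.sqrt(2 * np.pi)

def q(x):
    # Assumed proposal density: N(0, 3^2), wide enough to cover p.
    return np.exp(-0.5 * (x / 3.0)**2) / (3.0 * np.sqrt(2 * np.pi))

C = 3.0  # chosen (and checked numerically) so that p(x) < C q(x) everywhere

def rejection_sample(n_samples):
    samples = []
    while len(samples) < n_samples:
        x = 3.0 * rng.standard_normal()     # x ~ q
        u = rng.uniform()                   # u ~ U(0, 1)
        if u <= p(x) / (C * q(x)):          # accept with probability p / (C q)
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(5000)  # exact draws from p; the average acceptance rate is 1/C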
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get the samples when the ratio p(x) / (C q(x)) is close to zero.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

then the importance sampling can be performed recursively.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero.
We will now see that, to cope with this situation, a resampling step is used after weight normalization.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.
The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with large weights.
Sample N random samples \{x^{(i)}\}_{i=1}^N from q(x)
for i = 1, \cdots, N do
  W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, \cdots, N do
  Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
end for
Resample with replacement N times from the discrete set \{x^{(i)}\}_{i=1}^N, where the probability of resampling from each x^{(i)} is proportional to \tilde{W}(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
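A minimal Python sketch of this SIR procedure, under illustrative assumptions (target p = N(0,1), proposal q = N(0, 2^2), multinomial resampling); the densities and names are examples, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
N = 5000

x = 2.0 * rng.standard_normal(N)      # x^{(i)} ~ q = N(0, 2^2)
# Unnormalized log-weights log p - log q (additive constants cancel after normalizing).
log_w = -0.5 * x**2 + 0.5 * (x / 2.0)**2
w = np.exp(log_w - log_w.max())       # subtract the max for numerical stability
w /= w.sum()                          # normalized weights W~(x^{(i)})

# Resample with replacement; each x^{(i)} is selected with probability W~(x^{(i)}).
idx = rng.choice(N, size=N, replace=True, p=w)
x_new = x[idx]                        # equally-weighted samples, approximately ~ p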
Remarks on Sampling Importance Resampling
Resampling can be performed at every step, or only when regarded as necessary.
Deterministic resampling is performed at every k-th time step (usually k = 1). Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights \{x^{(i)}, W^{(i)}\} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N).
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power; when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a good solution for the estimation of hierarchical or structured probabilistic models.
Draw a sample x_0 \sim p(x_0)
for n = 1 to M do
  for i = 1 to N_x do
    Draw a sample x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})
  end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
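A minimal Python sketch of the two-dimensional case in the figure; as an illustrative assumption, the target is a bivariate Gaussian with unit marginals and correlation rho, whose two conditionals are available in closed form.

import numpy as np

rng = np.random.default_rng(0)
rho, M = 0.8, 5000                 # assumed correlation and number of Gibbs sweeps

x1, x2 = 0.0, 0.0                  # arbitrary starting state
samples = np.empty((M, 2))
for n in range(M):
    # Draw x1 from the conditional p(x1 | x2) = N(rho * x2, 1 - rho^2).
    x1 = rho * x2 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    # Draw x2 from the conditional p(x2 | x1) = N(rho * x1, 1 - rho^2).
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    samples[n] = (x1, x2)
# After discarding a burn-in prefix, `samples` are (correlated) draws from the joint density.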
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions that divide the whole space.
Stratified sampling works very well, and is efficient, in a space of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is good, since the gradient direction might indicate the way to find the state with a higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
The Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method.
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume:

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, | -\nabla^2 \log f(\hat{x}) |^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of \log f(x) at \hat{x}.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
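A numerical sketch of the Laplace approximation for a one-dimensional integrand; the Gamma-like example is an assumption for illustration, the mode is found with scipy, and the second derivative of log f is taken by finite differences.

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x**3 * np.exp(-x)          # assumed example; the exact integral is Gamma(4) = 6
log_f = lambda x: 3.0 * np.log(x) - x

# Fit the Gaussian at the maximum x_hat of f (here the mode is x = 3).
x_hat = minimize_scalar(lambda x: -log_f(x), bounds=(1e-6, 50.0), method="bounded").x

# Second derivative of log f at x_hat by central finite differences.
h = 1e-4
d2 = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h**2

# Laplace estimate with N_x = 1: (2 pi)^{1/2} f(x_hat) |d2|^{-1/2}, about 5.84 here.
volume = np.sqrt(2.0 * np.pi) * f(x_hat) * abs(d2)**-0.5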
Iterative Quadrature
A numerical approximation method which has been widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the quadrature point.
The values x_k are determined by the weighting function p(x) in the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
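As an illustration, Gauss-Hermite quadrature takes p(x) to be a Gaussian weighting function; the sketch below (an assumed example, computing E[x^2] under N(0,1) with numpy's built-in nodes and weights) recovers the exact moment because the integrand is smooth.

import numpy as np
from numpy.polynomial.hermite import hermgauss

m = 20
xk, ck = hermgauss(m)          # quadrature points x_k and coefficients c_k for the weight exp(-x^2)

f = lambda x: x**2             # assumed smooth integrand
# E[f(x)] under N(0,1), after the change of variables x -> sqrt(2) x:
approx = np.sum(ck * f(np.sqrt(2.0) * xk)) / np.sqrt(np.pi)   # equals 1.0, the exact second moment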
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0}).
If the state space is continuous, we can always discretize it into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly to give greater resolution to the regions with high density.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation targets the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^N |x^{(i)}|^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = \sum_{j=1}^m c_j \mathcal{N}(\bar{x}_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture densities.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
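A small sketch of evaluating such a Gaussian sum in one dimension; the three components and their weights are illustrative assumptions chosen to produce a skewed, non-Gaussian shape.

import numpy as np

def gaussian_sum_pdf(x, c, mu, sigma):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1.
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))
    return comp @ c

c = np.array([0.5, 0.3, 0.2])          # assumed mixture weights (sum to 1)
mu = np.array([0.0, 1.5, 3.0])         # assumed component means
sigma = np.array([0.8, 1.0, 1.5])      # assumed component standard deviations
pdf = gaussian_sum_pdf(np.linspace(-4, 8, 200), c, mu, sigma)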
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen?
We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).
By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
         = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
         = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{i=1}^N W_n(x_n^{(i)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
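The self-normalized estimator in a few lines of Python; the toy setup (posterior p = N(1,1), proposal q = N(0, 2^2), f(x) = x, so E[f] = 1) is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
N = 10000

x = 2.0 * rng.standard_normal(N)                  # x^{(i)} ~ q = N(0, 2^2)
log_W = -0.5 * (x - 1.0)**2 + 0.5 * (x / 2.0)**2  # log(p/q) up to an additive constant
W = np.exp(log_W - log_W.max())                   # unnormalized weights, numerically stable
W_tilde = W / W.sum()                             # normalized weights

f_hat = np.sum(W_tilde * x)                       # self-normalized estimate of E[x], close to 1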
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).
Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
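In code, the approximate effective sample size is one line on the normalized weights; the threshold P = N/2 in the comment is a common heuristic, not a prescription from the slides.

import numpy as np

def effective_sample_size(w_tilde):
    # N_eff ~= 1 / sum_i (W~^{(i)})^2, computed from the normalized weights.
    w_tilde = np.asarray(w_tilde)
    return 1.0 / np.sum(w_tilde**2)

# Typical use inside a filter loop: resample when effective_sample_size(w) < P (e.g. P = N / 2).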
SIS particle filter with resampling
for n = 0, \cdots, T do
  for i = 1, \cdots, N do
    Draw samples x_n^{(i)} \sim q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
  end for
  for i = 1, \cdots, N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}
  end for
  for i = 1, \cdots, N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from \{x_{n:0}^{(i)}\} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}(x_{n:0}^{(i)})
    Reset the weights: W_n^{(i)} = 1/N
  end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, \cdots, N do
  Sample x_0^{(i)} \sim p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, \cdots, T do
  for i = 1, \cdots, N do
    Importance sampling: x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = \{x_{n-1:0}^{(i)}, x_n^{(i)}\}
  for i = 1, \cdots, N do
    Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
  end for
  for i = 1, \cdots, N do
    Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{i=1}^N W(x^{(i)})
  end for
  Resampling: generate N new particles x_n^{(i)} from the set \{x_n^{(i)}\} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
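A compact Python sketch of this bootstrap/SIR filter for an assumed scalar model x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n with Gaussian noises; since the transition prior is the proposal, the weight is just the likelihood p(y_n|x_n^{(i)}). The model and noise levels are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 50
sig_v, sig_w = 1.0, 0.5                      # assumed process / measurement noise std

# Simulate a ground-truth trajectory and observations for the toy model.
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + sig_v * rng.standard_normal()
y = x_true + sig_w * rng.standard_normal(T)

x = rng.standard_normal(N)                   # x_0^{(i)} ~ p(x_0) = N(0, 1)
x_est = np.zeros(T)
for n in range(T):
    if n > 0:
        x = 0.9 * x + sig_v * rng.standard_normal(N)      # importance sampling step
    w = np.exp(-0.5 * ((y[n] - x) / sig_w)**2)            # weight update: p(y_n | x_n^{(i)})
    w /= w.sum()                                          # normalize the weights
    x_est[n] = np.sum(w * x)                              # posterior-mean estimate before resampling
    x = x[rng.choice(N, size=N, replace=True, p=w)]       # resampling step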
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.
In the SIS filter, importance weights are calculated sequentially, and resampling is only performed whenever needed; the SIS filter is thus less computationally expensive.
The choice of proposal distribution plays a crucial role in the final performance of SIS and SIR filters.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
for n = 1 to N doSample u sim U(0 1)Sample x sim q(x)
if u gtp(x)
Cq(x)then
Repeat samplingend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
Figure Importance (left) and Rejection (right) sampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an Nx-dimensional state vector x = [x_1, x_2, ..., x_{Nx}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a good solution for estimating hierarchical or structured probabilistic models.
Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to Nx do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{Nx,n-1})
    end for
end for
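A minimal sketch of this sweep for a bivariate Gaussian with correlation ρ, where both conditionals are available in closed form (the target distribution is an illustrative choice):

import numpy as np

rho, M = 0.8, 5000
rng = np.random.default_rng(0)
s = np.sqrt(1.0 - rho**2)        # std dev of each conditional (unit marginal variances)

samples = np.empty((M, 2))
x1, x2 = 0.0, 0.0                # initial state x_0
for n in range(M):
    # alternately draw each coordinate from its conditional density
    x1 = rng.normal(rho * x2, s)     # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x2 = rng.normal(rho * x1, s)     # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    samples[n] = (x1, x2)

print(np.corrcoef(samples[1000:].T))  # empirical correlation ≈ 0.8 after burn-in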
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_{n-1}, x_1 is sampled from the conditional pdf p(x_1 | x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2 | x_{1,n}). Right: four-step iterations in the probability space (contour).
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) across the subregions that divide the whole space.
Stratified sampling works very well, and is efficient, in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method that uses gradient information to reduce random-walk behavior.
This is useful because the gradient direction can indicate the way to states with higher probability.
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Gauss/Laplace approximation

Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability.
By assuming the posterior to be Gaussian, nonlinear filtering can be carried out with the EKF method.
The Laplace approximation method approximates the integral of a function ∫ f(x) dx by fitting a Gaussian at the maximum x̂ of f(x), and then computes the volume:

∫ f(x) dx ≈ (2π)^{Nx/2} f(x̂) |−∇∇ log f(x̂)|^{−1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at x = x̂.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions, but produces poor approximations for multimodal distributions, especially in high-dimensional spaces (a numerical sketch follows below).
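A small numerical sketch of the Laplace formula in one dimension, using the illustrative integrand f(x) = x^5 e^{−x} on x > 0, whose exact integral is Γ(6) = 120:

import numpy as np
from scipy.optimize import minimize_scalar

log_f = lambda x: 5.0 * np.log(x) - x          # log f(x) for f(x) = x^5 e^{-x}

# Find the maximum x̂ of f (analytically x̂ = 5)
x_hat = minimize_scalar(lambda x: -log_f(x), bounds=(0.1, 50.0),
                        method='bounded').x

# Second derivative of log f at x̂ by central finite differences
h = 1e-4
H = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h**2

laplace = np.sqrt(2.0 * np.pi) * np.exp(log_f(x_hat)) * (-H) ** -0.5
print(laplace)    # ≈ 118.0, close to the exact value 120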
Iterative Quadrature
A numerical approximation method that is widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

∫_a^b f(x) p(x) dx ≈ Σ_{k=1}^m c_k f(x_k)

where p(x) is treated as a weighting function and the x_k are the quadrature points.
The values x_k are determined by the weighting function p(x) on the interval [a, b].
This method can produce a good approximation if the nonlinear function is smooth (see the sketch below).
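For a Gaussian weighting function, the c_k and x_k are given by Gauss-Hermite quadrature. A small sketch using NumPy's hermgauss, which tabulates nodes for the weight e^{−t²}, so a change of variables maps them to a general Gaussian:

import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(f, mu, sigma, m=10):
    """Approximate E[f(x)] for x ~ N(mu, sigma^2) with m quadrature points."""
    t, c = hermgauss(m)                     # nodes/weights for weight e^{-t^2}
    x = mu + np.sqrt(2.0) * sigma * t       # change of variables t -> x
    return (c / np.sqrt(np.pi)) @ f(x)      # weighted sum of integrand samples

# E[x^2] for x ~ N(1, 2^2) is mu^2 + sigma^2 = 5
print(gauss_hermite_expectation(lambda x: x**2, mu=1.0, sigma=2.0))   # ≈ 5.0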
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density p(x_n | y_{0:n}).
If the state space is continuous, we can always discretize it into Nz discrete cell states; a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, i.e., regions of high density cannot be given finer resolution.
The point-mass method uses a simple rectangular grid. The density is represented by a set of point masses that carry the information about the data.
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = ∫_X x^k p(x) dx ≈ (1/N) Σ_{i=1}^N (x^(i))^k

where m_k denotes the k-th order moment and the x^(i) are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces (a small sketch of the sample-moment estimate follows below).
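A tiny sketch of the sample-moment estimate, with an illustrative exponential target whose true moments are known:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # samples from the "true" p(x)

# k-th sample moment: m_k ≈ (1/N) * sum_i (x^(i))^k
for k in (1, 2, 3):
    print(k, (x**k).mean())   # exponential(2) has E[x^k] = 2^k k!: 2, 8, 48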
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

p(x) = Σ_{j=1}^m c_j N(x; μ_j, Σ_j)

where the weighting coefficients satisfy c_j > 0 and Σ_{j=1}^m c_j = 1.
Any non-Gaussian density can be approximated to any desired degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution, obtained by propagating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate (a small sketch follows below).
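A small sketch of evaluating such a Gaussian sum in one dimension; the two components and their weights are illustrative:

import numpy as np

def gaussian_sum(x, c, mu, sigma):
    """Evaluate sum_j c_j N(x; mu_j, sigma_j^2) at the points x."""
    x = np.atleast_1d(x)[:, None]
    comps = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ c

c = np.array([0.3, 0.7])                  # weights: positive, sum to 1
mu = np.array([-2.0, 1.5])                # component means
sigma = np.array([0.8, 1.2])              # component standard deviations
print(gaussian_sum([-2.0, 0.0, 1.5], c, mu, sigma))   # a bimodal density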
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
What have we seen
We have seen up to now:
Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?
Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many regions, which are filled with particles according to some probability measure; the higher the probability, the denser the concentration of particles.
The particle system evolves over time according to the state equation, with its evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density is unknown or hard to sample, we instead choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution:

p(x_n | y_{0:n}) ≈ (1/N) Σ_{i=1}^N δ(x_n − x_n^(i)) ≡ p̂(x_n | y_{0:n})

where the x_n^(i) are assumed to be i.i.d. samples drawn from p(x_n | y_{0:n}).
By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] ≈ ∫ f(x_n) p̂(x_n | y_{0:n}) dx_n
          = (1/N) Σ_{i=1}^N ∫ f(x_n) δ(x_n − x_n^(i)) dx_n
          = (1/N) Σ_{i=1}^N f(x_n^(i)) ≡ f̂_N(x)
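In code, this point-mass approximation turns the expectation into a plain sample average. A minimal sketch, pretending for illustration that we can draw directly from the posterior:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50_000)   # stand-in draws from p(x_n | y_{0:n})

f = lambda x: np.sin(x) + x**2          # an arbitrary nonlinear function
print(f(x).mean())                      # Monte Carlo estimate of E[f(x_n)]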
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from a so-called proposal distribution q(x_n | y_{0:n}). Let's define

W_n(x_n) = p(y_{0:n} | x_n) p(x_n) / q(x_n | y_{0:n})

We can then write

E[f(x_n)] = ∫ f(x_n) [p(x_n | y_{0:n}) / q(x_n | y_{0:n})] q(x_n | y_{0:n}) dx_n
          = ∫ f(x_n) [W_n(x_n) / p(y_{0:n})] q(x_n | y_{0:n}) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n / ∫ p(y_{0:n} | x_n) p(x_n) dx_n
          = ∫ f(x_n) W_n(x_n) q(x_n | y_{0:n}) dx_n / ∫ W_n(x_n) q(x_n | y_{0:n}) dx_n
          = E_q(x_n|y_{0:n})[W_n(x_n) f(x_n)] / E_q(x_n|y_{0:n})[W_n(x_n)]
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = E_q(x_n|y_{0:n})[W_n(x_n) f(x_n)] / E_q(x_n|y_{0:n})[W_n(x_n)]

By drawing i.i.d. samples x_n^(i) from q(x_n | y_{0:n}), we can approximate this as

E[f(x_n)] ≈ [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) f(x_n^(i)) ] / [ (1/N) Σ_{i=1}^N W_n(x_n^(i)) ]
          = Σ_{i=1}^N W̃(x_n^(i)) f(x_n^(i)) ≡ f̂(x)

where the normalized weights are defined as

W̃(x_n^(i)) = W_n(x_n^(i)) / Σ_{j=1}^N W_n(x_n^(j))
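This is the self-normalized importance sampling estimator. A minimal sketch with illustrative target and proposal, where the target is known only up to a normalization constant:

import numpy as np

rng = np.random.default_rng(0)
N = 20000

log_p = lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2    # unnormalized target N(1, 0.5^2)
log_q = lambda x: -0.5 * (x / 2.0) ** 2            # proposal N(0, 2^2)

x = rng.normal(0.0, 2.0, size=N)                   # draw from the proposal
logW = log_p(x) - log_q(x)
W = np.exp(logW - logW.max())
W_tilde = W / W.sum()                              # normalized weights

print(np.sum(W_tilde * x**2))    # ≈ E_p[x^2] = 1^2 + 0.5^2 = 1.25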
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{0:n} | y_{0:n}) = q(x_0) Π_{t=1}^n q(x_t | x_{0:t−1}, y_{0:t})

As before, the posterior can be written as

p(x_{0:n} | y_{0:n}) = p(x_{0:n−1} | y_{0:n−1}) p(y_n | x_n) p(x_n | x_{n−1}) / p(y_n | y_{0:n−1})

We can then create a recursive rule to update the weights:

W_n^(i) = p(x_{0:n}^(i) | y_{0:n}) / q(x_{0:n}^(i) | y_{0:n})
        ∝ p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) p(x_{0:n−1}^(i) | y_{0:n−1}) / [ q(x_n^(i) | x_{0:n−1}^(i), y_{0:n}) q(x_{0:n−1}^(i) | y_{0:n−1}) ]
        = W_{n−1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{0:n−1}^(i), y_{0:n})
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Sequential Importance sampling (SIS) filter I
We are often more interested in the current filtered estimate p(x_n | y_{0:n}) than in p(x_{0:n} | y_{0:n}).
Let's assume that q(x_n^(i) | x_{0:n−1}^(i), y_{0:n}) = q(x_n^(i) | x_{0:n−1}^(i), y_n); then we can write

W_n^(i) = W_{n−1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{0:n−1}^(i), y_n)

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:

N_eff = N / E_q(·|y_{0:n})[(W(x_{0:n}))^2] ≤ N

In practice this cannot be computed, so we approximate it as

N_eff ≈ 1 / Σ_{i=1}^N (W̃(x_{0:n}^(i)))^2

When N_eff falls below a threshold P, resampling is performed (see the sketch below).
N_eff can also be used to combine rejection and importance sampling.
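A sketch of the practical N_eff estimate:

import numpy as np

def effective_sample_size(w_tilde):
    """N_eff ≈ 1 / sum_i w_i^2 for normalized importance weights."""
    w = np.asarray(w_tilde)
    return 1.0 / np.sum(w**2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # 4.0: perfectly even
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))  # ≈ 1.06: degenerate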
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^(i) ~ q(x_n | x_{0:n−1}^(i), y_{0:n})
        Set x_{0:n}^(i) = {x_{0:n−1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weight W_n^(i) = W_{n−1}^(i) p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{0:n−1}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃_n^(i) = W_n^(i) / Σ_{j=1}^N W_n^(j)
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W̃_n^(i))^2
    if N_eff < P then
        Generate new x_{0:n}^(j) by resampling with replacement N times from {x_{0:n}^(i)},
        with probability P(x_{0:n}^(j) = x_{0:n}^(i)) = W̃_n^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for
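A compact Python sketch of this loop for an illustrative linear-Gaussian model, using the transition prior as proposal so the weight update reduces to the likelihood (the model, noise levels, and threshold are all assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)
N, T, P = 500, 50, 250                      # particles, time steps, N_eff threshold

# Illustrative model: x_n = 0.9 x_{n-1} + v_n,  y_n = x_n + w_n
x_true, y = [0.0], [0.0]
for n in range(1, T):
    x_true.append(0.9 * x_true[-1] + rng.normal(0.0, 1.0))
    y.append(x_true[-1] + rng.normal(0.0, 0.5))

x = rng.normal(0.0, 1.0, N)                 # initial particles ~ p(x_0)
logW = np.full(N, -np.log(N))               # weights 1/N, kept in log space
err = []
for n in range(1, T):
    x = 0.9 * x + rng.normal(0.0, 1.0, N)   # draw x_n from q = p(x_n | x_{n-1})
    logW += -0.5 * ((y[n] - x) / 0.5) ** 2  # W_n ∝ W_{n-1} p(y_n | x_n)
    W = np.exp(logW - logW.max())
    W /= W.sum()                            # normalized weights
    err.append(np.sum(W * x) - x_true[n])   # posterior-mean estimate error
    if 1.0 / np.sum(W**2) < P:              # N_eff below threshold: resample
        x = x[rng.choice(N, N, replace=True, p=W)]
        logW = np.full(N, -np.log(N))       # reset weights to 1/N

print(np.sqrt(np.mean(np.square(err))))     # filtering RMSE over the run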
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further computation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicated particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x̂_n^(i) ~ p(x_n | x_{n−1}^(i))
    end for
    Set x̂_{0:n}^(i) = {x_{0:n−1}^(i), x̂_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x̂_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W̃_n^(i) = W_n^(i) / Σ_{j=1}^N W_n^(j)
    end for
    Resampling: generate N new particles x_n^(i) from the set {x̂_n^(i)} according to W̃_n^(i)
end for
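A sketch of a single step of this bootstrap loop on the classic univariate nonlinear growth model x_n = x_{n−1}/2 + 25 x_{n−1}/(1 + x_{n−1}²) + 8 cos(1.2 n) + v_n, y_n = x_n²/20 + w_n, often used to benchmark particle filters (the constants and observation value below are illustrative):

import numpy as np

rng = np.random.default_rng(1)
N = 1000
x = rng.normal(0.0, np.sqrt(5.0), N)        # particles at time n-1

def bootstrap_step(x, y_n, n, rng):
    """Propagate through the transition prior, weight by the likelihood, resample."""
    x = (x / 2 + 25 * x / (1 + x**2) + 8 * np.cos(1.2 * n)
         + rng.normal(0.0, np.sqrt(10.0), x.size))
    logW = -0.5 * (y_n - x**2 / 20) ** 2    # w_n ~ N(0, 1) measurement noise
    W = np.exp(logW - logW.max())
    W /= W.sum()
    est = np.sum(W * x)                     # estimate before resampling
    return x[rng.choice(x.size, x.size, replace=True, p=W)], est

x, est = bootstrap_step(x, y_n=2.5, n=1, rng=rng)
print(est)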
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution plays a crucial role in the final performance of both SIS and SIR filters.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate sample degeneracy in the SIS filter, we can change the update to

W_n^(i) = (W_{n−1}^(i))^α p(y_n | x_n^(i)) p(x_n^(i) | x_{n−1}^(i)) / q(x_n^(i) | x_{0:n−1}^(i), y_n)

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Popular CONDENSATION
Figure: The CONDENSATION algorithm.
Rejection sampling
Rejection sampling is useful when we know (pointwise) an upper bound on the underlying distribution or density.
Assume there exists a known constant C < ∞ such that p(x) < C q(x) for every x ∈ X.

Figure: Importance (left) and rejection (right) sampling (Chen 03).
The acceptance probability for a random variable is inversely proportional to the constant C.
The choice of C is critical:
if C is too small, the samples are not reliable because of the low rejection rate;
if C is too large, the sampling is inefficient since the acceptance rate will be low.
If the prior p(x) is used as q(x), the likelihood satisfies p(y|x) ≤ C, and C is known, then

p(x|y) = p(y|x) p(x) / p(y) ≤ C q(x) / p(y) ≡ C′ q(x)

and the acceptance rate for a sample x is p(x|y) / (C′ q(x)) = p(y|x) / C.
Remarks on rejection sampling
The draws obtained from rejection sampling are exact.
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable.
It usually takes a long time to get samples when the ratio p(x)/(C q(x)) is close to zero (a minimal sketch follows below).
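A minimal sketch of rejection sampling for an illustrative Beta(2,2) target p(x) = 6x(1−x) on [0,1], with a uniform proposal q and the bound C = 1.5 = max p(x):

import numpy as np

rng = np.random.default_rng(0)
p = lambda x: 6.0 * x * (1.0 - x)   # target density, maximum 1.5 at x = 0.5
C = 1.5                             # p(x) <= C q(x) with q = Uniform(0, 1)

accepted = []
while len(accepted) < 10_000:
    x = rng.uniform()               # draw a candidate x ~ q
    if rng.uniform() < p(x) / C:    # accept with probability p(x) / (C q(x))
        accepted.append(x)

a = np.array(accepted)
print(a.mean(), a.var())            # ≈ 0.5 and ≈ 0.05 for Beta(2, 2)

The acceptance rate here is 1/C = 2/3, illustrating the remark above: a larger C would waste more draws.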
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling,
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space.
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling.
If the proposal distribution is chosen in a factorized form

q(x_{0:n} | y_{0:n}) = q(x_0) Π_{t=1}^n q(x_t | x_{0:t−1}, y_{0:t})

then the importance sampling can be performed recursively.
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{0:n}) = p(x_0) p(x_1 | x_0) ··· p(x_n | x_0, ..., x_{n−1})
q(x_{0:n}) = q_0(x_0) q_1(x_1 | x_0) ··· q_n(x_n | x_0, ..., x_{n−1})

The weights can be recursively calculated as

W_n(x_{0:n}) = p(x_{0:n}) / q(x_{0:n}) = W_{n−1}(x_{0:n−1}) p(x_n | x_{0:n−1}) / q_n(x_n | x_{0:n−1})
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesn't rely on the underlying Markov chain.
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.
The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the W(x^(i)) will be nonzero.
To cope with this situation, a resampling step is used after weight normalization, as described earlier.
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Rejection sampling
Rejection sampling is useful when we know (pointwise) the upper bound ofunderlying distribution or density
Assume there exists a known constant C ltinfin such that p(x) lt Cq(x) forevery x isin X the sampling
The acceptance probability for a random variable is inversely proportional tothe constant C
The choice of C is critical
if C the samples are not reliable because of low rejection rateif C inefficient sampling since the acceptance rate will be low
If the prior p(x) is used as q(x) and the likelihood p(y|x) le C and C isknown then
p(x|y) =p(y|x)p(x)
p(y)le Cq(x)
p(y)equiv C primeq(x)
and the acceptance rate for sample x is p(x|y)C primeq(x) = p(y|x)
C
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 34 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is the prior knowledge ofconstant C which is sometimes unavailable
It usually takes a long time to get the samples when the ratiop(x)Cq(x) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on rejection sampling
The draws obtained from rejection sampling are exact
The prerequisite of rejection sampling is prior knowledge of the constant C, which is sometimes unavailable
It usually takes a long time to get the samples when the ratio p(x)/(C q(x)) is close to zero
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 35 69
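As a concrete illustration, here is a minimal Python sketch of rejection sampling, assuming a toy bimodal target p, a wide Gaussian proposal q, and a bound C = 3 chosen by hand (all illustrative choices, not from the slides):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def p(x):
    # Bimodal target density (an illustrative choice).
    return 0.5 * norm.pdf(x, -2, 0.7) + 0.5 * norm.pdf(x, 2, 0.7)

def q(x):
    return norm.pdf(x, 0, 3)   # Wide Gaussian proposal

C = 3.0                        # Chosen by inspection so that p(x) <= C q(x) everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(0, 3)                       # Draw a candidate from q
        if rng.uniform() <= p(x) / (C * q(x)):     # Accept with probability p(x) / (C q(x))
            samples.append(x)
    return np.array(samples)

print(rejection_sample(1000).mean())               # Exact draws from p; roughly 1/C are accepted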
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importance sampling
but it is usually difficult to find a good proposal distribution, especially in a high-dimensional space
A natural way to alleviate this problem is to construct the proposal distribution sequentially; this is sequential importance sampling
If the proposal distribution is chosen in a factorized form

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability, we have

p(x_{n:0}) = p(x_0) p(x_1|x_0) \cdots p(x_n|x_0, \cdots, x_{n-1})
q(x_{n:0}) = q_0(x_0) q_1(x_1|x_0) \cdots q_n(x_n|x_0, \cdots, x_{n-1})

The weights can be recursively calculated as

W_n(x_{n:0}) = \frac{p(x_{n:0})}{q(x_{n:0})} = W_{n-1}(x_{n-1:0}) \frac{p(x_n|x_{n-1:0})}{q_n(x_n|x_{n-1:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
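A minimal numerical sketch of this recursion, assuming toy AR(1)-style dynamics and a deliberately mismatched Gaussian proposal (both invented for illustration); it also previews the weight degeneracy discussed on the next slide:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, T = 500, 20

x = rng.normal(0.0, 2.0, size=N)                   # x_0 drawn from q_0 = N(0, 2^2)
W = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)          # W_0 = p(x_0) / q_0(x_0), with prior p(x_0) = N(0, 1)

for n in range(1, T):
    x_new = 0.9 * x + rng.normal(0, 1.5, N)        # Propose from q_n, wider than the true dynamics
    # Recursive update: W_n = W_{n-1} * p(x_n | x_{n-1:0}) / q_n(x_n | x_{n-1:0})
    W = W * norm.pdf(x_new, 0.9 * x, 1.0) / norm.pdf(x_new, 0.9 * x, 1.5)
    x = x_new

W = W / W.sum()
print(W.max())   # The largest normalized weight typically dominates: weight degeneracy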
Remarks on Sequential Importance Sampling
The advantage of SIS is that it does not rely on the underlying Markov chain
Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency
The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates
The variance of the importance weights increases over time (the weight degeneracy problem): after a few iterations of the algorithm, only a few (or one) of the W(x^{(i)}) will be nonzero
We will now see that, in order to cope with this situation, a resampling step is suggested after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed at eliminating the samples with small importance weights and duplicating the samples with large weights

Sample N random samples {x^{(i)}}_{i=1}^N from q(x)
for i = 1, ..., N do
    W^{(i)} \propto p(x^{(i)}) / q(x^{(i)})
end for
for i = 1, ..., N do
    Normalize the weights: W(x^{(i)}) = W^{(i)} / \sum_{j=1}^N W^{(j)}
end for
Resample with replacement N times from the discrete set {x^{(i)}}_{i=1}^N, where the probability of resampling each x^{(i)} is proportional to W(x^{(i)})
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
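A minimal sketch of the SIR procedure above, assuming a Gaussian target p and a wider Gaussian proposal q (both illustrative placeholders):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N = 2000

x = rng.normal(0, 3, size=N)                     # Sample N draws from the proposal q = N(0, 3^2)
w = norm.pdf(x, 1.0, 0.5) / norm.pdf(x, 0, 3)    # W^{(i)} proportional to p/q, target p = N(1, 0.5^2)
w = w / w.sum()                                  # Normalize the weights

idx = rng.choice(N, size=N, replace=True, p=w)   # Resample with replacement, proportional to the weights
x_new = x[idx]                                   # Approximately distributed according to p
print(x_new.mean(), x_new.std())                 # Roughly 1.0 and 0.5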
Remarks on Sampling Importance Resampling
Resampling can be performed at every step, or only when deemed necessary
Deterministic resampling is performed every k time steps (usually k = 1). Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold
The particles and associated importance weights {x^{(i)}, W^{(i)}} are replaced by the new samples with equal importance weights (i.e., W^{(i)} = 1/N)
Resampling is important because
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power; when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It is a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling
Given an N_x-dimensional state vector x = [x_1, x_2, \cdots, x_{N_x}]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional, the Gibbs sampler is a nice solution to the estimation of hierarchical or structured probabilistic models

Draw a sample x_0 ~ p(x_0)
for n = 1 to M do
    for i = 1 to N_x do
        Draw a sample x_{i,n} ~ p(x_i | x_{1,n}, ..., x_{i-1,n}, x_{i+1,n-1}, ..., x_{N_x,n-1})
    end for
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
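A minimal sketch of this sweep for a bivariate Gaussian target with correlation rho (a standard textbook choice, assumed here for illustration); its two conditionals are exactly the ones pictured in the figure that follows:

import numpy as np

rng = np.random.default_rng(3)
rho, M = 0.8, 5000
x1, x2 = 0.0, 0.0                    # Arbitrary starting state x_0
chain = np.empty((M, 2))

for n in range(M):
    # Conditionals of a standard bivariate Gaussian with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    chain[n] = (x1, x2)

print(np.corrcoef(chain[1000:].T))   # Empirical correlation close to rho after burn-in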
Illustration of Gibbs sampling
Figure: Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state x_n, x_1 is sampled from the conditional pdf p(x_1|x_{2,n-1}). Middle: a sample is drawn from the conditional pdf p(x_2|x_{1,n}). Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) over the subregions that divide the whole space
Stratified sampling works very well and is efficient in a not-too-high-dimensional space
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior
This is good since the gradient direction might indicate the way to find the states with higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate the numerical integration problem, because of its analytic tractability
By assuming the posterior to be Gaussian, the nonlinear filtering can be performed with the EKF method
The Laplace approximation method approximates the integral of a function \int f(x) dx by fitting a Gaussian at the maximum \hat{x} of f(x), and further computes the volume

\int f(x) dx \approx (2\pi)^{N_x/2} f(\hat{x}) \, |-\nabla\nabla \log f(\hat{x})|^{-1/2}

The covariance of the fitted Gaussian is determined by the Hessian matrix of log f(x) at \hat{x}
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate
It works for unimodal distributions, but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
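A sketch of the Laplace formula above in one dimension, assuming a toy near-Gaussian integrand and a finite-difference Hessian of log f at the mode (all choices are illustrative):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2 / 0.3 ** 2) * (1.0 + 0.1 * np.sin(x))

x_hat = minimize_scalar(lambda x: -np.log(f(x)), bounds=(-5, 5), method="bounded").x

h = 1e-4   # Finite-difference second derivative of log f at the mode
d2 = (np.log(f(x_hat + h)) - 2 * np.log(f(x_hat)) + np.log(f(x_hat - h))) / h ** 2

laplace = np.sqrt(2 * np.pi) * f(x_hat) * (-d2) ** -0.5   # (2 pi)^{1/2} f(x_hat) |d2|^{-1/2} for N_x = 1
print(laplace, quad(f, -np.inf, np.inf)[0])               # Close, since f is unimodal and near-Gaussian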
Iterative Quadrature
A numerical approximation method which was widely used in computer graphics and physics
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula

\int_a^b f(x) p(x) dx \approx \sum_{k=1}^{m} c_k f(x_k)

where p(x) is treated as a weighting function and x_k is the k-th quadrature point
The values x_k are determined by the weighting function p(x) in the interval [a, b]
This method can produce a good approximation if the nonlinear function is smooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
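A sketch of this idea using the Gauss-Hermite rule from NumPy, assuming p(x) is a Gaussian weighting function; the substitution x = mu + sqrt(2) * sigma * t is the standard change of variables for this rule:

import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(20)   # x_k and c_k for the weight exp(-t^2)

mu, sigma = 0.5, 1.2
f = np.tanh                                            # A smooth nonlinear integrand
# E[f(x)] for x ~ N(mu, sigma^2), via the substitution x = mu + sqrt(2) * sigma * t
approx = np.sum(weights * f(mu + np.sqrt(2.0) * sigma * nodes)) / np.sqrt(np.pi)

rng = np.random.default_rng(4)
print(approx, f(rng.normal(mu, sigma, 200_000)).mean())   # Quadrature vs. Monte Carlo check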
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density p(x_n|y_{n:0})
If the state space is continuous, we can always discretize the state space into N_z discrete cell states; a grid-based method can then be used to approximate the posterior density
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so regions of high density cannot be given finer resolution
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
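A sketch of one point-mass filter step on a fixed rectangular grid, assuming a toy 1D random-walk transition and a Gaussian likelihood (illustrative choices):

import numpy as np
from scipy.stats import norm

grid = np.linspace(-10, 10, 401)                  # N_z cell centres of the rectangular grid
post = norm.pdf(grid, 0, 2)
post = post / post.sum()                          # Point masses representing p(x_0 | y_0)

def grid_step(post, y, q_std=1.0, r_std=0.5):
    # Predict: apply the transition density p(x_n | x_{n-1}) on the grid (random-walk dynamics).
    trans = norm.pdf(grid[:, None], grid[None, :], q_std)
    pred = trans @ post
    # Update: multiply by the likelihood p(y_n | x_n) and renormalize.
    new = pred * norm.pdf(y, grid, r_std)
    return new / new.sum()

post = grid_step(post, y=1.3)
print(grid[np.argmax(post)])                      # Grid-based MAP estimate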
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments
We can empirically use the sample moments to approximate the true moments, namely

m_k = E[x^k] = \int_X x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
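A small sketch of sample moments, assuming Gamma-distributed samples so that the exact moments are known for comparison:

import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # Samples standing in for the "true" distribution

for k in (1, 2, 3):
    print(k, np.mean(x ** k))   # m_k approximates E[x^k]; the exact values here are 3.0, 13.5, 81.0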
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1
Any non-Gaussian density can be approximated to some accuracy by a sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits a tractable solution, obtained by calculating the individual first- and second-order moments
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
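A small sketch evaluating such a mixture density; the weights c_j, means, and standard deviations below are illustrative placeholders:

import numpy as np
from scipy.stats import norm

c = np.array([0.5, 0.3, 0.2])       # Mixture weights c_j > 0 summing to 1
mu = np.array([0.0, 1.5, 3.0])      # Component means (placeholders)
sigma = np.array([0.6, 0.9, 1.4])   # Component standard deviations (placeholders)

def p(x):
    # p(x) = sum_j c_j N(x; mu_j, sigma_j^2)
    x = np.atleast_1d(x)[:, None]
    return (c * norm.pdf(x, mu, sigma)).sum(axis=1)

print(p(np.linspace(-3.0, 7.0, 5)))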
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter
The state space is partitioned into many parts, in which the particles are placed according to some probability measure. The higher the probability, the denser the particles are concentrated
The particle system evolves along time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf
However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. draws from p(x_n|y_{n:0})
By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
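A small sketch of this estimator, assuming we already hold i.i.d. samples standing in for draws from the posterior:

import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(1.0, 0.5, size=10_000)   # Stand-ins for i.i.d. samples x_n^{(i)} from the posterior

f = lambda z: np.sin(z) ** 2            # Any nonlinear function of the state
print(f(x).mean())                      # f_N = (1/N) sum_i f(x^{(i)})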
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing the i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} W(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

W(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
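A sketch of this self-normalized estimator, assuming a heavy-tailed Student-t proposal and a Gaussian stand-in for the posterior (both illustrative choices):

import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(7)
N, scale = 5000, 2.0
x = t.rvs(df=3, scale=scale, size=N, random_state=rng)    # Draws from the proposal q

w = norm.pdf(x, 1.0, 0.7) / t.pdf(x, df=3, scale=scale)   # W_n(x^{(i)}) proportional to p/q
w = w / w.sum()                                           # Normalized weights

f = lambda z: z ** 2
print(np.sum(w * f(x)))                                   # Self-normalized estimate; exact value is 1.49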
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0})
Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it from the normalized weights as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (W(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, then resampling is performed
N_{eff} can also be used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
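A small sketch of this approximation for normalized weights:

import numpy as np

def effective_sample_size(w):
    # N_eff approximated as 1 / sum_i W(x^{(i)})^2, for normalized importance weights.
    w = np.asarray(w)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))              # Uniform weights: N_eff = 100
print(effective_sample_size(np.r_[0.91, np.full(9, 0.01)]))   # Skewed weights: N_eff close to 1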
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^{(i)}) = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
    end for
    Compute N_eff = 1 / \sum_{i=1}^N (W(x_{n:0}^{(i)}))^2
    if N_eff < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}}, with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = W_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
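A sketch of the boxed algorithm for a toy 1D linear-Gaussian model, using the transition prior as proposal (so the weight update reduces to the likelihood term); the model parameters and the threshold P are illustrative choices:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Toy linear-Gaussian state-space model (illustrative): x_n = a x_{n-1} + v_n, y_n = x_n + e_n
a, q_std, r_std, T, N, P = 0.9, 1.0, 0.7, 50, 1000, 500

x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = a * x_true[n - 1] + rng.normal(0, q_std)
y = x_true + rng.normal(0, r_std, T)

x = rng.normal(0, 1, N)                       # Initial particles
W = np.full(N, 1.0 / N)
est = np.zeros(T)

for n in range(T):
    if n > 0:
        x = a * x + rng.normal(0, q_std, N)   # Draw x_n^{(i)} from q = p(x_n | x_{n-1}^{(i)})
    W = W * norm.pdf(y[n], x, r_std)          # Weight update; only the likelihood remains for this q
    W = W / W.sum()                           # Normalize the weights
    est[n] = np.sum(W * x)                    # Posterior-mean estimate, computed before resampling
    if 1.0 / np.sum(W ** 2) < P:              # Resample only when N_eff drops below the threshold
        x = x[rng.choice(N, size=N, replace=True, p=W)]
        W = np.full(N, 1.0 / N)               # Reset the weights to 1/N

print("RMSE:", np.sqrt(np.mean((est - x_true) ** 2)))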
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    end for
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^{(i)}) = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights W_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
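The bootstrap/SIR step differs from the previous sketch mainly in that it resamples at every step; a compact, self-contained version under the same toy-model assumptions:

import numpy as np
from scipy.stats import norm

def sir_step(x, y_n, rng, a=0.9, q_std=1.0, r_std=0.7):
    # One bootstrap/SIR step: propagate with the transition prior,
    # weight by the likelihood, then always resample.
    x = a * x + rng.normal(0, q_std, x.size)   # Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    W = norm.pdf(y_n, x, r_std)                # Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    W = W / W.sum()                            # Normalize the weights
    est = np.sum(W * x)                        # Estimate taken before resampling
    return x[rng.choice(x.size, size=x.size, replace=True, p=W)], est

rng = np.random.default_rng(9)
particles = rng.normal(0, 1, 1000)             # x_0^{(i)} ~ p(x_0), equal initial weights
particles, est = sir_step(particles, y_n=0.4, rng=rng)
print(est)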
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only taken whenever needed; the SIS filter is thus less computationally expensive
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used
To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: The CONDENSATION algorithm.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Importance Sampling I
A good proposal distribution is essential to the efficiency of importancesampling
but it is usually difficult to find a good proposal distribution especially ina high-dimensional space
A natural way to alleviate this problem is to construct the proposaldistribution sequentially this is sequential importance sampling
if the proposal distribution is chosen in a factorized form
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
then the importance sampling can be performed recursively
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 36 69
Sequential Importance Sampling II
According to the telescope law of probability we have
p(xn0) = p(x0)p(x1|x0) middot middot middot p(xn|x0 middot middot middot xnminus1)
q(xn0) = q0(x0)q1(x1|x0) middot middot middot qn(xn|x0 middot middot middot xnminus1)
The weights can be recursively calculated as
Wn(xn0) =p(xn0)
q(xn0)= Wnminus1(xn0)
p(xn|xnminus10)
qn(xn|xnminus10)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 37 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature

Numerical approximation method that has been widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$$\int_a^b f(x)p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
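As one concrete instance (an assumption, since the slide fixes no particular rule), Gauss-Hermite quadrature approximates expectations under a Gaussian weighting function; numpy supplies the nodes and weights:

import numpy as np

def gauss_hermite_expectation(f, deg=20):
    # Nodes/weights for the weight exp(-x^2); the change of variables x -> sqrt(2)*x
    # turns this into an expectation under a standard normal N(0, 1).
    x, w = np.polynomial.hermite.hermgauss(deg)
    return (w * f(np.sqrt(2.0) * x)).sum() / np.sqrt(np.pi)

print(gauss_hermite_expectation(lambda x: x**2))   # E[x^2] = 1
print(gauss_hermite_expectation(np.cos))           # E[cos x] = exp(-1/2) ~ 0.6065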
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(\mathbf{x}_n|\mathbf{y}_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density.

The disadvantage of grid-based methods is the fixed grid: the state space cannot be partitioned unevenly so as to give greater resolution to the regions of high density.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moment to approximate the true moment, namely

$$m_k = E[x^k] = \int_{\mathcal{X}} x^k p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} |x^{(i)}|^k$$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
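A one-line sketch of the sample-moment approximation, using a standard normal as an assumed target whose even moments are known in closed form:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)      # samples from the "true" distribution
m4 = np.mean(np.abs(x) ** 4)          # sample 4th moment
print(m4)                             # E[x^4] = 3 for N(0, 1)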
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$$p(\mathbf{x}) = \sum_{j=1}^{m} c_j\,\mathcal{N}(\hat{\mathbf{x}}_j, \Sigma_j)$$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to any desired degree of accuracy by a sufficiently large number of Gaussian mixture densities.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
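A small sketch evaluating such a mixture density; the component weights, means, and standard deviations below are made-up numbers for illustration only:

import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])        # c_j > 0, summing to 1
means   = np.array([-2.0, 1.0])
stds    = np.array([0.5, 1.5])

def mixture_pdf(x):
    # Weighted sum of Gaussian densities evaluated at x.
    return sum(c * norm.pdf(x, mu, s) for c, mu, s in zip(weights, means, stds))

print(mixture_pdf(0.0))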
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03). (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are filled according to some probability measure. The higher the probability, the denser the particles are concentrated.

The particle system evolves along time according to the state equation, with its evolving pdf determined by the FPK equation.

Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$$p(\mathbf{x}_n|\mathbf{y}_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N}\delta(\mathbf{x}_n - \mathbf{x}_n^{(i)}) \equiv \hat{p}(\mathbf{x}_n|\mathbf{y}_{0:n})$$

where the $\mathbf{x}_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(\mathbf{x}_n|\mathbf{y}_{0:n})$.

By this approximation, we can estimate the mean of a nonlinear function:

$$E[f(\mathbf{x}_n)] \approx \int f(\mathbf{x}_n)\hat{p}(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^{N}\int f(\mathbf{x}_n)\delta(\mathbf{x}_n - \mathbf{x}_n^{(i)})\,d\mathbf{x}_n = \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{x}_n^{(i)}) \equiv \hat{f}_N(\mathbf{x})$$
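To make the plug-in estimator concrete, a tiny sketch assuming we could draw i.i.d. posterior samples (here faked with a known Gaussian) and taking f = tanh as an arbitrary nonlinear function:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=10_000)   # stand-in for posterior samples x_n^(i)
f_hat = np.tanh(x).mean()               # Monte Carlo estimate of E[f(x_n)]
print(f_hat)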
Sequential Monte Carlo estimation II

It is usually impossible to sample from the true posterior, so it is common to sample from the so-called proposal distribution $q(\mathbf{x}_n|\mathbf{y}_{0:n})$. Let's define

$$W_n(\mathbf{x}_n) = \frac{p(\mathbf{y}_{0:n}|\mathbf{x}_n)p(\mathbf{x}_n)}{q(\mathbf{x}_n|\mathbf{y}_{0:n})}$$

We can then write

$$E[f(\mathbf{x}_n)] = \int f(\mathbf{x}_n)\frac{p(\mathbf{x}_n|\mathbf{y}_{0:n})}{q(\mathbf{x}_n|\mathbf{y}_{0:n})}\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n = \int f(\mathbf{x}_n)\frac{W_n(\mathbf{x}_n)}{p(\mathbf{y}_{0:n})}\,q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n$$

$$= \frac{\int f(\mathbf{x}_n)W_n(\mathbf{x}_n)q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}{\int p(\mathbf{y}_{0:n}|\mathbf{x}_n)p(\mathbf{x}_n)\,d\mathbf{x}_n} = \frac{\int f(\mathbf{x}_n)W_n(\mathbf{x}_n)q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}{\int W_n(\mathbf{x}_n)q(\mathbf{x}_n|\mathbf{y}_{0:n})\,d\mathbf{x}_n}$$

$$= \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)]}$$
Sequential Monte Carlo estimation III

We have written

$$E[f(\mathbf{x}_n)] = \frac{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)f(\mathbf{x}_n)]}{E_{q(\mathbf{x}_n|\mathbf{y}_{0:n})}[W_n(\mathbf{x}_n)]}$$

By drawing i.i.d. samples $\mathbf{x}_n^{(i)}$ from $q(\mathbf{x}_n|\mathbf{y}_{0:n})$, we can approximate

$$E[f(\mathbf{x}_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(\mathbf{x}_n^{(i)})f(\mathbf{x}_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(\mathbf{x}_n^{(i)})} = \sum_{i=1}^{N}\tilde{W}(\mathbf{x}_n^{(i)})f(\mathbf{x}_n^{(i)}) \equiv \hat{f}(\mathbf{x})$$

where the normalized weights are defined as

$$\tilde{W}(\mathbf{x}_n^{(i)}) = \frac{W_n(\mathbf{x}_n^{(i)})}{\sum_{j=1}^{N} W_n(\mathbf{x}_n^{(j)})}$$
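A minimal sketch of this self-normalized importance-sampling estimator, with an assumed standard normal target p and a wider normal proposal q (chosen only for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=50_000)               # draws from the proposal q = N(0, 2^2)
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)   # unnormalized weights W = p/q
w_tilde = w / w.sum()                               # normalized weights
print((w_tilde * x**2).sum())                       # E[x^2] under p = N(0, 1), ~1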
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes:

$$q(\mathbf{x}_{0:n}|\mathbf{y}_{0:n}) = q(\mathbf{x}_0)\prod_{t=1}^{n} q(\mathbf{x}_t|\mathbf{x}_{0:t-1},\mathbf{y}_{0:t})$$

As before, the posterior can be written as

$$p(\mathbf{x}_{0:n}|\mathbf{y}_{0:n}) = p(\mathbf{x}_{0:n-1}|\mathbf{y}_{0:n-1})\,\frac{p(\mathbf{y}_n|\mathbf{x}_n)p(\mathbf{x}_n|\mathbf{x}_{n-1})}{p(\mathbf{y}_n|\mathbf{y}_{0:n-1})}$$

We can then create a recursive rule to update the weights:

$$W_n^{(i)} = \frac{p(\mathbf{x}_{0:n}^{(i)}|\mathbf{y}_{0:n})}{q(\mathbf{x}_{0:n}^{(i)}|\mathbf{y}_{0:n})} \propto \frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})p(\mathbf{x}_{0:n-1}^{(i)}|\mathbf{y}_{0:n-1})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_{0:n})\,q(\mathbf{x}_{0:n-1}^{(i)}|\mathbf{y}_{0:n-1})} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_{0:n})}$$
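A sketch of one such weight update for a toy scalar linear-Gaussian model (an assumed example, with the transition prior used as proposal so the update reduces to the likelihood term):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1000
x_prev = rng.standard_normal(N)          # particles at time n-1
w_prev = np.full(N, 1.0 / N)             # previous (normalized) weights

# Proposal = transition prior p(x_n | x_{n-1}) = N(0.9 * x_{n-1}, 1):
x_new = 0.9 * x_prev + rng.standard_normal(N)
y_n = 1.3                                # current observation, assumed y_n = x_n + noise
w_new = w_prev * norm.pdf(y_n, loc=x_new, scale=0.5)   # W_n = W_{n-1} * p(y_n | x_n)
w_new /= w_new.sum()                     # renormalize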
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate $p(\mathbf{x}_n|\mathbf{y}_{0:n})$ than in $p(\mathbf{x}_{0:n}|\mathbf{y}_{0:n})$.

Let's assume that $q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_{0:n}) = q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_n)$; then we can write

$$W_n^{(i)} = W_{n-1}^{(i)}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_n)}$$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure for degeneracy is the so-called effective sample size:

$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|\mathbf{y}_{0:n})}[(W(\mathbf{x}_{0:n}))^2]} \le N$$

In practice this cannot be computed, so we approximate it as

$$\hat{N}_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N}(\tilde{W}(\mathbf{x}_{0:n}^{(i)}))^2}$$

When $\hat{N}_{\text{eff}}$ is below a threshold $P$, resampling is performed.

$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
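The approximation above is a one-liner given normalized weights (the weights below are made-up numbers):

import numpy as np

w = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # normalized importance weights
n_eff = 1.0 / np.sum(w**2)
print(n_eff)                                   # ~1.9: far below N = 5, so resample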
SIS particle filter with resampling

for n = 0, ..., T do
  for i = 1, ..., N do
    Draw a sample $\mathbf{x}_n^{(i)} \sim q(\mathbf{x}_n|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_{0:n})$
    Set $\mathbf{x}_{0:n}^{(i)} = \{\mathbf{x}_{0:n-1}^{(i)}, \mathbf{x}_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate the weights $W_n^{(i)} = W_{n-1}^{(i)}\,\dfrac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)})/\sum_{j=1}^{N} W(\mathbf{x}^{(j)})$
  end for
  Compute $\hat{N}_{\text{eff}} = 1/\sum_{i=1}^{N}(\tilde{W}(\mathbf{x}_{0:n}^{(i)}))^2$
  if $\hat{N}_{\text{eff}} < P$ then
    Generate new $\mathbf{x}_{0:n}^{(j)}$ by resampling with replacement $N$ times from $\{\mathbf{x}_{0:n}^{(i)}\}$ with probability $P(\mathbf{x}_{0:n}^{(j)} = \mathbf{x}_{0:n}^{(i)}) = \tilde{W}^{(i)}$
    Reset the weights $W_n^{(i)} = \frac{1}{N}$
  end if
end for
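A sketch of the multinomial resampling step used above (systematic resampling is a common lower-variance alternative, not shown):

import numpy as np

def multinomial_resample(particles, weights, rng):
    # Draw N indices with replacement, with probability given by the normalized weights.
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)   # resampled particles, reset weights

rng = np.random.default_rng(0)
parts = np.arange(5.0)
w = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
new_parts, new_w = multinomial_resample(parts, w, rng)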
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of the corresponding particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
  Sample $\mathbf{x}_0^{(i)} \sim p(\mathbf{x}_0)$
  Compute $W_0^{(i)} = \frac{1}{N}$
end for
for n = 1, ..., T do
  for i = 1, ..., N do
    Importance sampling: $\mathbf{x}_n^{(i)} \sim p(\mathbf{x}_n|\mathbf{x}_{n-1}^{(i)})$
  end for
  Set $\mathbf{x}_{0:n}^{(i)} = \{\mathbf{x}_{0:n-1}^{(i)}, \mathbf{x}_n^{(i)}\}$
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(\mathbf{y}_n|\mathbf{x}_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)})/\sum_{j=1}^{N} W(\mathbf{x}^{(j)})$
  end for
  Resampling: generate $N$ new particles $\mathbf{x}_n^{(i)}$ from the set $\{\mathbf{x}_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
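Putting the pieces together, a compact bootstrap-filter sketch for an assumed scalar model $x_n = 0.9\,x_{n-1} + v_n$, $y_n = x_n + w_n$ with Gaussian noise; this is an illustrative sketch, not the slides' canonical code:

import numpy as np
from scipy.stats import norm

def bootstrap_filter(ys, N=500, a=0.9, q=1.0, r=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)                   # x_0^(i) ~ p(x_0)
    means = []
    for y in ys:
        x = a * x + q * rng.standard_normal(N)   # propagate through the transition prior
        w = norm.pdf(y, loc=x, scale=r)          # weight by the likelihood p(y_n | x_n)
        w /= w.sum()
        means.append(np.dot(w, x))               # posterior-mean estimate, before resampling
        x = x[rng.choice(N, size=N, p=w)]        # multinomial resampling at every step
    return np.array(means)

# Simulate the same model and filter it.
rng = np.random.default_rng(1)
T = 50
x_true, ys, x = np.zeros(T), np.zeros(T), 0.0
for t in range(T):
    x = 0.9 * x + rng.standard_normal()
    x_true[t] = x
    ys[t] = x + 0.5 * rng.standard_normal()
print(np.abs(bootstrap_filter(ys) - x_true).mean())   # mean absolute filtering error

Note that the posterior mean is read off before resampling, in line with the remark below that estimates should be computed before the resampling stage.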
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters

In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n = W_{n-1}^{\alpha}\,\frac{p(\mathbf{y}_n|\mathbf{x}_n^{(i)})p(\mathbf{x}_n^{(i)}|\mathbf{x}_{n-1}^{(i)})}{q(\mathbf{x}_n^{(i)}|\mathbf{x}_{0:n-1}^{(i)},\mathbf{y}_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Popular: CONDENSATION

Figure: CONDENSATION.
Sequential Importance Sampling II

According to the telescope law of probability, we have

$$p(\mathbf{x}_{0:n}) = p(\mathbf{x}_0)p(\mathbf{x}_1|\mathbf{x}_0)\cdots p(\mathbf{x}_n|\mathbf{x}_0,\cdots,\mathbf{x}_{n-1})$$
$$q(\mathbf{x}_{0:n}) = q_0(\mathbf{x}_0)q_1(\mathbf{x}_1|\mathbf{x}_0)\cdots q_n(\mathbf{x}_n|\mathbf{x}_0,\cdots,\mathbf{x}_{n-1})$$

The weights can be recursively calculated as

$$W_n(\mathbf{x}_{0:n}) = \frac{p(\mathbf{x}_{0:n})}{q(\mathbf{x}_{0:n})} = W_{n-1}(\mathbf{x}_{0:n-1})\,\frac{p(\mathbf{x}_n|\mathbf{x}_{0:n-1})}{q_n(\mathbf{x}_n|\mathbf{x}_{0:n-1})}$$
Remarks on Sequential Importance Sampling

The advantage of SIS is that it doesn't rely on the underlying Markov chain.

Many i.i.d. replicates are run to create an importance sampler, which consequently improves the efficiency.

The disadvantage of SIS is that the importance weights may have large variances, resulting in inaccurate estimates.

The variance of the importance weights increases over time: the weight degeneracy problem. After a few iterations of the algorithm, only a few (or even one) of the $W(\mathbf{x}^{(i)})$ will be nonzero.

We will see now that, to cope with this situation, a resampling step is used after weight normalization.
Sampling Importance Resampling (SIR)

The idea is to evaluate the properties of an estimator through the empirical cumulative distribution function (cdf) of the samples instead of the true cdf.

The resampling step aims to eliminate the samples with small importance weights and duplicate the samples with big weights.

Sample $N$ random samples $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ from $q(\mathbf{x})$
for i = 1, ..., N do
  $W^{(i)} \propto \dfrac{p(\mathbf{x}^{(i)})}{q(\mathbf{x}^{(i)})}$
end for
for i = 1, ..., N do
  Normalize the weights $\tilde{W}(\mathbf{x}^{(i)}) = W(\mathbf{x}^{(i)})/\sum_{j=1}^{N} W(\mathbf{x}^{(j)})$
end for
Resample with replacement $N$ times from the discrete set $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$, where the probability of resampling from each $\mathbf{x}^{(i)}$ is proportional to $\tilde{W}(\mathbf{x}^{(i)})$
Remarks on Sampling Importance Resampling

Resampling can be performed at every step, or only when regarded as necessary:
Deterministic resampling is performed at every $k$-th time step (usually $k = 1$).
Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold.

The particles and associated importance weights $\{\mathbf{x}^{(i)}, W^{(i)}\}$ are replaced by the new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).

Resampling is important because:
if the importance weights are unevenly distributed, propagating the trivial weights through the dynamic system is a waste of computing power;
when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.

Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.

There are many types of resampling methods.
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sequential Importance Sampling
The advantage of SIS is that it doesnt rely on the underlying Markov chain
Many iid replicates are run to create an importance sampler whichconsequently improves the efficiency
The disadvantage of SIS is that the importance weights may have largevariances resulting in inaccurate estimate
The variance of the importance weights increases over time weightdegeneracy problem after a few iterations of algorithm only few or one ofW (x(i)) will be nonzero
We will see now that in order to cope with this situation resampling step issuggested to be used after weight normalization
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 38 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^N \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0})
By this approximation, we can estimate the mean of a nonlinear function:

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n | y_{n:0}) dx_n
          = \frac{1}{N} \sum_{i=1}^N \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
          = \frac{1}{N} \sum_{i=1}^N f(x_n^{(i)}) \equiv \hat{f}_N(x)
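In code, the estimator \hat{f}_N(x) is just the sample mean of f over the particles; the posterior stand-in below is an assumption for illustration.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)  # stand-in for i.i.d. draws x_n^(i) from the posterior

f = lambda x: np.sin(x) ** 2                           # any nonlinear test function
f_hat = np.mean(f(samples))                            # (1/N) sum_i f(x_n^(i))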
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n | y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0} | x_n) p(x_n)}{q(x_n | y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n | y_{n:0})}{q(x_n | y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n | y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int p(y_{n:0} | x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n | y_{n:0}) dx_n}{\int W_n(x_n) q(x_n | y_{n:0}) dx_n}
          = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}
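A numerical sketch of this self-normalized importance-sampling estimator; the standard-normal target (known only up to a constant) and the wider Gaussian proposal are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Proposal q = N(0, 2^2); target p is N(0, 1), but only its unnormalized form is used
x = rng.normal(0.0, 2.0, size=N)
q_pdf = np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
p_unnorm = np.exp(-0.5 * x ** 2)

W = p_unnorm / q_pdf                       # unnormalized weights W(x)
W_tilde = W / W.sum()                      # normalized weights

f = x ** 2
estimate = np.sum(W_tilde * f)             # self-normalized estimate of E_p[x^2], close to 1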
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^N W_n(x_n^{(i)})} = \sum_{i=1}^N \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^N W_n(x_n^{(j)})}
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^n q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \frac{p(y_n | x_n) p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then derive a recursive rule to update the weights:

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
         \propto \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)} | y_{n-1:0})}
         = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}
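A one-step sketch of this recursive weight update; the Gaussian transition, likelihood, and proposal below are placeholder model choices, not prescribed by the derivation.

import numpy as np

rng = np.random.default_rng(0)
N = 500

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

transition_pdf = lambda x_new, x_old: gauss_pdf(x_new, x_old, 1.0)   # p(x_n | x_{n-1})
likelihood = lambda y, x: gauss_pdf(y, x, 0.5)                       # p(y_n | x_n)
proposal_pdf = lambda x_new, x_old: gauss_pdf(x_new, x_old, 1.0)     # q(x_n | x_{n-1:0}, y_{n:0})

x_prev = rng.normal(0.0, 1.0, size=N)      # particles x_{n-1}^(i)
W_prev = np.full(N, 1.0 / N)               # weights W_{n-1}^(i)
y_n = 0.3                                  # new observation

x_new = x_prev + rng.normal(0.0, 1.0, size=N)   # draw from the proposal
W_new = W_prev * likelihood(y_n, x_new) * transition_pdf(x_new, x_prev) / proposal_pdf(x_new, x_prev)
W_new /= W_new.sum()                       # normalized weights for time n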
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0})
Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step
A suggested measure of degeneracy is the so-called effective sample size:

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed exactly, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^N (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed
N_{eff} can also be used to combine rejection and importance sampling
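The approximate N_eff is a one-liner on the normalized weights; the threshold P = N/2 below is a common rule of thumb, not taken from the slides.

import numpy as np

def effective_sample_size(w_tilde: np.ndarray) -> float:
    # N_eff ~ 1 / sum_i w_i^2 for normalized weights
    return 1.0 / np.sum(w_tilde ** 2)

w = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
if effective_sample_size(w) < 0.5 * len(w):    # threshold P = N/2 (assumption)
    print("resample")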
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  end for
  for i = 1, ..., N do
    Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}^{(i)} = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
  end for
  Compute N_{eff} = 1 / \sum_{i=1}^N (\tilde{W}^{(i)})^2
  if N_{eff} < P then
    Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}^{(i)}
    Reset the weights W_n^{(i)} = 1/N
  end if
end for
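A compact runnable sketch of this algorithm, using the transition prior as proposal (so the proposal density cancels p(x_n | x_{n-1}) in the weight update); the 1-D linear-Gaussian model is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 50
P_thresh = N / 2                                   # resampling threshold P (assumption)

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed model: x_n = 0.9 x_{n-1} + v_n, y_n = x_n + w_n
true_x = np.zeros(T)
for n in range(1, T):
    true_x[n] = 0.9 * true_x[n - 1] + rng.normal(0.0, 1.0)
ys = true_x + rng.normal(0.0, 0.5, size=T)

particles = rng.normal(0.0, 1.0, size=N)
weights = np.full(N, 1.0 / N)
estimates = np.zeros(T)

for n in range(T):
    particles = 0.9 * particles + rng.normal(0.0, 1.0, size=N)   # draw from transition prior
    weights *= gauss_pdf(ys[n], particles, 0.5)                  # W_n ~ W_{n-1} p(y_n | x_n)
    weights /= weights.sum()
    estimates[n] = np.sum(weights * particles)                   # posterior-mean estimate

    if 1.0 / np.sum(weights ** 2) < P_thresh:                    # N_eff below threshold
        idx = rng.choice(N, size=N, p=weights)                   # resample with replacement
        particles = particles[idx]
        weights = np.full(N, 1.0 / N)                            # reset the weights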
BootstrapSIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample x_0^{(i)} ~ p(x_0)
  Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
  end for
  Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
  for i = 1, ..., N do
    Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
  end for
  for i = 1, ..., N do
    Normalize the weights \tilde{W}^{(i)} = W_n^{(i)} / \sum_{j=1}^N W_n^{(j)}
  end for
  Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to the weights \tilde{W}^{(i)}
end for
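The resampling step can be implemented in several ways; this sketch shows systematic resampling, a common low-variance scheme (the slide itself does not prescribe which one to use).

import numpy as np

def systematic_resample(w_tilde: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Return N indices with P(index = k) proportional to w_tilde[k], using one uniform draw
    N = len(w_tilde)
    positions = (rng.random() + np.arange(N)) / N    # one stratified position per particle
    cumulative = np.cumsum(w_tilde)
    cumulative[-1] = 1.0                             # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.1, 0.1])
idx = systematic_resample(w, rng)                    # then: particles = particles[idx]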
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03)
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used
To alleviate sample degeneracy in the SIS filter, we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sampling Importance Resampling (SIR)
The idea is to evaluate the properties of an estimator through the empiricalcumulative distribution function (cdf) of the samples instead of the true cdf
The resampling step is aimed to eliminate the samples with smallimportance weights and duplicate the samples with big weights
Sample N random samples x(i)Ni=1 from q(x)
for i = 1 middot middot middot N do
W (i) prop p(x(i))
q(x(i))
end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResample with replacement N times from the discrete set x(i)N
i=1 where the probability of
resampling from each x(i) is proportional to W (x(i))
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 39 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure of degeneracy is the so-called effective sample size:
$$N_{\text{eff}} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$$
In practice this cannot be computed, so we approximate it by
$$N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^{N} \left(\tilde{W}(x_{0:n}^{(i)})\right)^2}$$
When $N_{\text{eff}}$ falls below a threshold $P$, resampling is performed.
$N_{\text{eff}}$ can also be used to combine rejection and importance sampling.
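A one-function sketch of the $N_{\text{eff}}$ approximation, assuming the input weights are already normalized:

```python
import numpy as np

# N_eff ≈ 1 / sum_i W_i^2 for normalized weights W: equals N for uniform
# weights and approaches 1 when a single particle carries all the weight.

def effective_sample_size(W_tilde: np.ndarray) -> float:
    return 1.0 / np.sum(W_tilde ** 2)

print(effective_sample_size(np.array([0.25, 0.25, 0.25, 0.25])))   # 4.0
print(effective_sample_size(np.array([0.97, 0.01, 0.01, 0.01])))   # ~1.06
```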
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
        Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    end for
    for i = 1, ..., N do
        Calculate the weight $W_n^{(i)} = W_{n-1}^{(i)}\,\dfrac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
    end for
    Compute $N_{\text{eff}} = 1 / \sum_{i=1}^{N} (\tilde{W}_n^{(i)})^2$
    if $N_{\text{eff}} < P$ then
        Generate new paths $x_{0:n}^{(j)}$ by resampling with replacement $N$ times from $\{x_{0:n}^{(i)}\}$ with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
        Reset the weights: $W_n^{(i)} = 1/N$
    end if
end for
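The following is a runnable sketch of this algorithm on a toy one-dimensional linear-Gaussian model; the model, the noise levels, and the threshold $P = N/2$ are illustrative assumptions, and the transition prior is used as the proposal so the weight update reduces to the likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 500                      # time steps, particles
sigma_x, sigma_y = 0.5, 0.7         # process / measurement noise std devs

# Simulate a ground-truth trajectory x_n = 0.9 x_{n-1} + noise and observations.
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + sigma_x * rng.standard_normal()
y = x_true + sigma_y * rng.standard_normal(T)

particles = rng.normal(0.0, 1.0, N)   # samples from the prior p(x_0)
W = np.full(N, 1.0 / N)
estimates = np.zeros(T)

for n in range(T):
    if n > 0:   # propagate through the transition prior (the proposal here)
        particles = 0.9 * particles + sigma_x * rng.standard_normal(N)
    # Weight update: W_n ∝ W_{n-1} * p(y_n | x_n)
    W *= np.exp(-0.5 * ((y[n] - particles) / sigma_y) ** 2)
    W /= W.sum()                       # normalize
    estimates[n] = np.sum(W * particles)
    # Resample only when the effective sample size drops below P = N/2.
    if 1.0 / np.sum(W ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=W)   # multinomial resampling
        particles = particles[idx]
        W = np.full(N, 1.0 / N)            # reset the weights

print(np.mean((estimates - x_true) ** 2))  # filtering MSE on this run
```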
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce a resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample $x_0^{(i)} \sim p(x_0)$
    Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
    end for
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
    for i = 1, ..., N do
        Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
    end for
    for i = 1, ..., N do
        Normalize the weights: $\tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
    end for
    Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.
To alleviate sample degeneracy in the SIS filter, we can change the weight update to
$$W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$$
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sampling Importance Resampling
Resampling can be performed at every step or only when regarded as necessary.
Deterministic resampling is performed every $k$ time steps (usually $k = 1$). Dynamic resampling is performed only when the variance of the importance weights exceeds a threshold.
The particles and associated importance weights $\{x^{(i)}, W^{(i)}\}$ are replaced by new samples with equal importance weights (i.e., $W^{(i)} = 1/N$).
Resampling is important because:
if importance weights are unevenly distributed, propagating the trivial weights through the dynamic system wastes computing power; when the importance weights are skewed, resampling provides a chance to select important samples and rejuvenate the sampler.
Resampling does not necessarily improve the current state estimate, because it also introduces extra Monte Carlo variation.
There are many types of resampling methods; one common choice is sketched below.
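For instance, systematic resampling, a common low-variance scheme, can be sketched as follows (the function interface is an assumption):

```python
import numpy as np

# Systematic resampling: one uniform offset, then N evenly spaced positions
# swept through the cumulative weights; returns indices of surviving particles.

def systematic_resample(W_tilde: np.ndarray, rng) -> np.ndarray:
    N = len(W_tilde)
    positions = (rng.random() + np.arange(N)) / N   # N ordered points in [0, 1)
    return np.searchsorted(np.cumsum(W_tilde), positions)

rng = np.random.default_rng(0)
print(systematic_resample(np.array([0.1, 0.2, 0.6, 0.1]), rng))
# indices are drawn roughly in proportion to the weights, e.g. [1 2 2 3]
```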
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
It's a particular type of Markov Chain Monte Carlo (MCMC) sampling.
The Gibbs sampler uses the concept of alternating (marginal) conditional sampling.
Given an $N_x$-dimensional state vector $x = [x_1, x_2, \cdots, x_{N_x}]^T$, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample.
Since the conditional densities to be sampled are low dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.
Draw a sample $x_0 \sim p(x_0)$
for n = 1 to M do
    for i = 1 to $N_x$ do
        Draw a sample $x_{i,n} \sim p(x_i|x_{1,n}, \cdots, x_{i-1,n}, x_{i+1,n-1}, \cdots, x_{N_x,n-1})$
    end for
end for
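A minimal runnable sketch of this sampler for a two-dimensional Gaussian with correlation $\rho$, where both full conditionals are themselves Gaussian; the target distribution is an illustrative choice:

```python
import numpy as np

# Gibbs sampling for a 2-D Gaussian with correlation rho: each full
# conditional is Gaussian, so every component draw is exact.

rng = np.random.default_rng(0)
rho, M = 0.8, 5000
x1, x2 = 0.0, 0.0
samples = np.empty((M, 2))

for n in range(M):
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))   # x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))   # x2 | x1
    samples[n] = (x1, x2)

print(samples.mean(axis=0))           # ≈ (0, 0)
print(np.corrcoef(samples.T)[0, 1])   # ≈ rho
```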
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03). Left: starting from state $x_n$, $x_1$ is sampled from the conditional pdf $p(x_1|x_{2,n-1})$. Middle: a sample is drawn from the conditional pdf $p(x_2|x_{1,n})$. Right: four-step iterations in the probability space (contour).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling: distribute the samples evenly (or unevenly, according to their respective variance) among the subregions dividing the whole space.
Stratified sampling works very well and is efficient in spaces of not-too-high dimension.
Hybrid Monte Carlo: a Metropolis method which uses gradient information to reduce random-walk behavior.
This is useful since the gradient direction might indicate the way to states with higher probability.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte Carlo sampling approximation (i.e., the particle filter)
Gaussian/Laplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
Gauss/Laplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior is Gaussian, the nonlinear filtering can be handled with the EKF.
The Laplace approximation approximates the integral of a function $\int f(x)\,dx$ by fitting a Gaussian at the maximum $\hat{x}$ of $f(x)$ and computing the volume:
$$\int f(x)\,dx \approx (2\pi)^{N_x/2}\,f(\hat{x})\,\left|-\nabla\nabla \log f(\hat{x})\right|^{-1/2}$$
The covariance of the fitted Gaussian is determined by the Hessian matrix of $\log f(x)$ at $\hat{x}$.
It is also used to approximate the posterior distribution with a Gaussian centered at the MAP estimate.
It works for unimodal distributions but produces a poor approximation for multimodal distributions, especially in high-dimensional spaces.
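A rough one-dimensional sketch of the Laplace approximation above, using a grid search for the maximum and a finite-difference second derivative; the integrand is an illustrative unimodal choice:

```python
import numpy as np

# 1-D Laplace approximation: fit a Gaussian at the maximum x_hat of f and use
# int f(x) dx ≈ (2*pi)^{1/2} * f(x_hat) * |-(log f)''(x_hat)|^{-1/2}.

def f(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) * (1.0 + 0.1 * np.cos(x))

grid = np.linspace(-10.0, 10.0, 200001)
x_hat = grid[np.argmax(f(grid))]        # crude grid search for the mode

h = 1e-4                                # finite-difference second derivative
log_f = lambda x: np.log(f(x))
d2 = (log_f(x_hat + h) - 2.0 * log_f(x_hat) + log_f(x_hat - h)) / h ** 2

laplace = np.sqrt(2.0 * np.pi) * f(x_hat) * (-d2) ** -0.5
riemann = f(grid).sum() * (grid[1] - grid[0])   # brute-force reference value
print(laplace, riemann)                 # close for this unimodal f
```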
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
A numerical approximation method that was widely used in computer graphics and physics.
A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:
$$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k\,f(x_k)$$
where $p(x)$ is treated as a weighting function and $x_k$ is the $k$-th quadrature point.
The points $x_k$ are determined by the weighting function $p(x)$ on the interval $[a, b]$.
This method can produce a good approximation if the nonlinear function issmooth
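As a concrete instance, Gauss-Hermite quadrature uses the weighting function $p(x) = e^{-x^2}$; the sketch below relies on numpy's hermgauss for the nodes and weights, with an illustrative integrand:

```python
import numpy as np

# Gauss-Hermite quadrature: int f(x) exp(-x^2) dx ≈ sum_k c_k f(x_k),
# with nodes x_k and weights c_k supplied by numpy. f = cos is a smooth
# test integrand with a known closed-form answer.

x_k, c_k = np.polynomial.hermite.hermgauss(20)   # 20-point rule
approx = np.sum(c_k * np.cos(x_k))

exact = np.sqrt(np.pi) * np.exp(-0.25)  # int cos(x) e^{-x^2} dx = sqrt(pi) e^{-1/4}
print(approx, exact)                    # agree to many digits for smooth f
```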
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Multi-grid Method and Point-Mass Approximation
If the state is discrete and finite (or can be discretized and approximated as finite), grid-based methods provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.
If the state space is continuous, we can always discretize it into $N_z$ discrete cell states, and a grid-based method can then be used to approximate the posterior density.
The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so no extra resolution can be given to regions where the density is high.
The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses which carry the information about the data.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.
We can empirically use the sample moments to approximate the true moments, namely
$$m_k = E[x^k] = \int_{\mathcal{X}} x^k\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^k$$
where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
$$p(x) = \sum_{j=1}^{m} c_j\,\mathcal{N}(x;\,\mu_j, \Sigma_j)$$
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to arbitrary accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
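A tiny sketch of the Gaussian-sum idea for a scalar density; the mixture weights, means, and standard deviations are illustrative values:

```python
import numpy as np

# A scalar Gaussian mixture p(x) = sum_j c_j N(x; mu_j, sigma_j^2) with
# illustrative components; the weights are positive and sum to one.

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

c = np.array([0.3, 0.7])          # mixture weights
mu = np.array([-2.0, 1.5])        # component means
sigma = np.array([0.8, 1.2])      # component standard deviations

def p(x):
    return sum(cj * gaussian(x, mj, sj) for cj, mj, sj in zip(c, mu, sigma))

x = np.linspace(-8.0, 8.0, 16001)
print(p(x).sum() * (x[1] - x[0]))   # ≈ 1: the mixture integrates to one
```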
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, and the particles are placed according to some probability measure: the higher the probability, the more densely the particles are concentrated.
The particle system evolves over time according to the state equation, with its evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
$$p(x_n|y_{0:n}) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$$
where the $x_n^{(i)}$ are assumed to be drawn i.i.d. from $p(x_n|y_{0:n})$.
With this approximation, we can estimate the mean of a nonlinear function:
$$\begin{aligned}
E[f(x_n)] &\approx \int f(x_n)\,\hat{p}(x_n|y_{0:n})\,dx_n\\
&= \frac{1}{N}\sum_{i=1}^{N} \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n\\
&= \frac{1}{N}\sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
\end{aligned}$$
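As a closing illustration, this plain Monte Carlo estimate amounts to a sample average; the stand-in posterior and test function below are illustrative assumptions:

```python
import numpy as np

# Plain Monte Carlo: with i.i.d. draws from the (stand-in) posterior, E[f(x_n)]
# is just the sample average of f.

rng = np.random.default_rng(0)
samples = rng.normal(1.0, 0.5, size=100_000)   # pretend draws from p(x_n|y_{0:n})
print(np.mean(np.sin(samples)))                # ≈ sin(1) * exp(-0.125) ≈ 0.743
```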
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Remarks on Sampling Importance Resampling
Resampling can be taken at every step or only taken if regarded necessary
Deterministic resampling is taken at every k time step (usuallyk = 1)Dynamic resampling is taken only when the variance of theimportance weights is over the threshold
The particles and associated importance weights x(i) W (i) are replaced bythe new samples with equal importance weights (ie W (i) = 1N)
Resampling is important because
if importance weights are uneven distributed propagating the trivialweights through the dynamic system is a waste of computing powerwhen the importance weights are skewed resampling can providechances for selecting important samples and rejuvenate the sampler
Resampling does not necessarily improve the current state estimate becauseit also introduces extra Monte Carlo variation
There are many types of resampling methods
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 40 69
Gibbs sampling
Itrsquos a particular type of Markov Chain Monte Carlo (MCMC) sampling
The Gibbs sampler uses the concept of alternating (marginal) conditionalsampling
Given an Nx -dimensional state vector x = [x1 x2 middot middot middot xNx ]T we areinterested in drawing the samples from the marginal density in the casewhere joint density is inaccessible or hard to sample
Since the conditional density to be sampled is low dimensional the Gibbssampler is a nice solution to estimation of hierarchical or structuredprobabilistic model
Draw a sample from x0 sim p(x0)for n = 1 to M do
for i = 1 to Nx doDraw a sample xin sim p(xn|x1n middot middot middot ximinus1n xinminus1 middot middot middot xNx nminus1)
end forend for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 41 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
\[
p(x) = \sum_{j=1}^{m} c_j\, \mathcal{N}(x;\, \hat{x}_j, \Sigma_j)
\]
where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.
Any non-Gaussian density can be approximated to some degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
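A minimal sketch of such a Gaussian sum (the weights, means, and variances are illustrative), including the overall mean and variance obtained from the individual first and second order moments:

import numpy as np
from scipy.stats import norm

# p(x) = sum_j c_j N(x; mu_j, sigma_j^2), with c_j > 0 and sum_j c_j = 1
c     = np.array([0.3, 0.7])
mu    = np.array([-2.0, 1.5])
sigma = np.array([0.5, 1.0])

def p(x):
    return np.sum(c * norm.pdf(np.asarray(x)[..., None], mu, sigma), axis=-1)

mean = np.sum(c * mu)                           # overall first moment
var = np.sum(c * (sigma**2 + mu**2)) - mean**2  # overall second central moment
print(p(np.linspace(-4.0, 4.0, 5)), mean, var)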
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next?
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use it!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation; it is also called the bootstrap filter.
The state space is partitioned into many parts, in which the particles are placed according to some probability measure: the higher the probability, the more densely the particles are concentrated.
The particle system evolves over time according to the state equation, with its evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.
Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:
\[
p(x_n \mid y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta\!\left(x_n - x_n^{(i)}\right) \equiv \hat{p}(x_n \mid y_{0:n})
\]
where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n \mid y_{0:n})$.
By this approximation, we can estimate the mean of a nonlinear function:
\begin{align*}
E[f(x_n)] &\approx \int f(x_n)\, \hat{p}(x_n \mid y_{0:n})\, dx_n \\
&= \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\, \delta\!\left(x_n - x_n^{(i)}\right) dx_n \\
&= \frac{1}{N} \sum_{i=1}^{N} f\!\left(x_n^{(i)}\right) \equiv \hat{f}_N(x)
\end{align*}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
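A minimal numerical sketch of this estimator (the "posterior" here is a stand-in Gaussian, an illustrative assumption, since all the estimator needs are samples):

import numpy as np

# Monte Carlo estimate of E[f(x_n)]: with x^(i) iid from the posterior, the
# delta-mixture approximation turns the integral into a sample average.
rng = np.random.default_rng(0)
samples = rng.normal(1.0, 2.0, 10_000)         # stand-in for x_n^(i) ~ p(x_n | y_{0:n})
f_hat = np.cos(samples).mean()                 # (1/N) sum_i f(x^(i))
print(f_hat, np.cos(1.0) * np.exp(-2.0))       # vs closed form E[cos X] = cos(mu) e^{-sigma^2/2}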
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from the so-called proposal distribution $q(x_n \mid y_{0:n})$. Let's define
\[
W_n(x_n) = \frac{p(y_{0:n} \mid x_n)\, p(x_n)}{q(x_n \mid y_{0:n})}
\]
We can then write
\begin{align*}
E[f(x_n)] &= \int f(x_n)\, \frac{p(x_n \mid y_{0:n})}{q(x_n \mid y_{0:n})}\, q(x_n \mid y_{0:n})\, dx_n \\
&= \int f(x_n)\, \frac{W_n(x_n)}{p(y_{0:n})}\, q(x_n \mid y_{0:n})\, dx_n \\
&= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{0:n})\, dx_n}{\int p(y_{0:n} \mid x_n)\, p(x_n)\, dx_n} \\
&= \frac{\int f(x_n)\, W_n(x_n)\, q(x_n \mid y_{0:n})\, dx_n}{\int W_n(x_n)\, q(x_n \mid y_{0:n})\, dx_n} \\
&= \frac{E_{q(x_n \mid y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n \mid y_{0:n})}[W_n(x_n)]}
\end{align*}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
\[
E[f(x_n)] = \frac{E_{q(x_n \mid y_{0:n})}[W_n(x_n)\, f(x_n)]}{E_{q(x_n \mid y_{0:n})}[W_n(x_n)]}
\]
By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n \mid y_{0:n})$, we can approximate
\[
E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})\, f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\, f(x_n^{(i)}) \equiv \hat{f}(x)
\]
where the normalized weights are defined as
\[
\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
\]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
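A minimal sketch of this self-normalized estimator for a static target (the target $\mathcal{N}(1,1)$, known only up to its normalizing constant, and the proposal $\mathcal{N}(0,4)$ are illustrative assumptions):

import numpy as np

# Self-normalized importance sampling: W ~ p_tilde/q, normalized as on the slide.
rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(0.0, 2.0, N)                    # x^(i) ~ q = N(0, 4)
p_tilde = np.exp(-0.5 * (x - 1.0)**2)          # unnormalized target N(1, 1)
q = np.exp(-0.5 * (x / 2.0)**2) / 2.0          # proposal density (common constants cancel)
W = p_tilde / q
W /= W.sum()                                   # normalized weights
print(np.sum(W * x))                           # ~1.0 = E_p[x]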
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:
\[
q(x_{0:n} \mid y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t \mid x_{0:t-1}, y_{0:t})
\]
As before, the posterior can be written as:
\[
p(x_{0:n} \mid y_{0:n}) = p(x_{0:n-1} \mid y_{0:n-1})\, \frac{p(y_n \mid x_n)\, p(x_n \mid x_{n-1})}{p(y_n \mid y_{0:n-1})}
\]
We can then create a recursive rule to update the weights
\begin{align*}
W_n^{(i)} &= \frac{p(x_{0:n}^{(i)} \mid y_{0:n})}{q(x_{0:n}^{(i)} \mid y_{0:n})} \propto \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})\, p(x_{0:n-1}^{(i)} \mid y_{0:n-1})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n})\, q(x_{0:n-1}^{(i)} \mid y_{0:n-1})} \\
&= W_{n-1}^{(i)}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n})}
\end{align*}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
Sampling-importance resampling (SIR) filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n \mid y_{0:n})$ than in $p(x_{0:n} \mid y_{0:n})$.
Let's assume that $q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)$; then we can write
\[
W_n^{(i)} = W_{n-1}^{(i)}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}
\]
The problem with the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure for degeneracy is the so-called effective sample size:
\[
N_{\mathrm{eff}} = \frac{N}{E_{q(\cdot \mid y_{0:n})}\!\left[(W(x_{0:n}))^2\right]} \le N
\]
In practice this cannot be computed exactly, so we approximate it by
\[
N_{\mathrm{eff}} \approx \frac{1}{\sum_{i=1}^{N} \left(\tilde{W}(x_{0:n}^{(i)})\right)^2}
\]
When $N_{\mathrm{eff}}$ falls below a threshold $P$, resampling is performed.
$N_{\mathrm{eff}}$ can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw a sample x_n^(i) ~ q(x_n | x_{0:n-1}^(i), y_{0:n})
        Set x_{0:n}^(i) = {x_{0:n-1}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate the weight W_n^(i) = W_{n-1}^(i) * p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{0:n-1}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
    end for
    Compute N_eff = 1 / sum_{i=1}^N (W~(x_{0:n}^(i)))^2
    if N_eff < P then
        Generate new x_{0:n}^(j) by resampling with replacement N times from {x_{0:n}^(i)},
        with probability P(x_{0:n}^(j) = x_{0:n}^(i)) = W~_n^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
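A minimal sketch of the N_eff test and the resampling step used above (the weight values and the threshold P = N/2 are illustrative assumptions):

import numpy as np

def effective_sample_size(W):
    # N_eff ~ 1 / sum_i W_i^2 for normalized weights W
    return 1.0 / np.sum(W**2)

def resample(particles, W, rng):
    # Draw N new particles with replacement, P(pick i) = W_i; weights reset to 1/N
    N = len(particles)
    idx = rng.choice(N, size=N, p=W)
    return particles[idx], np.full(N, 1.0 / N)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
if effective_sample_size(W) < 0.5 * len(x):    # threshold P = N/2 (a common choice)
    x, W = resample(x, W, rng)
print(x, W)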
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the highly important weights with many replicates of particles, thereby introducing high correlation between the particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x_{0:n}^(i) = {x_{0:n-1}^(i), x_n^(i)}
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize the weights: W~(x^(i)) = W(x^(i)) / sum_{j=1}^N W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W~_n^(i)
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
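Putting the pieces together, a minimal bootstrap (SIR) filter for a toy scalar model (the linear-Gaussian model, noise levels, and particle count are illustrative assumptions, chosen only so the sketch is self-contained):

import numpy as np

# Toy model: x_n = 0.9 x_{n-1} + w_n, w ~ N(0, 0.5^2); y_n = x_n + v_n, v ~ N(0, 1).
rng = np.random.default_rng(0)
N, T = 1000, 50

x_true, ys = 0.0, []
for _ in range(T):                             # simulate a trajectory and observations
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)
    ys.append(x_true + rng.normal(0.0, 1.0))

x = rng.normal(0.0, 1.0, N)                    # x_0^(i) ~ p(x_0), weights 1/N
for y in ys:
    x = 0.9 * x + rng.normal(0.0, 0.5, N)      # sample from transition prior p(x_n|x_{n-1})
    W = np.exp(-0.5 * (y - x)**2)              # weight update W_n^(i) = p(y_n|x_n^(i))
    W /= W.sum()                               # normalize
    x = x[rng.choice(N, size=N, p=W)]          # resample every step (SIR)
print(x.mean(), x_true)                        # posterior mean vs true final state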
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter, importance weights are calculated sequentially and resampling is performed only when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the weight update to
\[
W_n = \left(W_{n-1}\right)^{\alpha}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{0:n-1}^{(i)}, y_n)}
\]
where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Illustration of Gibbs sampling
Figure Gibbs sampling in a two-dimensional space (Chen 03) Left Startingfrom state xn x1 is sampled from the conditional pdf p(x1|x2nminus1) Middle Asample is drawn from the conditional pdf p(x2|x1n) Right Four-step iterationsin the probability space (contour)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 42 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Other sampling strategies
Stratified sampling distribute the samples evenly (or unevenlyaccording to their respective variance) to the subregions dividing thewhole space
Stratified sampling works very well and is efficient in a not-too-highdimension space
Hybrid Monte Carlo Metropolis method which uses gradientinformation to reduce random walk behavior
This is good since the gradient direction might indicate the way to findthe state with a higher probability
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 43 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computer graphics and physics.

A finite integral is approximated by a weighted sum of samples of the integrand, based on some quadrature formula:

$\int_a^b f(x)\,p(x)\,dx \approx \sum_{k=1}^{m} c_k f(x_k)$

where $p(x)$ is treated as a weighting function and $x_k$ is the quadrature point.

The values $x_k$ are determined by the weighting function $p(x)$ in the interval $[a, b]$.

This method can produce a good approximation if the nonlinear function is smooth.
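As a concrete quadrature instance, Gauss-Hermite quadrature chooses the points $x_k$ and coefficients $c_k$ for the weighting function $e^{-x^2}$. A small sketch using NumPy (the test function and the number of nodes are illustrative choices):

```python
import numpy as np

# Gauss-Hermite rule: integral of f(x) * exp(-x^2) dx ~= sum_k w_k f(x_k)
nodes, weights = np.polynomial.hermite.hermgauss(20)

# E[f(X)] for X ~ N(0, 1): the change of variables x = sqrt(2) * t gives
#   E[f(X)] = (1/sqrt(pi)) * sum_k w_k f(sqrt(2) * t_k)
f = np.cos
estimate = weights @ f(np.sqrt(2.0) * nodes) / np.sqrt(np.pi)

print(estimate, np.exp(-0.5))  # E[cos X] = exp(-1/2) for a standard normal
```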
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Multi-grid Method and Point-Mass Approximation

If the state is discrete and finite (or it can be discretized and approximated as finite), grid-based methods can provide a good solution and an optimal way to update the filtered density $p(x_n|y_{0:n})$.

If the state space is continuous, we can always discretize it into $N_z$ discrete cell states; a grid-based method can then be used to approximate the posterior density (a small sketch follows below).

The disadvantage of grid-based methods is that the state space cannot be partitioned unevenly, so no greater resolution can be given to regions where the density is high.

The point-mass method uses a simple rectangular grid. The density is assumed to be represented by a set of point masses, which carry the information about the data.
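A minimal sketch of a grid-based (point-mass) filter on a discretized one-dimensional state space; the toy dynamics, noise scales, and observations are our own illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# Toy model: x_n = 0.9 x_{n-1} + noise, y_n = x_n + noise.
Nz = 200
grid = np.linspace(-5, 5, Nz)          # Nz discrete cell states

# Transition matrix T[i, j] = p(x_n = grid[i] | x_{n-1} = grid[j])
T = norm.pdf(grid[:, None], loc=0.9 * grid[None, :], scale=0.5)
T /= T.sum(axis=0, keepdims=True)          # normalize each column

p = np.full(Nz, 1.0 / Nz)                  # uniform prior over the grid
for y in [0.3, 0.5, 1.1]:                  # a few made-up observations
    p = T @ p                              # predict: sum_j p(x|x') p(x')
    p *= norm.pdf(y, loc=grid, scale=0.7)  # update: multiply by likelihood
    p /= p.sum()                           # renormalize the point masses

print(grid[np.argmax(p)])                  # MAP cell after filtering
```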
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments (see the sketch below), namely

$m_k = E[x^k] = \int_X x^k p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k$

where $m_k$ denotes the $k$-th order moment and the $x^{(i)}$ are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
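For instance, a few sample moments of a standard normal (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)       # samples from the "true" p(x)

# Sample approximations m_k ~= (1/N) * sum_i (x_i)^k
for k in (1, 2, 3, 4):
    print(k, np.mean(x ** k))          # N(0,1) moments: 0, 1, 0, 3
```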
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model):

$p(x) = \sum_{j=1}^{m} c_j \, \mathcal{N}(x; \mu_j, \Sigma_j)$

where the weighting coefficients satisfy $c_j > 0$ and $\sum_{j=1}^{m} c_j = 1$.

Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments (sketched below).

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
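A small sketch of the closed-form first and second moments of a one-dimensional mixture (the component values are illustrative):

```python
import numpy as np

# Components of a 1D Gaussian mixture (illustrative values)
c = np.array([0.3, 0.7])         # weights c_j, sum to 1
mu = np.array([-1.0, 2.0])       # component means
var = np.array([0.5, 1.5])       # component variances

# First and second moments of the mixture are available in closed form:
mean = np.sum(c * mu)                    # E[x] = sum_j c_j mu_j
second = np.sum(c * (var + mu ** 2))     # E[x^2] = sum_j c_j (var_j + mu_j^2)
variance = second - mean ** 2
print(mean, variance)
```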
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next:
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which particles are placed according to some probability measure. The higher the probability, the denser the concentration of particles.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the Fokker-Planck-Kolmogorov (FPK) equation.

Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf.

However, since the posterior density is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of $N$ samples drawn from the posterior distribution:

$p(x_n|y_{0:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{0:n})$

where the $x_n^{(i)}$ are assumed to be i.i.d. samples drawn from $p(x_n|y_{0:n})$.

By this approximation, we can estimate the mean of a nonlinear function:

$E[f(x_n)] \approx \int f(x_n)\,\hat{p}(x_n|y_{0:n})\,dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n)\,\delta(x_n - x_n^{(i)})\,dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution $q(x_n|y_{0:n})$. Let's define

$W_n(x_n) = \frac{p(y_{0:n}|x_n)\,p(x_n)}{q(x_n|y_{0:n})}$

We can then write

$E[f(x_n)] = \int f(x_n)\,\frac{p(x_n|y_{0:n})}{q(x_n|y_{0:n})}\,q(x_n|y_{0:n})\,dx_n$

$\quad = \int f(x_n)\,\frac{W_n(x_n)}{p(y_{0:n})}\,q(x_n|y_{0:n})\,dx_n$

$\quad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int p(y_{0:n}|x_n)\,p(x_n)\,dx_n}$

$\quad = \frac{\int f(x_n)\,W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}{\int W_n(x_n)\,q(x_n|y_{0:n})\,dx_n}$

$\quad = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

$E[f(x_n)] = \frac{E_{q(x_n|y_{0:n})}[W_n(x_n)\,f(x_n)]}{E_{q(x_n|y_{0:n})}[W_n(x_n)]}$

By drawing i.i.d. samples $x_n^{(i)}$ from $q(x_n|y_{0:n})$, we can approximate

$E[f(x_n)] \approx \frac{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})\,f(x_n^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)})\,f(x_n^{(i)}) \equiv \hat{f}(x)$

where the normalized weights are defined as

$\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}$
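A minimal self-normalized importance sampling sketch of this estimator; the unnormalized target and the Gaussian proposal are illustrative choices of our own:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative setup: unnormalized target p(x) ~ exp(-x^2/2) * (1 + sin(3x)^2),
# proposal q = N(0, 2^2).
log_p = lambda x: -0.5 * x**2 + np.log1p(np.sin(3 * x) ** 2)
N = 100_000
x = rng.normal(0.0, 2.0, N)                 # draw samples from the proposal q

# Unnormalized importance weights W(x) = p(x) / q(x), in log space
log_w = log_p(x) - norm.logpdf(x, 0.0, 2.0)
w = np.exp(log_w - log_w.max())             # stabilize before exponentiating

f = lambda x: x**2
estimate = np.sum(w * f(x)) / np.sum(w)     # self-normalized estimate of E_p[f]
print(estimate)
```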
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes:

$q(x_{0:n}|y_{0:n}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{0:t-1}, y_{0:t})$

As before, the posterior can be written as

$p(x_{0:n}|y_{0:n}) = p(x_{0:n-1}|y_{0:n-1}) \, \frac{p(y_n|x_n)\,p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$

We can then create a recursive rule to update the weights:

$W_n^{(i)} = \frac{p(x_{0:n}^{(i)}|y_{0:n})}{q(x_{0:n}^{(i)}|y_{0:n})} \propto \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})\,p(x_{0:n-1}^{(i)}|y_{0:n-1})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})\,q(x_{0:n-1}^{(i)}|y_{0:n-1})} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n})}$
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling used, we have different types of filters:

Sequential importance sampling (SIS) filter

Sampling importance resampling (SIR) filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate $p(x_n|y_{0:n})$ than in $p(x_{0:n}|y_{0:n})$.

Let's assume that $q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_{0:n}) = q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)$; then we can write

$W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size:

$N_{eff} = \frac{N}{E_{q(\cdot|y_{0:n})}[(W(x_{0:n}))^2]} \le N$

In practice this cannot be computed, so we approximate it as (sketched below)

$\hat{N}_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{0:n}^{(i)}))^2}$

When $\hat{N}_{eff}$ is below a threshold $P$, resampling is performed.

$N_{eff}$ can also be used to combine rejection and importance sampling.
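The $\hat{N}_{eff}$ approximation is a one-liner; a small sketch:

```python
import numpy as np

def effective_sample_size(w):
    """N_eff estimate from normalized importance weights w (sums to 1)."""
    w = np.asarray(w) / np.sum(w)      # normalize defensively
    return 1.0 / np.sum(w ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(effective_sample_size(w))        # ~1.9 out of 5: heavily degenerate
```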
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
  for i = 1, ..., N do
    Draw sample $x_n^{(i)} \sim q(x_n|x_{0:n-1}^{(i)}, y_{0:n})$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Calculate weight $W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
  end for
  Compute $\hat{N}_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}_n^{(i)})^2$
  if $\hat{N}_{eff} < P$ then
    Generate new $x_{0:n}^{(j)}$ by resampling with replacement $N$ times from $\{x_{0:n}^{(i)}\}$ with probability $P(x_{0:n}^{(j)} = x_{0:n}^{(i)}) = \tilde{W}_n^{(i)}$
    Reset the weights $W_n^{(i)} = 1/N$
  end if
end for
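A minimal Python sketch of this algorithm. It is generic in the model, so the callables sample_prior, sample_proposal, and log_weight_update are placeholders the caller must supply (our own interface, not from the slides):

```python
import numpy as np

def sis_filter(y, n_particles, sample_prior, sample_proposal,
               log_weight_update, ess_threshold, rng=None):
    """Generic SIS particle filter with resampling (a sketch).

    sample_prior(rng, N)               -> (N, d) initial particles
    sample_proposal(rng, x_prev, y_n)  -> (N, d) new particles
    log_weight_update(x, x_prev, y_n)  -> (N,) log of
                                          p(y|x) p(x|x_prev) / q(x|x_prev, y)
    """
    rng = rng or np.random.default_rng()
    x = sample_prior(rng, n_particles)
    logw = np.full(n_particles, -np.log(n_particles))
    means = []
    for y_n in y:
        x_prev = x
        x = sample_proposal(rng, x_prev, y_n)
        logw = logw + log_weight_update(x, x_prev, y_n)
        w = np.exp(logw - logw.max()); w /= w.sum()    # normalized weights
        means.append(w @ x)                # estimate taken before resampling
        if 1.0 / np.sum(w ** 2) < ess_threshold:   # N_eff below threshold P
            idx = rng.choice(n_particles, n_particles, p=w)
            x = x[idx]                                 # resample survivors
            logw = np.full(n_particles, -np.log(n_particles))  # reset to 1/N
    return np.array(means)
```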
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
  Sample $x_0^{(i)} \sim p(x_0)$
  Compute $W_0^{(i)} = 1/N$
end for
for n = 0, ..., T do
  for i = 1, ..., N do
    Importance sampling: $x_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$
    Set $x_{0:n}^{(i)} = \{x_{0:n-1}^{(i)}, x_n^{(i)}\}$
  end for
  for i = 1, ..., N do
    Weight update: $W_n^{(i)} = p(y_n|x_n^{(i)})$
  end for
  for i = 1, ..., N do
    Normalize the weights $\tilde{W}_n^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}$
  end for
  Resampling: generate $N$ new particles $x_n^{(i)}$ from the set $\{x_n^{(i)}\}$ according to $\tilde{W}_n^{(i)}$
end for
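A concrete bootstrap-filter run on a toy linear-Gaussian model (all model parameters and the simulation setup are illustrative choices of our own; for this model a Kalman filter would be exact, which makes it a convenient sanity check):

```python
import numpy as np
from scipy.stats import norm

# Toy model: x_n = 0.9 x_{n-1} + v_n, v_n ~ N(0, 0.5^2)
#            y_n = x_n + e_n,        e_n ~ N(0, 1.0^2)
rng = np.random.default_rng(0)
T, N = 50, 1000

# Simulate a ground-truth trajectory and observations
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = 0.9 * x_true[n - 1] + 0.5 * rng.standard_normal()
y = x_true + rng.standard_normal(T)

x = rng.standard_normal(N)                       # particles from p(x_0)
estimates = []
for n in range(T):
    x = 0.9 * x + 0.5 * rng.standard_normal(N)   # propose from transition prior
    w = norm.pdf(y[n], loc=x, scale=1.0)         # weight = likelihood p(y_n|x_n)
    w /= w.sum()
    estimates.append(w @ x)                      # posterior-mean estimate
    x = x[rng.choice(N, N, p=w)]                 # resample every step (SIR)

print(np.mean((np.array(estimates) - x_true) ** 2))  # filtering MSE
```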
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure: Particle filter with importance sampling and resampling (Chen 03).
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed whenever needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution plays a crucial role in the final performance of the SIS and SIR filters.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to $1/N$; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

$W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha} \frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{0:n-1}^{(i)}, y_n)}$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
A popular particle filter: CONDENSATION

Figure: The CONDENSATION algorithm.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Numerical approximations
Monte-carlo sampling approximation (ie particle filter)
GaussianLaplace approximation
Iterative quadrature
Multi-grid method and point-mass approximation
Moment approximation
Gaussian sum approximation
Deterministic sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 44 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of the density, including the mean, covariance and higher order moments.
We can empirically use the sample moment to approximate the true moment, namely

m_k = E[x^k] = \int_{\mathcal{X}} x^k p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where m_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.
The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
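The sample-moment estimator is one line of code; a quick sketch (illustrative names, standard-normal test case):

```python
import numpy as np

def sample_moment(x, k):
    """k-th order sample moment (1/N) * sum_i (x^{(i)})^k of draws from p(x)."""
    return np.mean(np.asarray(x) ** k)

# Example: for N(0,1), m_2 -> 1 and m_4 -> 3 as N grows.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
print(sample_moment(x, 2), sample_moment(x, 4))
```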
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^{m} c_j \mathcal{N}(x; \mu_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.
Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.
A mixture of Gaussians admits a tractable solution by calculating the individual first and second order moments.
The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
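A small sketch of evaluating such a Gaussian mixture in 1-D (the weights, means, and scales below are illustrative choices):

```python
import numpy as np

def gaussian_mixture_pdf(x, c, mu, sigma):
    """Evaluate p(x) = sum_j c_j N(x; mu_j, sigma_j^2) on a 1-D grid."""
    x = np.asarray(x)[:, None]
    comps = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ c

# Example: a bimodal density built from two components (weights sum to 1).
c, mu, sigma = np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([0.5, 1.0])
xs = np.linspace(-4, 6, 500)
pdf = gaussian_mixture_pdf(xs, c, mu, sigma)
print((pdf * (xs[1] - xs[0])).sum())  # Riemann sum, integrates to ~1
```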
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution, (b) Gaussian approximation, (c) Gaussian sum approximation, (d) histogram approximation, (e) Riemannian sum (step function) approximation, (f) Monte Carlo sampling approximation.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
What's next:
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter: Sequential Monte Carlo estimation
Now we know how to do numerical approximations. Let's use them!
Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.
The state space is partitioned into many parts, and the particles are placed in them according to some probability measure: the higher the probability, the denser the particles are concentrated.
The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK equation.
Since the pdf can be approximated by the point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.
However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n|y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n|y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n|y_{n:0}).
By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \hat{p}(x_n|y_{n:0}) dx_n
           = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \delta(x_n - x_n^{(i)}) dx_n
           = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
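A quick numerical check of this estimator (the "posterior" below is an assumed toy density): the sample average of f over i.i.d. draws approaches E[f].

```python
import numpy as np

def mc_estimate(f, samples):
    """f_N = (1/N) * sum_i f(x^{(i)}) for i.i.d. samples x^{(i)} ~ p."""
    return np.mean(f(samples))

# Example: posterior taken to be N(1, 0.5^2); E[x^2] = mu^2 + var = 1.25.
rng = np.random.default_rng(0)
samples = rng.normal(1.0, 0.5, size=50_000)
print(mc_estimate(lambda x: x**2, samples))  # ~1.25
```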
Sequential Monte Carlo estimation II
Since it is usually impossible to sample from the true posterior, it is common to sample from a so-called proposal distribution q(x_n|y_{n:0}). Let's define

W_n(x_n) = \frac{p(y_{n:0}|x_n) p(x_n)}{q(x_n|y_{n:0})}

We can then write

E[f(x_n)] = \int f(x_n) \frac{p(x_n|y_{n:0})}{q(x_n|y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \int f(x_n) \frac{W_n(x_n)}{p(y_{n:0})} q(x_n|y_{n:0}) dx_n
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int p(y_{n:0}|x_n) p(x_n) dx_n}
          = \frac{\int f(x_n) W_n(x_n) q(x_n|y_{n:0}) dx_n}{\int W_n(x_n) q(x_n|y_{n:0}) dx_n}
          = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written

E[f(x_n)] = \frac{E_{q(x_n|y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n|y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n|y_{n:0}), we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})}
           = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
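Here is a hedged sketch of the resulting self-normalized importance sampling estimator (target, proposal, and names are illustrative assumptions); the log-weight shift is a standard numerical-stability trick not discussed on the slides:

```python
import numpy as np

def snis_estimate(f, samples, log_w):
    """Self-normalized importance sampling estimate of E[f(x)].

    samples : draws x^{(i)} from the proposal q
    log_w   : unnormalized log-weights log W(x^{(i)})
    """
    log_w = log_w - np.max(log_w)          # stabilize before exponentiating
    w = np.exp(log_w)
    w_tilde = w / np.sum(w)                # normalized weights
    return np.sum(w_tilde * f(samples))

# Example: target N(0,1) (unnormalized), proposal N(0,2^2); E[x^2] -> ~1.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)
log_w = -0.5 * x**2 - (-0.5 * (x / 2.0)**2 - np.log(2.0))
print(snis_estimate(lambda z: z**2, x, log_w))
```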
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes as

q(x_{n:0}|y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t|x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0}|y_{n:0}) = p(x_{n-1:0}|y_{n-1:0}) \frac{p(y_n|x_n) p(x_n|x_{n-1})}{p(y_n|y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)}|y_{n:0})}{q(x_{n:0}^{(i)}|y_{n:0})}
          \propto \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) p(x_{n-1:0}^{(i)}|y_{n-1:0})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) q(x_{n-1:0}^{(i)}|y_{n-1:0})}
          = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0})}
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
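In code, the recursion is a one-line accumulation; a sketch in log-space (an implementation choice for stability, with the model densities passed in as assumed callables):

```python
def update_log_weights(log_w_prev, x_new, x_prev, y, log_lik, log_trans, log_prop):
    """Recursive update in log-space:
    log W_n = log W_{n-1} + log p(y_n|x_n) + log p(x_n|x_{n-1}) - log q(x_n|x_{n-1}, y_n).
    All arguments may be vectorized over the N particles.
    """
    return (log_w_prev
            + log_lik(y, x_new)
            + log_trans(x_new, x_prev)
            - log_prop(x_new, x_prev, y))

# With the bootstrap proposal q = p(x_n|x_{n-1}), log_trans and log_prop cancel
# and the update reduces to adding the log-likelihood.
```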
Types of filters
Depending on the type of sampling used, we have different types of filters:
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(x_n|y_{n:0}) than in p(x_{n:0}|y_{n:0}).
Let's assume that q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.
After some iterations, only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.
A suggested measure for degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot|y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} falls below a threshold P, resampling is performed.
N_{eff} can also be used to combine rejection and importance sampling.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
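The effective-sample-size approximation is straightforward to compute from the normalized weights; a small sketch:

```python
import numpy as np

def effective_sample_size(w):
    """N_eff ~= 1 / sum_i w_tilde_i^2 for normalized weights w_tilde."""
    w = np.asarray(w, dtype=float)
    w_tilde = w / w.sum()                 # normalize defensively
    return 1.0 / np.sum(w_tilde ** 2)

# Uniform weights give N_eff = N; one dominant weight gives N_eff ~ 1.
print(effective_sample_size(np.ones(100)))          # 100.0
print(effective_sample_size([1.0] + [1e-9] * 99))   # ~1.0
```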
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n|x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)}) / q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}^{(i)})^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}} with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_n^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
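Below is a hedged Python rendering of this loop for a generic 1-D model (the interface and all names are our own placeholders, not code from the lecture): it tracks only the current particles rather than full trajectories, works in log-space, and resamples when N_eff drops below the threshold P.

```python
import numpy as np

def sis_filter_with_resampling(ys, n_particles, sample_prior, sample_prop,
                               log_lik, log_trans, log_prop, threshold, rng):
    """SIS particle filter with occasional resampling (1-D sketch).

    sample_prop(x_prev, y, rng) -> proposed particles x_n^{(i)}
    log_lik(y, x), log_trans(x_new, x_prev), log_prop(x_new, x_prev, y)
    threshold : resample when N_eff < threshold (the 'P' of the slides)
    """
    x = sample_prior(n_particles, rng)
    log_w = np.full(n_particles, -np.log(n_particles))
    means = []
    for y in ys:
        x_new = sample_prop(x, y, rng)
        log_w = log_w + log_lik(y, x_new) + log_trans(x_new, x) - log_prop(x_new, x, y)
        w = np.exp(log_w - log_w.max())
        w_tilde = w / w.sum()
        means.append(np.sum(w_tilde * x_new))        # estimate before resampling
        n_eff = 1.0 / np.sum(w_tilde ** 2)
        if n_eff < threshold:                        # resample with replacement
            idx = rng.choice(n_particles, size=n_particles, p=w_tilde)
            x_new = x_new[idx]
            log_w = np.full(n_particles, -np.log(n_particles))
        else:
            log_w = np.log(w_tilde)                  # carry normalized weights
        x = x_new
    return np.array(means)
```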
Bootstrap/SIR filter
The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.
Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.
It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n|x_{n-1}^{(i)})
    end for
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n|x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}^{(i)} = W_n^{(i)} / \sum_{j=1}^{N} W_n^{(j)}
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
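For completeness, a minimal bootstrap/SIR sketch on a toy linear-Gaussian model (the model, parameters, and names are illustrative assumptions): with the transition prior as proposal, the weight update reduces to the likelihood, and resampling is performed at every step.

```python
import numpy as np

def bootstrap_filter(ys, n_particles, rng, a=0.9, q_std=0.5, r_std=0.5):
    """Bootstrap/SIR filter for the toy model x_n = a*x_{n-1} + v, y_n = x_n + w."""
    x = rng.normal(0.0, 1.0, n_particles)                 # x_0 ~ p(x_0)
    means = []
    for y in ys:
        x = a * x + rng.normal(0.0, q_std, n_particles)   # proposal = transition prior
        w = np.exp(-0.5 * ((y - x) / r_std) ** 2)         # weight = likelihood
        w /= w.sum()
        means.append(np.sum(w * x))
        x = x[rng.choice(n_particles, n_particles, p=w)]  # resample every step
    return np.array(means)

# Simulate data from the same model and run the filter.
rng = np.random.default_rng(1)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)
    ys.append(x_true + rng.normal(0.0, 0.5))
print(bootstrap_filter(np.array(ys), 1000, rng)[-5:])
```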
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, the resampling is always performed.
In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.
The choice of proposal distribution plays a crucial role in the final performance of SIS and SIR filters.
Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.
In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.
To alleviate the sample degeneracy in the SIS filter, we can change the update to

W_n = W_{n-1}^{\alpha} \frac{p(y_n|x_n^{(i)}) p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}

where 0 < α < 1 is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
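The annealed update is a one-liner in log-space; a sketch (illustrative name; α = 1 recovers the plain SIS update):

```python
def annealed_log_weight(log_w_prev, log_increment, alpha=0.5):
    """Annealed SIS update in log-space: log W_n = alpha * log W_{n-1} + log-increment.

    0 < alpha < 1 damps the previous weights; alpha = 1 is the standard update.
    """
    return alpha * log_w_prev + log_increment
```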
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
GaussLaplace approximation
Gaussian approximation is the simplest method to approximate thenumerical integration problem because of its analytic tractability
By assuming the posterior as Gaussian the nonlinear filtering can be takenwith the EKF method
Laplace approximation method is to approximate the integral of a functionintf (x)dx by fitting a Gaussian at the maximum x of f (x) and further
compute the volumeintf (x)dx asymp (2π)Nx2f (x)| minus 55 log f (x)|minus12
The covariance of the fitted Gaussian is determined by the Hessian matrix oflog f (x) at x
It is also used to approximate the posterior distribution with a Gaussiancentered a the MAP estimate
Works for the unimodal distributions but produces a poor approximationresult for multimodal distributions especially in high-dimensional spaces
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 45 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Iterative Quadrature
Numerical approximation method which was widely used in computergraphics and physics
A finite integral is approximated by a weighted sum of samples of theintegrand based on some quadrature formulaint b
a
f (x)p(x)dx asympmsum
k=1
ck f (xk )
where p(x) is treated as a weighting function and xk is the quadraturepoint
The values xk are determined by the weighting function p(x) in the interval[a b]
This method can produce a good approximation if the nonlinear function issmooth
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 46 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Muti-grid Method and Point-Mass Approximation
If the state is discrete and finite (or it can be discretized and approximatedas finite) grid-based methods can provide a good solution and optimal wayto update the filtered density p(xn|yn0)
If the state space is continuous we can always discretize the state space intoNz discrete cell states then a grid-based method can be further used toapproximate the posterior density
The disadvantage of grid-based method is that it requires the state spacecannot be partitioned unevenly to give a great resolution to the state withhigh density
In the point-mass method uses a simple rectangular grid The density isassumed to be represented by a set of point masses which carry theinformation about the data
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 47 69
Moment Approximation
Moment approximation is targeted at approximating the moments of thedensity including mean covariance and higher order moments
We can empirically use the sample moment to approximate the truemoment namely
mk = E [xk ] =
intX
xk p(x)dx =1
N
Nsumi=1
|x(i)|k
where mk denotes the k-th order moment and x(i) are the samples from truedistribution
The computation cost of these approaches are rather prohibitive especiallyin highdimensional space
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 48 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III

We have written

E[f(x_n)] = \frac{E_{q(x_n | y_{n:0})}[W_n(x_n) f(x_n)]}{E_{q(x_n | y_{n:0})}[W_n(x_n)]}

By drawing i.i.d. samples x_n^{(i)} from q(x_n | y_{n:0}) we can approximate

E[f(x_n)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)}) f(x_n^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} W_n(x_n^{(i)})} = \sum_{i=1}^{N} \tilde{W}(x_n^{(i)}) f(x_n^{(i)}) \equiv \hat{f}(x)

where the normalized weights are defined as

\tilde{W}(x_n^{(i)}) = \frac{W_n(x_n^{(i)})}{\sum_{j=1}^{N} W_n(x_n^{(j)})}
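As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this self-normalized importance-sampling estimator; the target p = N(0, 1), the proposal q = N(0, 2^2), and f(x) = x^2 are all assumed toy choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000

q_scale = 2.0
x = rng.normal(0.0, q_scale, size=N)          # draw x^(i) ~ q
w = norm.pdf(x) / norm.pdf(x, scale=q_scale)  # unnormalized weights W(x^(i)) = p(x)/q(x)

w_tilde = w / w.sum()                         # normalized weights
f_hat = np.sum(w_tilde * x**2)                # self-normalized estimate of E[f(x_n)]
print(f_hat)                                  # close to E[x^2] = 1 under p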
Sequential Monte Carlo estimation IV

Suppose now that the proposal distribution factorizes

q(x_{n:0} | y_{n:0}) = q(x_0) \prod_{t=1}^{n} q(x_t | x_{t-1:0}, y_{t:0})

As before, the posterior can be written as

p(x_{n:0} | y_{n:0}) = p(x_{n-1:0} | y_{n-1:0}) \, \frac{p(y_n | x_n) \, p(x_n | x_{n-1})}{p(y_n | y_{n-1:0})}

We can then create a recursive rule to update the weights

W_n^{(i)} = \frac{p(x_{n:0}^{(i)} | y_{n:0})}{q(x_{n:0}^{(i)} | y_{n:0})}
          \propto \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)}) \, p(x_{n-1:0}^{(i)} | y_{n-1:0})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) \, q(x_{n-1:0}^{(i)} | y_{n-1:0})}
          = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0})}
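In code, this one-step weight update is a single expression. Below is a sketch for one particle, assuming the caller supplies the three densities as callables (the names lik, trans, and prop are illustrative, not from the slides).

def update_weight(w_prev, x_n, x_prev, y_n, lik, trans, prop):
    """Recursive importance-weight update W_n = W_{n-1} * lik * trans / prop.

    lik(y_n, x_n)          -- likelihood p(y_n | x_n)
    trans(x_n, x_prev)     -- transition density p(x_n | x_{n-1})
    prop(x_n, x_prev, y_n) -- proposal density q(x_n | x_{n-1:0}, y_{n:0})
    """
    return w_prev * lik(y_n, x_n) * trans(x_n, x_prev) / prop(x_n, x_prev, y_n)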
Types of filters

Depending on the type of sampling used, we have different types of filters:

Sequential Importance Sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc.
Sequential Importance Sampling (SIS) filter I

We are more interested in the current filtered estimate p(x_n | y_{n:0}) than in p(x_{n:0} | y_{n:0}).

Let's assume that q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_{n:0}) = q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n); then we can write

W_n^{(i)} = W_{n-1}^{(i)} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

The problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases.

After some iterations only very few particles have non-zero importance weights. This is often called weight degeneracy or sample impoverishment.
Sequential Importance Sampling (SIS) filter II

A solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step.

A suggested measure of degeneracy is the so-called effective sample size

N_{eff} = \frac{N}{E_{q(\cdot | y_{n:0})}[(W(x_{n:0}))^2]} \le N

In practice this cannot be computed, so we approximate it as

N_{eff} \approx \frac{1}{\sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2}

When N_{eff} is below a threshold P, resampling is performed.

N_{eff} can also be used to combine rejection and importance sampling.
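A minimal sketch of this degeneracy check, assuming w holds the unnormalized weights of N particles:

import numpy as np

def effective_sample_size(w):
    """Approximate N_eff from unnormalized importance weights."""
    w_tilde = np.asarray(w, dtype=float)
    w_tilde = w_tilde / w_tilde.sum()   # normalized weights
    return 1.0 / np.sum(w_tilde ** 2)

w = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(effective_sample_size(w))         # about 1.9, far below N = 5: resample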
SIS particle filter with resampling

for n = 0, ..., T do
    for i = 1, ..., N do
        Draw samples x_n^{(i)} ~ q(x_n | x_{n-1:0}^{(i)}, y_{n:0})
        Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^{(i)} = W_{n-1}^{(i)} p(y_n | x_n^{(i)}) p(x_n^{(i)} | x_{n-1}^{(i)}) / q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Compute N_{eff} = 1 / \sum_{i=1}^{N} (\tilde{W}(x_{n:0}^{(i)}))^2
    if N_{eff} < P then
        Generate new x_{n:0}^{(j)} by resampling with replacement N times from {x_{n:0}^{(i)}},
        with probability P(x_{n:0}^{(j)} = x_{n:0}^{(i)}) = \tilde{W}_{n:0}^{(i)}
        Reset the weights: W_n^{(i)} = 1/N
    end if
end for
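The following Python sketch mirrors this pseudocode on a toy one-dimensional model; the model, the proposal that peeks at the current observation, and the threshold P = N/2 are all illustrative assumptions, not prescribed by the slides.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed toy model:  x_n = 0.9 x_{n-1} + v_n,  v_n ~ N(0, 1)      (transition prior)
#                     y_n = x_n + e_n,          e_n ~ N(0, 0.5^2)  (likelihood)
a, sig_v, sig_e, sig_q = 0.9, 1.0, 0.5, 0.8

def sis_step(x_prev, w_prev, y_n):
    """One SIS step with resampling; returns new particles and normalized weights."""
    N = x_prev.size
    mu_q = 0.5 * (a * x_prev + y_n)             # assumed proposal mean blending prior and y_n
    x = rng.normal(mu_q, sig_q)                 # draw x_n^(i) ~ q(x_n | x_{n-1}^(i), y_n)
    w = w_prev * norm.pdf(y_n, x, sig_e) \
               * norm.pdf(x, a * x_prev, sig_v) \
               / norm.pdf(x, mu_q, sig_q)       # weight update W_n = W_{n-1} lik * trans / prop
    w_tilde = w / w.sum()                       # normalize
    if 1.0 / np.sum(w_tilde ** 2) < N / 2:      # N_eff below threshold P = N/2: resample
        idx = rng.choice(N, size=N, p=w_tilde)  # resample with replacement
        x, w_tilde = x[idx], np.full(N, 1.0 / N)
    return x, w_tilde

T, N = 50, 500
x_true = np.zeros(T); y = np.zeros(T)
for n in range(T):                              # simulate data from the assumed model
    x_true[n] = a * (x_true[n - 1] if n else 0.0) + rng.normal(0, sig_v)
    y[n] = x_true[n] + rng.normal(0, sig_e)

x = rng.normal(0.0, 1.0, N)                     # x_0^(i) ~ p(x_0), assumed N(0, 1)
w = np.full(N, 1.0 / N)
for n in range(T):
    x, w = sis_step(x, w, y[n])
    print(n, round(np.sum(w * x), 3), round(x_true[n], 3))  # filtered mean vs. truth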
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
SIR filter using transition prior as proposal distribution

for i = 1, ..., N do
    Sample x_0^{(i)} ~ p(x_0)
    Compute W_0^{(i)} = 1/N
end for
for n = 0, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^{(i)} ~ p(x_n | x_{n-1}^{(i)})
    end for
    Set x_{n:0}^{(i)} = {x_{n-1:0}^{(i)}, x_n^{(i)}}
    for i = 1, ..., N do
        Weight update: W_n^{(i)} = p(y_n | x_n^{(i)})
    end for
    for i = 1, ..., N do
        Normalize the weights: \tilde{W}(x^{(i)}) = W(x^{(i)}) / \sum_{j=1}^{N} W(x^{(j)})
    end for
    Resampling: generate N new particles x_n^{(i)} from the set {x_n^{(i)}} according to \tilde{W}_n^{(i)}
end for
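A compact Python sketch of this bootstrap filter for the same assumed toy model as above; because the proposal is the transition prior, the transition and proposal densities cancel and the weight reduces to the likelihood. The filtered mean is computed before resampling, in line with the remarks below.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
a, sig_v, sig_e = 0.9, 1.0, 0.5                # assumed toy model, as before

def bootstrap_filter(y, N=500):
    x = rng.normal(0.0, 1.0, N)                # x_0^(i) ~ p(x_0), assumed N(0, 1)
    means = []
    for y_n in y:
        x = a * x + rng.normal(0.0, sig_v, N)  # propagate through the transition prior
        w = norm.pdf(y_n, x, sig_e)            # weight update: W_n^(i) = p(y_n | x_n^(i))
        w /= w.sum()                           # normalize
        means.append(np.sum(w * x))            # posterior mean, computed before resampling
        x = x[rng.choice(N, size=N, p=w)]      # resample every step
    return np.array(means)

y = np.array([0.3, 0.5, 1.1, 0.9, 1.4])        # toy observations (assumed)
print(bootstrap_filter(y))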
Illustration of a generic particle filter

Figure: Particle filter with importance sampling and resampling (Chen 03).
Remarks on SIS and SIR filters

In the SIR filter the resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the weight update to

W_n = W_{n-1}^{\alpha} \, \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{n-1:0}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights.
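As a sketch, the annealed update changes only the first factor of the weight recursion; alpha = 0.5 is an arbitrary assumed default, and the density arguments are passed in already evaluated.

def annealed_weight(w_prev, lik, trans, prop, alpha=0.5):
    """Annealed weight update: W_n = W_{n-1}^alpha * lik * trans / prop."""
    return (w_prev ** alpha) * lik * trans / prop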
Popular: CONDENSATION

Figure: The CONDENSATION algorithm.
Moment Approximation

Moment approximation is targeted at approximating the moments of the density, including the mean, covariance, and higher-order moments.

We can empirically use the sample moments to approximate the true moments, namely

\hat{m}_k = E[x^k] = \int_X x^k \, p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^k

where \hat{m}_k denotes the k-th order moment and the x^{(i)} are samples from the true distribution.

The computational cost of these approaches is rather prohibitive, especially in high-dimensional spaces.
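A quick sketch of the sample-moment approximation, checked against the known moments of a standard normal (an assumed test case):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 100_000)     # samples x^(i) from an assumed target N(0, 1)

for k in (1, 2, 3, 4):
    m_hat = np.mean(x ** k)           # \hat{m}_k = (1/N) sum_i (x^(i))^k
    print(k, round(m_hat, 3))         # N(0, 1) moments: 0, 1, 0, 3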
Gaussian Sum Approximation

Gaussian sum approximation uses a weighted sum of Gaussian densities to approximate the posterior density (the so-called Gaussian mixture model)

p(x) = \sum_{j=1}^{m} c_j \, \mathcal{N}(x; \hat{x}_j, \Sigma_j)

where the weighting coefficients satisfy c_j > 0 and \sum_{j=1}^{m} c_j = 1.

Any non-Gaussian density can be approximated to any degree of accuracy by a sufficiently large number of Gaussian mixture components.

A mixture of Gaussians admits a tractable solution by calculating the individual first- and second-order moments.

The Gaussian sum filter essentially uses this idea and runs a bank of EKFs in parallel to obtain a suboptimal estimate.
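A minimal sketch of evaluating such a mixture density in one dimension (the component weights, means, and standard deviations are arbitrary assumptions):

import numpy as np
from scipy.stats import norm

c  = np.array([0.3, 0.5, 0.2])        # c_j > 0, summing to 1
mu = np.array([-2.0, 0.0, 3.0])
sd = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    """p(x) = sum_j c_j N(x; mu_j, sd_j^2)."""
    x = np.atleast_1d(x)[:, None]     # broadcast query points against components
    return np.sum(c * norm.pdf(x, mu, sd), axis=1)

print(mixture_pdf([-2.0, 0.0, 3.0]))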
Illustration of numerical approximations

Figure: Illustration of non-Gaussian distribution approximation (Chen 03): (a) true distribution; (b) Gaussian approximation; (c) Gaussian sum approximation; (d) histogram approximation; (e) Riemannian sum (step function) approximation; (f) Monte Carlo sampling approximation.
What have we seen?

We have seen up to now:

Filtering equations
Monte Carlo sampling
Other numerical approximation methods

What's next?

Particle filters
Particle filter: Sequential Monte Carlo estimation

Now we know how to do numerical approximations. Let's use it!

Sequential Monte Carlo estimation is a type of recursive Bayesian filter based on Monte Carlo simulation. It is also called the bootstrap filter.

The state space is partitioned into many parts, in which the particles are placed according to some probability measure; the higher the probability, the more densely the particles are concentrated.

The particle system evolves over time according to the state equation, with an evolving pdf determined by the FPK (Fokker-Planck-Kolmogorov) equation.

Since the pdf can be approximated by a point-mass histogram, by random sampling of the state space we get a number of particles representing the evolving pdf.

However, since the posterior density model is unknown or hard to sample from, we would rather choose another distribution for the sake of efficient sampling.
Sequential Monte Carlo estimation I

The posterior distribution or density is empirically represented by a weighted sum of N samples drawn from the posterior distribution

p(x_n | y_{n:0}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_n - x_n^{(i)}) \equiv \hat{p}(x_n | y_{n:0})

where the x_n^{(i)} are assumed to be i.i.d. samples drawn from p(x_n | y_{n:0}).

By this approximation, we can estimate the mean of a nonlinear function

E[f(x_n)] \approx \int f(x_n) \, \hat{p}(x_n | y_{n:0}) \, dx_n = \frac{1}{N} \sum_{i=1}^{N} \int f(x_n) \, \delta(x_n - x_n^{(i)}) \, dx_n = \frac{1}{N} \sum_{i=1}^{N} f(x_n^{(i)}) \equiv \hat{f}_N(x)
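A two-line sketch of this plain Monte Carlo estimate, assuming we can sample the posterior directly (a stand-in N(1, 2^2)) and taking f(x) = sin(x):

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(1.0, 2.0, 50_000)   # pretend these are posterior samples x_n^(i)
print(np.mean(np.sin(x)))          # \hat{f}_N, the Monte Carlo estimate of E[sin(x_n)]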
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Gaussian Sum Approximation
Gaussian sum approximation uses a weighted sum of Gaussian densities toapproximate the posterior density (the so-called Gaussian mixture model)
p(x) =msum
j=1
cjN (xf Σf )
where the weighting coefficients cj gt 0 andsumm
j=1 cj = 1
Any non-Gaussian density can be approximated to some accurate degree bya sufficiently large number of Gaussian mixture densities
A mixture of Gaussians admits tractable solution by calculating individualfirst and second order moments
Gaussian sum filter essentially uses this idea and runs a bank of EKFs inparallel to obtain the suboptimal estimate
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 49 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Illustration of numerical approximations
Figure Illustration of non-Gaussian distribution approximation (Chen 03) (a) true distribution(b) Gaussian approximation (c) Gaussian sum approximation (d) histogram approximation (e)Riemannian sum (step function) approximation (f) Monte Carlo sampling approximation
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 50 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
What have we seen
We have seen up to now
Filtering equations
Monte Carlo sampling
Other numerical approximation methods
Whatrsquos next
Particle filters
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 51 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0, ..., T do
    for i = 1, ..., N do
        Draw sample x_n^(i) ~ q(x_n | x_{n-1:0}^(i), y_{n:0})
        Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)}
    end for
    for i = 1, ..., N do
        Calculate weights W_n^(i) = W_{n-1}^(i) · p(y_n | x_n^(i)) p(x_n^(i) | x_{n-1}^(i)) / q(x_n^(i) | x_{n-1:0}^(i), y_n)
    end for
    for i = 1, ..., N do
        Normalize the weights: W(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Compute N_eff = 1 / Σ_{i=1}^N (W(x_{n:0}^(i)))^2
    if N_eff < P then
        Generate new x_{n:0}^(j) by resampling with replacement N times from {x_{n:0}^(i)},
        with probability P(x_{n:0}^(j) = x_{n:0}^(i)) = W_{n:0}^(i)
        Reset the weights: W_n^(i) = 1/N
    end if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
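The resampling branch of the loop, sketched as a standalone Python helper; the multinomial scheme matches the "with replacement" step above, while the function name and packaging are ours.

    import numpy as np

    def resample(particles, W, rng):
        """Draw N indices with probabilities W (with replacement), keep the
        selected particles and reset every weight to 1/N."""
        N = len(particles)
        idx = rng.choice(N, size=N, p=W)
        return particles[idx], np.full(N, 1.0 / N)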
Bootstrap/SIR filter

The key idea of the SIR filter is to introduce the resampling step, as in SIR sampling.

Resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights.

It artificially conceals the impoverishment by replacing the high importance weights with many replicates of particles, thereby introducing high correlation between particles.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1, ..., N do
    Sample x_0^(i) ~ p(x_0)
    Compute W_0^(i) = 1/N
end for
for n = 1, ..., T do
    for i = 1, ..., N do
        Importance sampling: x_n^(i) ~ p(x_n | x_{n-1}^(i))
    end for
    Set x_{n:0}^(i) = {x_{n-1:0}^(i), x_n^(i)} for each i
    for i = 1, ..., N do
        Weight update: W_n^(i) = p(y_n | x_n^(i))
    end for
    for i = 1, ..., N do
        Normalize weights: W(x^(i)) = W(x^(i)) / Σ_{j=1}^N W(x^(j))
    end for
    Resampling: generate N new particles x_n^(i) from the set {x_n^(i)} according to W_n^(i)
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
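Putting the steps together, here is a runnable sketch of this bootstrap filter on a 1-D linear-Gaussian toy model; the model x_n = a·x_{n-1} + v_n, y_n = x_n + w_n and every constant below are illustrative assumptions (on this model the Kalman filter is exact, so it doubles as a sanity check).

    import numpy as np

    rng = np.random.default_rng(1)
    a, sig_v, sig_w = 0.9, 1.0, 0.5    # transition and observation noise scales
    T, N = 50, 1000                    # time steps and number of particles

    def gauss(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    # Simulate a state trajectory and its observations
    x_true = np.zeros(T)
    for n in range(1, T):
        x_true[n] = a * x_true[n - 1] + sig_v * rng.normal()
    y = x_true + sig_w * rng.normal(size=T)

    # Bootstrap filter: transition prior as proposal, resampling at every step
    x = rng.normal(0.0, 1.0, size=N)   # x_0^(i) ~ p(x_0)
    est = np.zeros(T)
    for n in range(1, T):
        x = a * x + sig_v * rng.normal(size=N)  # x_n^(i) ~ p(x_n | x_{n-1}^(i))
        W = gauss(y[n], x, sig_w)               # weight by p(y_n | x_n^(i))
        W = W / W.sum()
        est[n] = np.sum(W * x)                  # posterior mean, before resampling
        x = x[rng.choice(N, size=N, p=W)]       # resample with replacement

    print(np.mean((est[1:] - x_true[1:]) ** 2)) # mean squared tracking error

Note the posterior mean is computed before resampling, in line with the remarks two slides below.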
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter, resampling is always performed.

In the SIS filter, importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive.

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance.

Normally, the posterior estimate (and its relevant statistics) should be calculated before resampling.

In the resampling stage, the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used.

To alleviate the sample degeneracy in the SIS filter, we can change the update to

$$W_n^{(i)} = \left(W_{n-1}^{(i)}\right)^{\alpha}\,\frac{p(y_n|x_n^{(i)})\,p(x_n^{(i)}|x_{n-1}^{(i)})}{q(x_n^{(i)}|x_{n-1:0}^{(i)}, y_n)}$$

where $0 < \alpha < 1$ is an annealing factor that controls the impact of the previous importance weights.
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
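A sketch of that annealed update; the helper name and the convention that the caller passes the per-particle densities are our assumptions.

    import numpy as np

    def annealed_update(W_prev, lik, trans, prop, alpha=0.5):
        """Raising W_{n-1} to alpha in (0, 1) flattens the previous weights,
        so past dominance decays and degeneracy builds up more slowly."""
        W = (W_prev ** alpha) * lik * trans / prop
        return W / W.sum()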
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Particle filter Sequential Monte Carlo estimation
Now we now how to do numerical approximations Letrsquos use it
Sequential Monte Carlo estimation is a type of recursive Bayesian filterbased on Monte Carlo simulation It is also called bootstrap filter
The state space is partitioned as many parts in which the particles are filledaccording to some probability measure The higher probability the denserthe particles are concentrated
The particle system evolves along the time according to the state equationwith evolving pdf determined by the FPK equation
Since the pdf can be approximated by the point-mass histogram by randomsampling of the state space we get a number of particles representing theevolving pdf
However since the posterior density model is unknown or hard to sample wewould rather choose another distribution for the sake of efficient sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 52 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation I
The posterior distribution or density is empirically represented by a weightedsum of N samples drawn from the posterior distribution
p(xn|yn0) asymp 1
N
Nsumi=1
δ(xn minus x(i)n ) equiv p(xn|yn0)
where x(i)n are assumed to be iid drawn from p(xn|yn0)
By this approximation we can estimate the mean of a nonlinear function
E [f (xn)] asympint
f (xn)p(xn|yn0)dxn
=1
N
Nsumi=1
intf (xn)δ(xn minus x(i)
n )dxn
=1
N
Nsumi=1
f (x(i)n ) equiv fN (x)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 53 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed

In the SIS filter importance weights are calculated sequentially and resampling is only performed when needed, so the SIS filter is less computationally expensive

The choice of proposal distribution in SIS and SIR filters plays a crucial role in their final performance

Normally the posterior estimate (and its relevant statistics) should be calculated before resampling

In the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/N; more clever strategies can be used

To alleviate the sample degeneracy in the SIS filter we can change the update to

W_n^{(i)} = (W_{n-1}^{(i)})^{\alpha} \frac{p(y_n | x_n^{(i)}) \, p(x_n^{(i)} | x_{n-1}^{(i)})}{q(x_n^{(i)} | x_{0:n-1}^{(i)}, y_n)}

where 0 < \alpha < 1 is an annealing factor that controls the impact of the previous importance weights (a small sketch follows below)
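In the log domain the annealing is just a scalar on the previous log-weights. A small sketch; alpha = 0.8 is an arbitrary illustrative value and the helper name is ours:

    import numpy as np

    def annealed_log_update(log_w_prev, log_lik, log_trans, log_prop, alpha=0.8):
        # log of W_n = (W_{n-1})^alpha * p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{0:n-1}, y_n)
        return alpha * log_w_prev + log_lik + log_trans - log_prop

    # alpha < 1 flattens the influence of skewed historical weights:
    log_w_prev = np.log(np.array([0.90, 0.05, 0.05]))
    print(np.exp(annealed_log_update(log_w_prev, 0.0, 0.0, 0.0)))  # less skewed than input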
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure: CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation II
It is usually impossible to sample from the true posterior it is common tosample from the so-called proposal distribution q(xn|yn0) Letrsquos define
Wn(xn) =p(yn0|xn)p(xn)
q(xn|yn0)
We can then write
E [f (xn)] =
intf (xn)
p(xn|yn0)
q(xn|yn0)q(xn|yn0)dxn
=
intf (xn)
Wn(xn)
p(yn0)q(xn|yn0)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
p(yn0|xn)p(xn)dxn
=
intf (xn)Wn(xn)q(xn|yn0)dxnint
Wn(xn)q(xn|yn0)dxn
=Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 54 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation III
We have written
E [f (xn)] =Eq(xn|yn0)[Wn(xn)f (xn)]
Eq(xn|yn0)[Wn(xn)]
By drawing the iid samples x(i)n from q(xn|yn0) we can approximate
E [f (xn)] asymp1N
sumNi=1 Wn(x
(i)n )f (x
(i)n )
1N
sumNi=1 Wn(x
(i)n )
=Nsum
i=1
W (x(i)n )f (x(i)
n ) equiv f (x)
where the normalized weights are defined as
W (x(i)n ) =
Wn(x(i)n )sumN
i=1 Wn(x(i)n )
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 55 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Sequential Monte Carlo estimation IV
Suppose now that the proposal distribution factorizes
q(xn0|yn0) = q(x0)nprod
t=1
q(xt |xtminus10 yt0)
As before the posterior can be written as
p(xn0|yn0) = p(xnminus10|ynminus10)p(yn|xn)p(xn|ynminus10)
p(yn|ynminus10)
We can then create a recursive rule to update the weights
W (i)n =
p(x(i)n0|yn0)
q(x(i)n0|yn0)
propp(yn|x(i)
n )p(x(i)n |x(i)
nminus1)p(x(i)nminus10|ynminus10)
q(x(i)n |x(i)
nminus10 yn0)q(x(i)nminus10|ynminus10)
= W(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn0)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 56 69
Types of filters
Depending on the type of sampling use we have different types of filters
Sequential Importance sampling (SIS) filter
SIR filter
Auxiliary particle filter (APF)
Rejection particle filter
MCMC particle filter
etc
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 57 69
Sequential Importance sampling (SIS) filter I
We are more interested in the current filtered estimate p(xn|yn0) thanp(xn0|yn0)
Letrsquos assume that q(x(i)n |x(i)
nminus10 yn0) = q(x(i)n |x(i)
nminus10 yn) then we can write
W (i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
The problem of the SIS filter is that the distribution of the importanceweights becomes more and more skewed as time increases
After some iterations only very few particles have non-zero importanceweights This is often called weight degeneracy or sampleimpoverishment
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 58 69
Sequential Importance sampling (SIS) filter II
A solution is to multiply the particles with high normalized importanceweights and discard the particles with low normalized importance weightswhich can be be done in the resampling step
A suggested measure for degeneracy is the so-called effective sample size
Neff =N
Eq(middot|yn0)[(W (xn0))2]le N
In practice this cannot be computed so we approximate
Neff asymp1sumN
i=1(W (xn0))2
When Neff is below a threshold P then resampling is performed
Neff can be also used to combine rejection and importance sampling
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 59 69
SIS particle filter with resampling
for n = 0 middot middot middot T dofor i = 1 middot middot middot N do
Draw samples x(i)n sim q(xn|x(i)
nminus10 yn0)
Set x(i)n0 = x(i)
nminus10 x(i)n
end forfor i = 1 middot middot middot N do
Calculate weights W(i)n = W
(i)nminus1
p(yn|x(i)n )p(x
(i)n |x
(i)nminus1)
q(x(i)n |x
(i)nminus10yn)
end forfor i = 1 middot middot middot N do
Normalize the weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forCompute Neff = 1PN
i=1(W (xn0))2
if Neff lt P then
Generate new x(j)n by resampling with replacement N times from x(i)
n0 with
probability P(x(j)n0 = x
(i)n0) = W
(i)n0
Reset the weights W(i)n = 1
Nend if
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 60 69
BootstrapSIR filter
The key idea of SIR filter is to introduce the resampling step as in theSIR sampling
Resampling does not really prevent the weight degeneracy problem itjust saves further calculation time by discarding the particlesassociated with insignificant weights
It artificially concealing the impoverishment by replacing the highimportant weights with many replicates of particles therebyintroducing high correlation between particles
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 61 69
SIR filter using transition prior as proposal distribution
for i = 1 middot middot middot N do
Sample x(i)0 sim p(x0)
Compute W(i)0 = 1
Nend forfor n = 0 middot middot middot T do
for i = 1 middot middot middot N do
Importance sampling x(i)n sim p(xn|x(i)
nminus1)end forSet x
(i)n0 = x(i)
nminus10 x(i)n
for i = 1 middot middot middot N do
Weight update W(i)n = p(yn|x(i)
n )end forfor i = 1 middot middot middot N do
Normalize weights W (x(i)) = W (x(i))PNi=1 W (x(i))
end forResampling Generate N new particles x
(i)n from the set x(i)
n according to W(i)n
end for
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 62 69
Illustration of a generic particle filter
Figure Particle filter with importance sampling and resampling (Chen 03)
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 63 69
Remarks on SIS and SIR filters
In the SIR filter the resampling is always performed
In the SIS filter importance weights are calculated sequentially resamplingis only taken whenever needed SIS filter is less computationally expensive
The choice of proposal distributions in SIS and SIR filters plays an crucialrole in their final performance
Normally the posterior estimate (and its relevant statistics) should becalculated before resampling
In the resampling stage the new importance weights of the survivingparticles are not necessarily reset to 1N but rather more clever strategies
To alleviate the sample degeneracy in SIS filter we can change
Wn = W αnminus1
p(yn|x(i)n )p(x
(i)n |x(i)
nminus1)
q(x(i)n |x(i)
nminus10 yn)
where 0 lt α lt 1 is the annealing factor that controls the impact of previousimportance weights
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 64 69
Popular CONDENSATION
Figure CONDENSATION
Raquel Urtasun (TTI-C) Bayesian Filtering March 29 2010 65 69
Popular CONDENSATION
Figure Head tracking
Figure Leaf tracking
Popular CONDENSATION
Figure Hand tracking
Figure Hand drawing
Popular CONDENSATION
Figure Hand tracking
Figure Interactive applications
More
If you want to learn more, look at the additional material.
Otherwise, do the research project on this topic.
Next week we will cover human pose estimation.
Let's do some exercises now.