The Bayesian approach to inverse problems
Youssef Marzouk
Department of Aeronautics and Astronautics, Center for Computational Engineering
Massachusetts Institute of Technology
[email protected], http://uqgroup.mit.edu
ICERM IdeaLab, 7 July 2015
Statistical inference
Why is a statistical perspective useful in inverse problems?
To characterize uncertainty in the inverse solution
To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
To make probabilistic predictions
To choose “good” observations or experiments
To address questions of model error, model validity, and model selection
Bayesian inference
Bayes’ rule
p(θ | y) = p(y | θ) p(θ) / p(y)
Key idea: model parameters θ are treated as random variables
(For simplicity, we let our random variables have densities)
Notation
θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
p(θ) is the prior probability density
L(θ) ≡ p(y |θ) is the likelihood function
p(θ|y) is the posterior probability density
p(y) is the evidence, or equivalently, the marginal likelihood
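As a concrete illustration (not from the slides), Bayes' rule is easy to exercise numerically in one dimension: fix a prior and a likelihood, then compute the evidence by quadrature on a grid. All distributions and the single observation below are made up.

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)                     # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)   # p(theta): standard normal
y, sigma = 1.3, 0.5                                      # one made-up observation and noise std
likelihood = np.exp(-0.5 * ((y - theta) / sigma)**2)     # L(theta) = p(y | theta), up to a constant
evidence = np.sum(likelihood * prior) * dtheta           # p(y), up to the same constant
posterior = likelihood * prior / evidence                # p(theta | y); the constant cancels
```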
Bayesian inference
Summaries of the posterior distribution
What information to extract?
Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
Posterior covariance or higher moments of θ
Quantiles
Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α.
Credible intervals are not uniquely defined by the above; thus consider, for example, the HPD (highest posterior density) region.
Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations (see the sketch below)
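As an illustration (not from the slides), all of these summaries are straightforward to estimate once posterior realizations are available. Here, stand-in normal draws play the role of posterior samples; every number is made up.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=1.0, scale=0.3, size=100_000)   # stand-in posterior draws of theta

post_mean = samples.mean()                               # posterior mean
post_var = samples.var()                                 # posterior variance
q25, q50, q75 = np.percentile(samples, [25, 50, 75])     # posterior quantiles
lo, hi = np.percentile(samples, [2.5, 97.5])             # an equal-tailed 95% credible interval
```

The equal-tailed interval above is just one of many valid credible intervals; the HPD region is another.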
Bayesian and frequentist statistics
Understanding both perspectives is useful and important...
Key differences between these two statistical paradigms
Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods pθ(y) ≡ p(y | θ) but not priors p(θ) or posteriors; θ is not a random variable.
In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.
The estimator θ̂(y) is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.
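A quick simulation (illustrative, not from the slides) makes this concrete: each repetition of the experiment yields a different y, and hence a different value of the estimator, here taken to be the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0                                 # the fixed, non-random parameter
n_repeats, n_obs = 10_000, 20

# One estimate theta_hat(y) = mean(y) per repetition of the experiment
y = rng.normal(theta_true, 1.0, size=(n_repeats, n_obs))
estimates = y.mean(axis=1)
print(estimates.mean(), estimates.std())         # sampling distribution of theta_hat
```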
Bayesian and frequentist statistics
Key differences (continued)
Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
One common estimator is maximum likelihood (ML): θ̂_ML = argmax_θ p(y | θ). Here p(y | θ) defines a family of distributions indexed by θ.
Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood” (a sketch follows below).
What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?
Frequentist: “plug-in” or other estimators of y_new
Bayesian: posterior prediction via integration
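A minimal sketch of the MAP/penalized-likelihood link, assuming Gaussian noise and a zero-mean Gaussian prior; the toy forward model, datum, and standard deviations are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta, y, g, noise_std, prior_std):
    # -log p(theta | y) up to a constant: data misfit plus prior "penalty"
    misfit = 0.5 * np.sum((y - g(theta))**2) / noise_std**2
    penalty = 0.5 * np.sum(theta**2) / prior_std**2
    return misfit + penalty

g = lambda theta: theta**3 - theta                        # toy nonlinear forward model
y = np.array([0.8])                                       # made-up datum
res = minimize(neg_log_posterior, x0=np.array([1.0]), args=(y, g, 0.1, 1.0))
theta_map = res.x                                         # the MAP estimate
```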
Bayesian inference
Likelihood functions
In general, p(y | θ) is a probabilistic model for the data.
In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.
Selected examples of likelihood functions:
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise (a sketch of this case follows below)
3. Nonlinear forward model with noise + model discrepancy
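For example 2, here is a minimal sketch of the resulting log-likelihood, assuming additive i.i.d. Gaussian noise, y = g(θ) + ε with ε ~ N(0, σ²I); the forward model and data below are placeholders.

```python
import numpy as np

def log_likelihood(theta, y, g, noise_std):
    """log p(y | theta) for y = g(theta) + eps, eps ~ N(0, noise_std^2 I)."""
    resid = y - g(theta)
    n = resid.size
    return (-0.5 * np.sum(resid**2) / noise_std**2
            - n * np.log(noise_std)
            - 0.5 * n * np.log(2.0 * np.pi))

g = lambda theta: np.array([np.sum(theta**2), np.prod(theta)])  # placeholder forward model
print(log_likelihood(np.array([0.5, 1.0]), np.array([1.2, 0.4]), g, 0.1))
```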
Bayesian inference
Prior distributions
In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
Intuitive idea: assign lower probability to values of θ that you don't expect to see, and higher probability to values of θ that you do expect to see.
Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)
Gaussian process priors
Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω) : D × Ω → ℝ is multivariate normal.
In other words: θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.
Specify via mean function and covariance function
E[θ(x)] = µ(x)
E[(θ(x) − µ(x)) (θ(x′) − µ(x′))] = C(x, x′)
Smoothness of the process is controlled by the behavior of the covariance function as x′ → x.
Restrictions: stationarity, isotropy, ... (a sampling sketch follows below)
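A minimal sketch of this specification in practice (not from the slides): evaluate the covariance function on a grid and draw jointly Gaussian realizations through a Cholesky factor. The squared-exponential kernel and its parameters are illustrative choices.

```python
import numpy as np

def sq_exp_cov(x, xp, variance=1.0, length_scale=0.2):
    # C(x, x') for a squared-exponential kernel (one illustrative choice)
    return variance * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / length_scale**2)

x = np.linspace(0.0, 1.0, 200)                   # grid over the index set D
C = sq_exp_cov(x, x) + 1e-10 * np.eye(x.size)    # jitter for numerical positive-definiteness
L = np.linalg.cholesky(C)                        # C = L L^T
theta_draws = L @ np.random.randn(x.size, 3)     # three zero-mean GP prior realizations
```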
Gaussian process priors
Example: stationary Gaussian random fields
θ(x) is a Gaussian field with Matérn covariance:

C(x, x′) = σ² / (2^(ν−1) Γ(ν)) · (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)
The covariance kernel is the Green's function of a differential operator:

(κ² − Δ)^β C(x, x′) = δ(x − x′)
ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared exponential covariance.
One can construct a discrete GMRF that approximates the solution of the SPDE (see Lindgren, Rue, and Lindström, JRSS B, 2011).
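A minimal sketch of evaluating the Matérn covariance above, using SciPy's modified Bessel function K_ν. The covariance tends to σ² as r → 0, where the formula is 0/0; clamping small distances is an illustrative shortcut.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(r, sigma2=1.0, nu=0.5, kappa=10.0):
    """C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r)."""
    s = np.maximum(kappa * np.asarray(r, dtype=float), 1e-12)   # avoid the 0/0 at r = 0
    return sigma2 / (2.0**(nu - 1.0) * gamma(nu)) * s**nu * kv(nu, s)

r = np.linspace(0.0, 1.0, 5)
print(matern_cov(r))             # for nu = 1/2 this reproduces sigma2 * exp(-kappa * r)
```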
Hierarchical Gaussian priors
(Figures and excerpts below: D. Calvetti and E. Somersalo, Inverse Problems 24 (2008) 034013.)
Figure 1. Three realizations drawn from the prior (6) with constant variance θj = θ0 (left) and from the corresponding prior where the variance is 100-fold at two points indicated by arrows (right).
where X and W are the n-variate random variables with components Xj and Wj, respectively, and

L = [  1              ]
    [ −1   1          ]
    [      ⋱    ⋱     ]
    [          −1   1 ],    D = diag(θ1, θ2, ..., θn).    (5)
Since W is a standard normal random variable, relation (4) allows us to write the (prior) probability density of X as

π_prior(x) ∝ exp(−½ ‖D^(−1/2) L x‖²).    (6)
Not surprisingly, the first-order autoregressive Markov model leads to a first-order smoothness prior for the variable X. The variance vector θ expresses the expected variability of the signal over the support interval, and provides a handle to control the qualitative behavior of the signal. Assume, for example, that we set θj = θ0 = const., 1 ≤ j ≤ n, leading to a homogeneous smoothness over the support interval. By changing some of the components, e.g., setting θk = θℓ = 100 θ0 for some k, ℓ, we expect the signal to have jumps of standard deviation √θk = √θℓ = 10 √θ0 at the grid intervals [t_{k−1}, t_k] and [t_{ℓ−1}, t_ℓ]. This is illustrated in figure 1, where we show some random draws from the prior. It is important to note that the higher values of θj do not force jumps to occur; they simply make jumps more likely by increasing the local variance.

This observation suggests that when the number, location, and expected amplitudes of the jumps are known, that is, when the prior information is quantitative, the first-order Markov model provides the means to encode the available information into the prior. Suppose now that the only available information about the solution of the inverse problem is qualitative: jumps may occur, but there is no information about how many, where, and how large. Adhering to the Bayesian paradigm, we express this lack of quantitative information by modeling the variance of the Markov process as a random variable. The estimation of the variance vector thus becomes part of the inverse problem.
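A minimal sketch of drawing from prior (6), mimicking the Figure 1 experiment: since D^(−1/2) L X = W with W standard normal, realizations are X = L⁻¹ D^(1/2) W. The grid size, variance values, and jump locations below are illustrative.

```python
import numpy as np

n = 100
theta = np.full(n, 1.0e-4)       # constant variance theta_0 everywhere...
theta[[30, 70]] *= 100.0         # ...except 100-fold larger at two points: jumps likely there

# Bidiagonal first-order difference matrix L from (5)
L = np.eye(n) - np.diag(np.ones(n - 1), k=-1)

W = np.random.randn(n, 3)                               # three standard normal draws
X = np.linalg.solve(L, np.sqrt(theta)[:, None] * W)     # prior realizations of X
```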
Figure 4. Approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 5 iterations of the cyclic algorithm, using the GMRES method to compute the update of the image at each iteration step.
Figure 5. Approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.
The graphs displayed in figure 6 refer to the CGLS iteration with an inverse gamma hyperprior. The value of the objective function levels off after five iterations, and this could be the basis for a stopping criterion. Note that after seven iterations, the norm of the estimation error starts to grow again, which is typical of algorithms that exhibit semi-convergence. The speckling phenomenon, by which individual pixel values close to the discontinuity start to diverge, is partly responsible for the growth of the error. This suggests that the iterations should be stopped soon after the objective function settles. The norm of the derivative is already small at the end of the first iterations, which indicates that the sequential iteration indeed finds a good approximation to a minimizer.