The Bayesian approach to inverse problems
Youssef Marzouk
Department of Aeronautics and Astronautics, Center for Computational Engineering
Massachusetts Institute of Technology
[email protected], http://uqgroup.mit.edu
ICERM IdeaLab, 7 July 2015
Statistical inference
Why is a statistical perspective useful in inverse problems?
To characterize uncertainty in the inverse solution
To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
To make probabilistic predictions
To choose “good” observations or experiments
To address questions of model error, model validity, and model selection
Bayesian inference
Bayes’ rule
p(θ | y) = p(y | θ) p(θ) / p(y)
Key idea: model parameters θ are treated as random variables
(For simplicity, we let our random variables have densities)
Notation
θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
p(θ) is the prior probability density
L(θ) ≡ p(y |θ) is the likelihood function
p(θ|y) is the posterior probability density
p(y) is the evidence, or equivalently, the marginal likelihood
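As a concrete illustration (not from the slides), Bayes' rule is easy to exercise numerically in one dimension: fix a prior and a likelihood, then compute the evidence by quadrature on a grid. All distributions and the single observation below are made up.

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)                     # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)   # p(theta): standard normal
y, sigma = 1.3, 0.5                                      # one made-up observation and noise std
likelihood = np.exp(-0.5 * ((y - theta) / sigma)**2)     # L(theta) = p(y | theta), up to a constant
evidence = np.sum(likelihood * prior) * dtheta           # p(y), up to the same constant
posterior = likelihood * prior / evidence                # p(theta | y); the constant cancels
```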
Bayesian inference
Summaries of the posterior distribution
What information to extract?
Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
Posterior covariance or higher moments of θ
Quantiles
Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α.
Credible intervals are not uniquely defined by the above; thus consider, for example, the HPD (highest posterior density) region.
Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations (see the sketch below)
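As an illustration (not from the slides), all of these summaries are straightforward to estimate once posterior realizations are available. Here, stand-in normal draws play the role of posterior samples; every number is made up.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=1.0, scale=0.3, size=100_000)   # stand-in posterior draws of theta

post_mean = samples.mean()                               # posterior mean
post_var = samples.var()                                 # posterior variance
q25, q50, q75 = np.percentile(samples, [25, 50, 75])     # posterior quantiles
lo, hi = np.percentile(samples, [2.5, 97.5])             # an equal-tailed 95% credible interval
```

The equal-tailed interval above is just one of many valid credible intervals; the HPD region is another.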
Bayesian and frequentist statistics
Understanding both perspectives is useful and important...
Key differences between these two statistical paradigms
Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods pθ(y) ≡ p(y | θ) but not priors p(θ) or posteriors; θ is not a random variable.
In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.
The estimator θ̂(y) is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.
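A quick simulation (illustrative, not from the slides) makes this concrete: each repetition of the experiment yields a different y, and hence a different value of the estimator, here taken to be the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0                                 # the fixed, non-random parameter
n_repeats, n_obs = 10_000, 20

# One estimate theta_hat(y) = mean(y) per repetition of the experiment
y = rng.normal(theta_true, 1.0, size=(n_repeats, n_obs))
estimates = y.mean(axis=1)
print(estimates.mean(), estimates.std())         # sampling distribution of theta_hat
```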
Bayesian and frequentist statistics
Key differences (continued)
Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
One common estimator is maximum likelihood (ML): θ̂_ML = argmax_θ p(y | θ). Here p(y | θ) defines a family of distributions indexed by θ.
Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood” (a sketch follows below).
What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?
Frequentist: “plug-in” or other estimators of y_new
Bayesian: posterior prediction via integration
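A minimal sketch of the MAP/penalized-likelihood link, assuming Gaussian noise and a zero-mean Gaussian prior; the toy forward model, datum, and standard deviations are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta, y, g, noise_std, prior_std):
    # -log p(theta | y) up to a constant: data misfit plus prior "penalty"
    misfit = 0.5 * np.sum((y - g(theta))**2) / noise_std**2
    penalty = 0.5 * np.sum(theta**2) / prior_std**2
    return misfit + penalty

g = lambda theta: theta**3 - theta                        # toy nonlinear forward model
y = np.array([0.8])                                       # made-up datum
res = minimize(neg_log_posterior, x0=np.array([1.0]), args=(y, g, 0.1, 1.0))
theta_map = res.x                                         # the MAP estimate
```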
Bayesian inference
Likelihood functions
In general, p(y | θ) is a probabilistic model for the data.
In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.
Selected examples of likelihood functions:
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise (a sketch of this case follows below)
3. Nonlinear forward model with noise + model discrepancy
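For example 2, here is a minimal sketch of the resulting log-likelihood, assuming additive i.i.d. Gaussian noise, y = g(θ) + ε with ε ~ N(0, σ²I); the forward model and data below are placeholders.

```python
import numpy as np

def log_likelihood(theta, y, g, noise_std):
    """log p(y | theta) for y = g(theta) + eps, eps ~ N(0, noise_std^2 I)."""
    resid = y - g(theta)
    n = resid.size
    return (-0.5 * np.sum(resid**2) / noise_std**2
            - n * np.log(noise_std)
            - 0.5 * n * np.log(2.0 * np.pi))

g = lambda theta: np.array([np.sum(theta**2), np.prod(theta)])  # placeholder forward model
print(log_likelihood(np.array([0.5, 1.0]), np.array([1.2, 0.4]), g, 0.1))
```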
Bayesian inference
Prior distributions
In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
Intuitive idea: assign lower probability to values of θ that you don't expect to see, and higher probability to values of θ that you do expect to see.
Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)
Gaussian process priors
Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω) : D × Ω → ℝ is multivariate normal.
In other words: θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.
Specify via mean function and covariance function
E[θ(x)] = µ(x)
E[(θ(x) − µ(x)) (θ(x′) − µ(x′))] = C(x, x′)
Smoothness of the process is controlled by the behavior of the covariance function as x′ → x.
Restrictions: stationarity, isotropy, ... (a sampling sketch follows below)
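A minimal sketch of this specification in practice (not from the slides): evaluate the covariance function on a grid and draw jointly Gaussian realizations through a Cholesky factor. The squared-exponential kernel and its parameters are illustrative choices.

```python
import numpy as np

def sq_exp_cov(x, xp, variance=1.0, length_scale=0.2):
    # C(x, x') for a squared-exponential kernel (one illustrative choice)
    return variance * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / length_scale**2)

x = np.linspace(0.0, 1.0, 200)                   # grid over the index set D
C = sq_exp_cov(x, x) + 1e-10 * np.eye(x.size)    # jitter for numerical positive-definiteness
L = np.linalg.cholesky(C)                        # C = L L^T
theta_draws = L @ np.random.randn(x.size, 3)     # three zero-mean GP prior realizations
```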
Gaussian process priors
Example: stationary Gaussian random fields
θ(x) is a Gaussian field with Matérn covariance:

C(x, x′) = σ² / (2^(ν−1) Γ(ν)) · (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)
The covariance kernel is the Green's function of a differential operator:

(κ² − Δ)^β C(x, x′) = δ(x − x′)
ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared exponential covariance.
One can construct a discrete GMRF that approximates the solution of the SPDE (see Lindgren, Rue, and Lindström, JRSS B, 2011).
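A minimal sketch of evaluating the Matérn covariance above, using SciPy's modified Bessel function K_ν. The covariance tends to σ² as r → 0, where the formula is 0/0; clamping small distances is an illustrative shortcut.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(r, sigma2=1.0, nu=0.5, kappa=10.0):
    """C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r)."""
    s = np.maximum(kappa * np.asarray(r, dtype=float), 1e-12)   # avoid the 0/0 at r = 0
    return sigma2 / (2.0**(nu - 1.0) * gamma(nu)) * s**nu * kv(nu, s)

r = np.linspace(0.0, 1.0, 5)
print(matern_cov(r))             # for nu = 1/2 this reproduces sigma2 * exp(-kappa * r)
```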
Hierarchical Gaussian priors
(Figures and excerpts below: D. Calvetti and E. Somersalo, Inverse Problems 24 (2008) 034013.)
Figure 1. Three realizations drawn from the prior (6) with constant variance θj = θ0 (left) and from the corresponding prior where the variance is 100-fold at two points indicated by arrows (right).
where X and W are the n-variate random variables with components Xj and Wj, respectively, and

L = [  1              ]
    [ −1   1          ]
    [      ⋱    ⋱     ]
    [          −1   1 ],    D = diag(θ1, θ2, ..., θn).    (5)
Since W is a standard normal random variable, relation (4) allows us to write the (prior) probability density of X as

π_prior(x) ∝ exp(−½ ‖D^(−1/2) L x‖²).    (6)
Not surprisingly, the first-order autoregressive Markov model leads to a first-order smoothness prior for the variable X. The variance vector θ expresses the expected variability of the signal over the support interval, and provides a handle to control the qualitative behavior of the signal. Assume, for example, that we set θj = θ0 = const., 1 ≤ j ≤ n, leading to a homogeneous smoothness over the support interval. By changing some of the components, e.g., setting θk = θℓ = 100 θ0 for some k, ℓ, we expect the signal to have jumps of standard deviation √θk = √θℓ = 10 √θ0 at the grid intervals [t_{k−1}, t_k] and [t_{ℓ−1}, t_ℓ]. This is illustrated in figure 1, where we show some random draws from the prior. It is important to note that the higher values of θj do not force jumps to occur; they simply make jumps more likely by increasing the local variance.

This observation suggests that when the number, location, and expected amplitudes of the jumps are known, that is, when the prior information is quantitative, the first-order Markov model provides the means to encode the available information into the prior. Suppose now that the only available information about the solution of the inverse problem is qualitative: jumps may occur, but there is no information about how many, where, and how large. Adhering to the Bayesian paradigm, we express this lack of quantitative information by modeling the variance of the Markov process as a random variable. The estimation of the variance vector thus becomes part of the inverse problem.
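A minimal sketch of drawing from prior (6), mimicking the Figure 1 experiment: since D^(−1/2) L X = W with W standard normal, realizations are X = L⁻¹ D^(1/2) W. The grid size, variance values, and jump locations below are illustrative.

```python
import numpy as np

n = 100
theta = np.full(n, 1.0e-4)       # constant variance theta_0 everywhere...
theta[[30, 70]] *= 100.0         # ...except 100-fold larger at two points: jumps likely there

# Bidiagonal first-order difference matrix L from (5)
L = np.eye(n) - np.diag(np.ones(n - 1), k=-1)

W = np.random.randn(n, 3)                               # three standard normal draws
X = np.linalg.solve(L, np.sqrt(theta)[:, None] * W)     # prior realizations of X
```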
Figure 4. Approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 5 iterations of the cyclic algorithm, using the GMRES method to compute the update of the image at each iteration step.
Figure 5. Approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.
The graphs displayed in figure 6 refer to the CGLS iteration with an inverse gamma hyperprior. The value of the objective function levels off after five iterations, and this could be the basis for a stopping criterion. Note that after seven iterations, the norm of the estimation error starts to grow again, which is typical of algorithms that exhibit semi-convergence. The speckling phenomenon, by which individual pixel values close to the discontinuity start to diverge, is partly responsible for the growth of the error. This suggests that the iterations should be stopped soon after the objective function settles. The norm of the derivative is already small at the end of the first iterations, which indicates that the sequential iteration indeed finds a good approximation to a minimizer.