Bayesian Inversion: Algorithms
Andrew M Stuart
Mathematics Institute and Centre for Scientific Computing, University of Warwick
Woudshoten Lectures 2013, October 4th 2013
Work funded by EPSRC, ERC and ONR
http://homepages.warwick.ac.uk/~masdr/

A.M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica 19 (2010). ~masdr/BOOKCHAPTERS/stuart15c.pdf

M. Dashti, K.J.H. Law, A.M. Stuart and J. Voss. MAP estimators and posterior consistency . . . . Inverse Problems 29 (2013), 095017. arXiv:1303.4795.

F. Pinski, G. Simpson, A.M. Stuart and H. Weber. Kullback-Leibler approximation for probability measures on infinite dimensional spaces. In preparation.

S.L. Cotter, G.O. Roberts, A.M. Stuart and D. White. MCMC methods for functions . . . . Statistical Science 28 (2013). arXiv:1202.0709.

M. Hairer, A.M. Stuart and S. Vollmer. Spectral gaps for a Metropolis-Hastings algorithm . . . . arXiv:1112.1392.
Outline
1 SETTING AND ASSUMPTIONS
2 MAP ESTIMATORS
3 KULLBACK-LEIBLER APPROXIMATION
4 SAMPLING
5 CONCLUSIONS
1 SETTING AND ASSUMPTIONS
The Setting
Probability measure µ on Hilbert space H. Reference measure µ0 (often a prior). µ related to µ0 by (often Bayes' Theorem)

dµ/dµ0 (u) = (1/Zµ) exp(−Φ(u)).

Another way of saying the same thing:

Eµ f(u) = (1/Zµ) Eµ0 [ exp(−Φ(u)) f(u) ].
How do we get information from µ if we know µ0 and Φ?
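The second identity is precisely the self-normalized importance sampling formula, with µ0 as proposal. A minimal Python sketch on an assumed 1D example (µ0 = N(0,1) and Φ(u) = (u−1)²/2, so that µ = N(1/2, 1/2) exactly; this example is not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1D example: prior mu0 = N(0,1), potential Phi(u) = (u-1)^2/2,
# so the target mu is N(1/2, 1/2) and E_mu[u] = 0.5 exactly.
def Phi(u):
    return 0.5 * (u - 1.0) ** 2

N = 200_000
u = rng.standard_normal(N)       # samples from mu0
w = np.exp(-Phi(u))              # unnormalised weights exp(-Phi(u))
w /= w.sum()                     # self-normalisation absorbs Z_mu

posterior_mean = np.sum(w * u)   # estimate of E_mu[f] with f(u) = u
print(posterior_mean)            # close to 0.5
```

The self-normalization absorbs the unknown constant Zµ, so no normalising constant is ever computed; in high dimensions such weights degenerate, which is one motivation for the MCMC methods later in the talk.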
The Talk In One Picture
[Figure: one posterior illustrated three ways: Gaussian approximation, MAP estimator, and samples.]
The Assumptions
µ0 = N(0, C0), a centred Gaussian measure on H. µ0(X) = 1; X (Banach) continuously embedded in H.

Let E = D(C0^{−1/2}) (the Cameron-Martin space).

Then E ⊂ X ⊆ H, with E (Hilbert) compactly embedded in X. The function Φ ∈ C(X; R+).

For all u, v with ‖u‖X ≤ r, ‖v‖X ≤ r there are Mi(r) such that

|Φ(u)| ≤ M1(r),
|Φ(u) − Φ(v)| ≤ M2(r) ‖u − v‖X.
2 MAP ESTIMATORS
Probability Maximizers and Tikhonov Regularization
Define the Tikhonov-regularized least-squares functional I : E → R+ by

I(u) := (1/2) ‖C0^{−1/2} u‖² + Φ(u).

Let Bδ(z) be the ball of radius δ in X centred at z ∈ E = D(C0^{−1/2}).

Theorem (Dashti, Law, S and Voss, 2013). The probability measure µ and the functional I are related by

lim_{δ→0} µ(Bδ(z1)) / µ(Bδ(z2)) = exp( I(z2) − I(z1) ).

Thus probability maximizers are minimizers of the Tikhonov-regularized functional I.
Existence of Probability Maximizers
The minimization is well-defined:

Theorem (S, Acta Numerica, 2010). There exists ū ∈ E such that

I(ū) = Ī := inf{ I(u) : u ∈ E }.

Furthermore, if {un} is a minimizing sequence satisfying I(un) → Ī, then there is a subsequence {un′} that converges strongly to ū in E.
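In finite dimensions this minimization is ordinary regularized least squares. A sketch with an assumed linear setup (hypothetical A, y, σ and diagonal C0, not from the slides), where Φ(u) = |Au − y|²/(2σ²) makes I quadratic and the minimizer solves the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-dimensional setup: diagonal prior precision C0^{-1},
# linear forward map A, data y, and Phi(u) = |A u - y|^2 / (2 sigma^2).
# Then I(u) = 0.5 u^T C0^{-1} u + Phi(u) is quadratic, and the MAP point
# solves (C0^{-1} + A^T A / sigma^2) u = A^T y / sigma^2.
n, m, sigma = 10, 5, 0.1
C0_inv = np.diag(np.arange(1.0, n + 1))   # assumed prior precision
A = rng.standard_normal((m, n))           # assumed forward map
y = rng.standard_normal(m)                # assumed data

def grad_I(u):
    # gradient of the Tikhonov-regularised functional I
    return C0_inv @ u + A.T @ (A @ u - y) / sigma**2

u_map = np.linalg.solve(C0_inv + A.T @ A / sigma**2, A.T @ y / sigma**2)
print(np.linalg.norm(grad_I(u_map)))      # ~0: first-order optimality
```

For nonlinear forward maps (as in the Navier-Stokes example below) the same functional is minimized iteratively rather than by a direct solve.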
Example: Navier-Stokes Inversion for Initial Condition
[Figures: stream function in the (x1, x2) plane and spectral coefficients |uk| in the (k1, k2) plane, shown for two fields.]
Incompressible NSE on ΩT = T² × (0, ∞):

∂t v − ν∆v + v·∇v + ∇p = f in ΩT,
∇·v = 0 in ΩT,
v|t=0 = u in T².

Observations: y_{j,k} = v(x_j, t_k) + η_{j,k}, η_{j,k} ∼ N(0, σ² I_{2×2}),

i.e. y = G(u) + η, η ∼ N(0, σ²I). Here C0 = (−∆_stokes)^{−2} and Φ(u) = (1/(10³σ²)) |y − G(u)|².
Example: Navier-Stokes Inversion for Initial Condition

[Figure: relative errors |u⋆ − u†|/|u†|, |G(u⋆) − G(u†)|/|G(u†)| and |G(u⋆) − y|/|y| plotted against 1/σ on log-log axes. MAP estimator u⋆; truth u†.]
3 KULLBACK-LEIBLER APPROXIMATION
The Objective Functional
Recall µ0 = N(0, C0) and µ(du) ∝ exp(−Φ(u)) µ0(du). Let A denote a set of simple measures on H (usually Gaussian).

Problem. Find ν ∈ A that minimizes I(ν) := DKL(ν‖µ).

Here DKL is the Kullback-Leibler divergence (relative entropy):

DKL(ν‖µ) = ∫_H (dν/dµ)(x) log( (dν/dµ)(x) ) µ(dx) if ν ≪ µ, and +∞ otherwise.

We note, for intuition, the inequality

dHell(ν, µ)² ≤ 2 DKL(ν‖µ).
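For one-dimensional Gaussians both DKL and dHell are closed-form, so the inequality can be spot-checked numerically. The sketch below assumes the standard formulas and the normalization dHell(ν, µ)² = 1 − (Bhattacharyya coefficient):

```python
import numpy as np

# Closed-form KL divergence and squared Hellinger distance between 1D
# Gaussians N(m1, s1^2) and N(m2, s2^2); the normalisation
# d_Hell^2 = 1 - (Bhattacharyya coefficient) is assumed, so d_Hell <= 1.
def kl(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def hell2(m1, s1, m2, s2):
    bc = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(
        -((m1 - m2) ** 2) / (4 * (s1**2 + s2**2))
    )
    return 1.0 - bc

# Spot-check d_Hell(nu, mu)^2 <= 2 D_KL(nu || mu) on a few parameter pairs.
for m1, s1, m2, s2 in [(0, 1, 0, 1), (1, 1, 0, 2), (3, 0.5, 0, 1), (0, 2, 1, 0.3)]:
    assert hell2(m1, s1, m2, s2) <= 2 * kl(m1, s1, m2, s2) + 1e-12
```

Note that DKL is not symmetric, while dHell is; the bound controls the Hellinger error of any approximation whose KL divergence from µ is small.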
Existence of Minimizers
The minimization is well-defined:
Theorem(Pinski, Simpson, S, Weber, 2013) If A is closed under weakconvergence and there is ν ∈ A with I(ν) <∞ then ∃ ν ∈ Asuch that
I(ν) = I := infI(ν) : ν ∈ A.Furthermore, if νn is a minimizing sequence satisfyingI(νn)→ I then there is a subsequence νn′ that converges to νin the Hellinger metric:
dHell(νn, ν)→ 0.
Example: A := G = Gaussian measures on H.
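A 1D sketch of the minimization over A = Gaussians, reusing the assumed toy example µ0 = N(0,1), Φ(u) = (u−1)²/2 (not from the slides): the target µ = N(1/2, 1/2) is itself Gaussian, so the KL minimizer should recover it. For ν = N(m, s²), DKL(ν‖µ) equals, up to an additive constant, J(m, s) = −log s + Eν[Φ(u) + u²/2], closed-form for quadratic Φ; here J is minimized by brute-force grid search:

```python
import numpy as np

# Assumed example: mu0 = N(0,1), Phi(u) = (u-1)^2/2, target mu = N(1/2, 1/2).
# For nu = N(m, s^2), D_KL(nu || mu) = -log s + E_nu[Phi(u) + u^2/2] + const,
# and for quadratic Phi the expectations are closed-form:
#   E_nu[Phi] = ((m-1)^2 + s^2)/2,  E_nu[u^2/2] = (m^2 + s^2)/2.
def J(m, s):
    return -np.log(s) + ((m - 1.0) ** 2 + s**2) / 2 + (m**2 + s**2) / 2

ms = np.linspace(-1.0, 2.0, 301)    # grid over the mean m
ss = np.linspace(0.1, 2.0, 191)     # grid over the standard deviation s
M, S = np.meshgrid(ms, ss)
i, j = np.unravel_index(np.argmin(J(M, S)), M.shape)
m_star, s_star = M[i, j], S[i, j]
print(m_star, s_star**2)            # close to (0.5, 0.5): the best Gaussian is mu
```

In the infinite-dimensional setting the grid search is replaced by a calculus-of-variations problem over the mean and covariance, as in the Pinski-Simpson-Stuart-Weber work.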
[Figure: 2 × 10⁻⁴, r = 1 so H¹ regularization; 10⁴ iterations, 10⁴ samples per iteration, 10² points in (0, T) per sample. Panels: the mean mn and Bn plotted against tn and iteration count.]
Model Comparison: 95% Confidence Intervals about the Mean Path

[Figure: mean path Xt against t ∈ (0, 5) with 95% confidence bands, for constant B (left) and variable B (right).]
4 SAMPLING
MCMC
MCMC: create an ergodic Markov chain u^(k) which is invariant for the approximate target µ (or µN, the approximation on R^N) so that

(1/K) Σ_{k=1}^{K} f(u^(k)) → Eµ f.

Recall µ0 = N(0, C0) and µ(du) ∝ exp(−Φ(u)) µ0(du).

Recall the Tikhonov functional I(u) = (1/2) ‖C0^{−1/2} u‖² + Φ(u).
Standard Random Walk Algorithm
Metropolis, Rosenbluth, Rosenbluth, Teller and Teller, J. Chem. Phys. 1953.

Set k = 0 and pick u^(0).
Propose v^(k) = u^(k) + β ξ^(k), ξ^(k) ∼ N(0, C0).
Set u^(k+1) = v^(k) with probability a(u^(k), v^(k)); otherwise set u^(k+1) = u^(k).
k → k + 1.

Here a(u, v) = min{ 1, exp( I(u) − I(v) ) }.
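A Python sketch of the algorithm above, on an assumed 1D example (C0 = 1, Φ(u) = (u−1)²/2, so the target is N(1/2, 1/2); the example is not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# Standard random walk Metropolis as listed above, on an assumed 1D example:
# C0 = 1 (mu0 = N(0,1)), Phi(u) = (u-1)^2/2, so the target is N(1/2, 1/2).
# The acceptance probability uses the full functional I(u) = u^2/2 + Phi(u).
def I(u):
    return 0.5 * u**2 + 0.5 * (u - 1.0) ** 2

beta, K = 0.8, 100_000
u = 0.0
chain = np.empty(K)
for k in range(K):
    v = u + beta * rng.standard_normal()              # propose v = u + beta*xi
    if rng.random() < min(1.0, np.exp(I(u) - I(v))):  # a(u,v) = min{1, e^{I(u)-I(v)}}
        u = v
    chain[k] = u

print(chain.mean())   # ergodic average of f(u) = u, close to 1/2
```

On R^N with β fixed, the acceptance probability of this proposal degrades as N grows; that dimension dependence is what the Navier-Stokes forcing example illustrates.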
New Random Walk Algorithm
Cotter, Roberts, S and White, Statistical Science 2013.

Set k = 0 and pick u^(0).
Propose v^(k) = √(1 − β²) u^(k) + β ξ^(k), ξ^(k) ∼ N(0, C0).
Set u^(k+1) = v^(k) with probability a(u^(k), v^(k)); otherwise set u^(k+1) = u^(k).
k → k + 1.

Here a(u, v) = min{ 1, exp( Φ(u) − Φ(v) ) }.
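The same assumed 1D example with the new proposal; only Φ enters the acceptance probability, because the autoregressive proposal preserves the reference measure µ0 = N(0, C0) exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

# The new (pCN) proposal on the same assumed 1D example: C0 = 1,
# Phi(u) = (u-1)^2/2, target N(1/2, 1/2). The proposal preserves mu0,
# so the acceptance probability involves Phi alone.
def Phi(u):
    return 0.5 * (u - 1.0) ** 2

beta, K = 0.5, 100_000
u = 0.0
chain = np.empty(K)
for k in range(K):
    v = np.sqrt(1.0 - beta**2) * u + beta * rng.standard_normal()
    if rng.random() < min(1.0, np.exp(Phi(u) - Phi(v))):  # min{1, e^{Phi(u)-Phi(v)}}
        u = v
    chain[k] = u

print(chain.mean())   # close to 1/2; with Phi = 0 every proposal would be accepted
```

Because the Gaussian prior part is handled exactly by the proposal, the acceptance rate does not collapse as the discretization dimension N grows; this is the content of the spectral gap result below.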
Example: Navier-Stokes Inversion for Forcing
[Figure: average acceptance probability against β for SRWMH (left) and RWMH (right), each at mesh sizes ∆x = 0.100, 0.050, 0.020, 0.010, 0.005, 0.002.]
Incompressible NSE on ΩT = T² × (0, ∞):

∂t v − ν∆v + v·∇v + ∇p = u in ΩT,
∇·v = 0 in ΩT,
v|t=0 = v0 in T².

Observations: y_{j,k} = v(x_j, t_k) + ξ_{j,k}, ξ_{j,k} ∼ N(0, σ² I_{2×2}),

i.e. y = G(u) + ξ, ξ ∼ N(0, σ²I). Prior: OU process; Φ(u) = (1/σ²) |y − G(u)|².
Spectral Gaps
Theorem (Hairer, S and Vollmer, arXiv 2012).

For the standard random walk algorithm the spectral gap is bounded above by C N^{−1/2}.

For the new random walk algorithm the spectral gap is bounded below independently of dimension.
5 CONCLUSIONS
What We Have Shown
We have shown that:
Common structure: a range of problems require extracting information from a probability measure on a Hilbert space, having density with respect to a Gaussian.

Algorithmic approaches: we have laid the foundations of a range of computational methods related to this task.

MAP estimators: maximum a posteriori estimators can be defined on Hilbert space; there is a link to Tikhonov regularization.

Kullback-Leibler approximation: KL approximation can be defined on Hilbert space, and finding the closest Gaussian is a well-defined problem in the calculus of variations.

Sampling: MCMC methods can be defined on Hilbert space; this results in new algorithms robust to discretization.