Tarek El Moselhy & Youssef Marzouk Massachusetts Institute of Technology
Department of Aeronautics & Astronautics
Inverse problems
• Infer model parameters from indirect, noisy, and limited observations
• Problems are often ill-posed and high-dimensional
• Example: estimate subsurface properties from observations of pressure and transport (here, saturation)
Inverse problems
• Infer model parameters from indirect, noisy, and limited observations
• Problems are often ill-posed and high-dimensional
• Example: infer electrochemical kinetic parameters and pathways from system-level data
Inverse problems
• Infer model parameters from indirect, noisy, and limited observations
• Problems are often ill-posed and high-dimensional
⇒ An essential step in predictive simulation: endow parameters and subsequent predictions with quantified uncertainties
• Inversion as statistical inference — a Bayesian approach
  – Model parameters represented by a random variable x; data d
  – Apply Bayes’ rule: posterior density ∝ likelihood function $L(x; d)$ × prior density
Statistical inference
• The posterior density is the full Bayesian solution to the inference problem
  – Not just a single value for x, but a probability density
  – A complete description of uncertainty
  – An input to future simulations — prediction and experimental design
$L(x; d) \equiv p(d \mid x)$
$p(x \mid d) = \dfrac{p(d \mid x)\, p(x)}{\int p(d \mid x)\, p(x)\, dx}$
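As a concrete illustration (not from the slides), a one-dimensional toy problem makes the roles of prior, likelihood, and evidence in Bayes' rule explicit; the quadratic forward model, datum, and noise level below are arbitrary choices.

```python
import numpy as np

# Toy 1-D Bayes' rule on a grid (all problem values are illustrative assumptions).
x = np.linspace(-3, 3, 2001)                       # parameter grid
prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # standard normal prior p(x)

d, sigma_n = 1.3, 0.5                              # assumed datum and noise std. dev.
y = x**2                                           # assumed forward model y(x)
likelihood = np.exp(-0.5 * ((y - d) / sigma_n)**2) # L(x; d) = p(d | x), up to a constant

unnormalized = likelihood * prior
evidence = np.sum(unnormalized) * (x[1] - x[0])    # p(d) = ∫ p(d|x) p(x) dx (Riemann sum)
posterior = unnormalized / evidence                # p(x | d)
print(evidence, np.sum(posterior) * (x[1] - x[0])) # second value should be ~1
```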
Computational issues
• Extracting information from the posterior
  – Means, variances, higher moments; marginal distributions; realizations; posterior predictions; decisions (expected utility)
  – Posterior evaluations may be expensive (forward model = PDE)
  – Parameter x may be high-dimensional
• Usual approaches involve Markov chain Monte Carlo (MCMC) sampling; very useful, but…
  – Generates a stream of correlated samples
  – Proposal design is difficult; potential for poor mixing
  – No clear convergence criteria
  – Requires a large number of forward model evaluations — though surrogate and reduced models (e.g., polynomial chaos, state projection) can help
  – Not recursive
An alternative viewpoint
• Combine prior knowledge with the forward model y(x), an error model, and data d [i.e., the likelihood] to obtain posterior knowledge
• View this as a transformation f from the “prior random variable” X to the “posterior random variable” Z:
  $Z = f(X) \sim \pi_d$, i.e., Z is distributed as $X \mid d$
Random variable transformation
• Transformation $Z = f(X)$ maps the prior density p (of X) to the posterior density $\pi_d$ (of Z)
Can we compute an appropriate map f ?
Properties of the map
• Map “pushes forward” the prior measure to the posterior measure
• Potential advantages and desiderata
  – Generate arbitrary numbers of independent posterior samples, without additional forward solves
  – Clear convergence criterion
  – Analytical expressions for posterior moments
  – Computationally less expensive? Opportunities for parallelization?
  – Propagate posterior through subsequent models? Apply recursively?
  – Compute posterior normalizing constant (evidence or marginal likelihood, for use in model selection)
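The practical payoff is that posterior sampling reduces to prior sampling plus a deterministic function evaluation. A minimal sketch of the push-forward idea, assuming the map is already in hand (here, the exact affine map of a hypothetical 1-D linear-Gaussian problem; the datum and noise level are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D linear-Gaussian problem: d = x + noise, standard normal prior.
d, sigma_n = 0.8, 0.5
post_var = 1.0 / (1.0 + 1.0 / sigma_n**2)   # closed-form posterior variance
post_mean = post_var * d / sigma_n**2       # closed-form posterior mean

def f(x):
    """Exact map pushing the N(0, 1) prior forward to the Gaussian posterior."""
    return post_mean + np.sqrt(post_var) * x

x_prior = rng.standard_normal(100_000)      # independent prior samples
z_post = f(x_prior)                         # independent posterior samples, no extra forward solves
print(z_post.mean(), z_post.var())          # ≈ post_mean, post_var
```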
Outline
1 Motivation and concept
2 Formulation
3 Solution methods
4 Numerical examples
Formulation
• Some notation:
  – Map is $f : \mathbb{R}^n \to \mathbb{R}^n$ (we will discuss the functional form of f later)
  – Forward model is $y : \mathbb{R}^n \to \mathbb{R}^{n_{\mathrm{obs}}}$
  – Start with the posterior density
    $\pi_d(z) = \dfrac{L(z; d)\, p(z)}{\beta}, \qquad \beta = \int L(z; d)\, p(z)\, dz$
• Assume we know an invertible f such that $Z = f(X)$
• Perform a transformation from the posterior to the prior to get a probability density for X
Formulation
• q(x; f) is a probability density for the prior random variable X, parameterized by f:
  $q(x; f) = \dfrac{L\big(f(x); d\big)\; p\big(f(x)\big)\; \left|\det \frac{df}{dx}\right|}{\beta}$
  where $\left|\det \frac{df}{dx}\right|$ is the Jacobian determinant of the transformation and, as before,
  $\pi_d(z) = \dfrac{L(z; d)\, p(z)}{\beta}$
Formulation
• But we already know the density of X, namely the prior
• The transformed distribution should then satisfy
  $q(x; f) = p(x)$
• Find f(x) such that q is close to p
  – For instance, minimize Kullback–Leibler (KL) divergence, Hellinger distance, etc.
  – Some analogy with variational Bayes (though different!)
Formulation schematic
[Schematic: the prior random variable X with density p is linked to the posterior random variable Z with density $\pi_d$ by the transformation $Z = f(X)$; pulling $\pi_d$ back through f gives the density q — find the transformation such that p = q]
Formulation
• Putting q close to p…
  – Kullback–Leibler divergence:
    $D_{KL}(p \,\|\, q) = \int \log\!\dfrac{p(x)}{q(x)}\, p(x)\, dx = 0 \;\Longleftrightarrow\; E\!\left[\exp T(X; f)\right] = \exp\!\big(E\!\left[T(X; f)\right]\big)$
    where $T(x; f) \equiv \log L\big(f(x); d\big) + \log p\big(f(x)\big) + \log\left|\det \tfrac{df}{dx}\right| - \log p(x)$
  – Thus T must be constant in x
  – Same result holds true for Hellinger and other “distances”
  – T = constant is also obtained by pointwise equality p(x) = q(x), but reveals the posterior normalizing constant β:
    $T(x; f) = \log L\big(f(x); d\big) + \log p\big(f(x)\big) + \log\left|\det \tfrac{df}{dx}\right| - \log p(x) = \log \beta$
  – As a byproduct of inference, we will calculate the evidence!
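A small sketch of evaluating T(x; f) for a candidate map, using a finite-difference Jacobian; the two-dimensional forward model, data, and noise level are placeholders rather than anything from the slides.

```python
import numpy as np

sigma_n = 0.1
d = np.array([0.9, 0.4])                    # assumed data

def forward(z):                             # assumed nonlinear forward model y(z)
    return np.array([z[0]**2 + z[1], z[1]**3])

def log_like(z):                            # Gaussian log-likelihood, up to a constant
    r = forward(z) - d
    return -0.5 * r @ r / sigma_n**2

def log_prior(z):                           # standard normal log-prior, up to a constant
    return -0.5 * z @ z

def T(x, f, eps=1e-6):
    """T(x; f) = log L(f(x); d) + log p(f(x)) + log|det df/dx| - log p(x)."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):                      # finite-difference Jacobian df/dx
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    fx = f(x)
    return log_like(fx) + log_prior(fx) + np.log(abs(np.linalg.det(J))) - log_prior(x)

identity_map = lambda x: x                  # Jacobian determinant 1, so the log-det term vanishes
print(T(np.array([0.1, -0.2]), identity_map))
```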
Optimization problems
1. Minimize variance:
   $\min_f \operatorname{Var}\!\left[T(X; f)\right]$
2. Pointwise equality (T = constant, in the $L^2$ sense):
   $T(x_i; f) = E\!\left[T(X; f)\right], \quad i = 1, \dots, N$
3. Explicitly minimize the KL divergence:
   $\min_f D_{KL}\big(p(x)\,\|\,q(x; f)\big)$
• Note: expectations and variances are all taken with respect to the prior distribution p
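A minimal sketch of option 1, minimizing the prior-sample variance of T over an affine map f(x) = b + Lx for a toy linear-Gaussian problem. SciPy's generic BFGS stands in for the Newton/adjoint machinery discussed later, and the problem sizes, operator A, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, m = 3, 2                                 # parameter and data dimensions (illustrative)
A = rng.standard_normal((m, n))             # linear forward model y(x) = A x
sigma_n = 0.2
d = A @ rng.standard_normal(n) + sigma_n * rng.standard_normal(m)

X = rng.standard_normal((500, n))           # fixed samples from the N(0, I) prior

def T_samples(theta):
    """T(x_i; f) for an affine map f(x) = b + L x, stacked over the prior samples."""
    b, L = theta[:n], theta[n:].reshape(n, n)
    FX = X @ L.T + b                        # f(x_i) for all samples
    r = FX @ A.T - d                        # residuals y(f(x_i)) - d
    log_like = -0.5 * np.sum(r**2, axis=1) / sigma_n**2
    log_prior_f = -0.5 * np.sum(FX**2, axis=1)
    log_det = np.log(abs(np.linalg.det(L))) # Jacobian is constant for an affine map
    log_prior_x = -0.5 * np.sum(X**2, axis=1)
    return log_like + log_prior_f + log_det - log_prior_x

objective = lambda theta: np.var(T_samples(theta))

theta0 = np.concatenate([np.zeros(n), np.eye(n).ravel()])  # start from the identity map
res = minimize(objective, theta0, method="BFGS")
print("Var[T] at optimum:", res.fun)
# At the optimum T is (nearly) constant, and E[T] ≈ log β,
# the evidence up to the likelihood normalization constant dropped above.
print("E[T] at optimum:", np.mean(T_samples(res.x)))
```

For a linear-Gaussian problem an affine map is expressive enough, so the sample variance of T can be driven essentially to zero; for nonlinear problems the same objective is used with a richer (e.g., polynomial) parameterization of f.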
Example of T
• In the case of a Gaussian prior (identity covariance) and additive Gaussian noise, the expression for T(x; f) is
$T(x; f) = -\tfrac{1}{2}\big(y(f(x)) - d\big)^{T}\,\Sigma_n^{-1}\,\big(y(f(x)) - d\big) - \tfrac{1}{2}\,f(x)^{T} f(x) + \log\left|\det\tfrac{df}{dx}\right| + \tfrac{1}{2}\,x^{T} x$

which follows from the general expression

$T(x; f) = \log L\big(f(x); d\big) + \log p\big(f(x)\big) + \log\left|\det\tfrac{df}{dx}\right| - \log p(x)$

with

$L\big(f(x); d\big) \propto \exp\!\Big(-\tfrac{1}{2}\big(y(f(x)) - d\big)^{T}\,\Sigma_n^{-1}\,\big(y(f(x)) - d\big)\Big), \qquad p(x) \propto \exp\!\big(-\tfrac{1}{2}\,x^{T} x\big)$
Existence and uniqueness
• In general, the map exists but is not unique
  – Example: linear forward model, additive Gaussian noise, zero-mean Gaussian prior:
    $y(x) = Ax, \quad d = y + \varepsilon, \quad X \sim N(0, I), \quad \varepsilon \sim N(0, \Sigma_n)$
  – Posterior distribution is Gaussian and known in closed form:
    $\Sigma_z = \big(I + A^{T} \Sigma_n^{-1} A\big)^{-1}, \quad \mu_z = \Sigma_z A^{T} \Sigma_n^{-1} d, \quad Z \sim N(\mu_z, \Sigma_z)$
  – Any affine transformation $Z = f(X) = \mu_z + L X$ with $L L^{T} = \Sigma_z$ represents a valid map
  – L is not uniquely defined: Cholesky factor, matrix square root, etc.
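A quick numerical check (with assumed small dimensions and noise level) of this non-uniqueness: both the Cholesky factor and the symmetric square root of Σ_z satisfy L Lᵀ = Σ_z, so either gives a valid affine map.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, m = 4, 3                                        # illustrative dimensions
A = rng.standard_normal((m, n))
Sigma_n = 0.1 * np.eye(m)
d = rng.standard_normal(m)                         # placeholder data

# Closed-form Gaussian posterior for y(x) = Ax with N(0, I) prior.
Sigma_z = np.linalg.inv(np.eye(n) + A.T @ np.linalg.inv(Sigma_n) @ A)
mu_z = Sigma_z @ A.T @ np.linalg.inv(Sigma_n) @ d

L_chol = np.linalg.cholesky(Sigma_z)               # one valid factor
L_sqrt = np.real(sqrtm(Sigma_z))                   # another valid factor
for L in (L_chol, L_sqrt):
    print(np.allclose(L @ L.T, Sigma_z))           # both satisfy L L^T = Sigma_z
```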
Existence and uniqueness
• Connections with optimal transport theory [Caffarelli, McCann, …]:
  – Use distance minimization to guarantee existence and uniqueness of an invertible map
    $\min_f E\!\left[\big(f(X) - X\big)^{T} \big(f(X) - X\big)\right]$
  – Map is the gradient of a convex scalar function
• Full formulation becomes, e.g.,
  $\min_f \operatorname{Var}\!\left[T(X; f)\right] + E\!\left[\big(f(X) - X\big)^{T} \big(f(X) - X\big)\right]$
Implementation issues
• Optimization problems
  – For $\min_f \operatorname{Var}\!\left[T(X; f)\right]$, use Newton’s method and variants
  – Take full advantage of forward-model adjoints and Hessians to compute derivatives with respect to the degrees of freedom in f
  – Expectation/variance computed using samples from the prior
  – For the pointwise conditions $T(x_i; f) = \text{constant}$, again use prior samples; nonlinear least squares, again using adjoint information
Implementation issues
• Represent f using an orthogonal polynomial expansion (e.g., Hermite chaos)
  – $f(x) = F\, \Psi(x)$, where $\Psi(x)$ is a vector of orthogonal polynomials and F is a matrix of unknown coefficients
  – Number of unknown coefficients equals the dimension of the parameter space times the number of basis polynomials
• Evaluation of T can use efficient surrogate forward models y(z)
  – Compute, for example, using a prior gPC expansion…
• Important open question: can we represent monotone functions (or convex functions) efficiently?
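A sketch of the representation f(x) = F Ψ(x) in a low-dimensional, low-order case, using tensor-product probabilists' Hermite polynomials; the dimension, total polynomial order, and (random) coefficient matrix are placeholders.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from itertools import product

n, order = 2, 3                                    # parameter dimension and total polynomial order

# Multi-indices with total degree <= order (total-order truncation).
multi_indices = [a for a in product(range(order + 1), repeat=n) if sum(a) <= order]

def Psi(x):
    """Vector of tensor-product probabilists' Hermite polynomials evaluated at x."""
    return np.array([
        np.prod([hermeval(x[j], np.eye(order + 1)[a[j]]) for j in range(n)])
        for a in multi_indices
    ])

rng = np.random.default_rng(3)
F = rng.standard_normal((n, len(multi_indices)))   # matrix of coefficients (the unknowns; random here)

def f(x):
    return F @ Psi(x)                              # map f(x) = F Psi(x)

# Number of unknowns = n * number of basis polynomials (here 2 * 10 = 20).
print(n * len(multi_indices), f(np.array([0.2, -0.5])))
```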
Outline
1 Motivation and concept
2 Formulation
3 Solution methods
4 Numerical examples
Simple linear example
• 100-dimensional problem (see the sketch below): $A \in \mathbb{R}^{8 \times 100}$; $y, d \in \mathbb{R}^{8}$
• A is randomly generated
• Gaussian posterior:
  $y(x) = Ax, \quad d = y + \varepsilon, \quad X \sim N(0, I), \quad \varepsilon \sim N(0, \Sigma_n)$
• Start from the identity map f(X) = X
• Convergence to the exact solution in 12 iterations
[Plots: convergence of $D_{KL}(p\,\|\,q)$ and $\operatorname{Var}(T)$; evidence p(d)]
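A sketch of setting up an analogous linear-Gaussian test problem and its closed-form Gaussian posterior, against which a computed map could be checked; the noise level and random seeds are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 8                                      # 100-dimensional parameter, 8 observations
A = rng.standard_normal((m, n))                    # randomly generated forward operator
Sigma_n = 0.05 * np.eye(m)                         # assumed noise covariance

x_true = rng.standard_normal(n)
d = A @ x_true + rng.multivariate_normal(np.zeros(m), Sigma_n)

# Exact Gaussian posterior -- the target the computed map should reproduce.
Sigma_z = np.linalg.inv(np.eye(n) + A.T @ np.linalg.inv(Sigma_n) @ A)
mu_z = Sigma_z @ A.T @ np.linalg.inv(Sigma_n) @ d
print(mu_z.shape, Sigma_z.shape)
```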
Reaction kinetics
• Five late-time observations of A; truth is k1 = 1, k2 = 2
• Gaussian prior • Infer k1 and k2
• Convergence criterion: $D_{KL} < 10^{-5}$
[Figure: prior and posterior densities of $(k_1, k_2)$]
Reaction kinetics: map
• 7th-order polynomial map
• Transformation Jacobian is positive definite except at a few points in the tail of the prior
[Figures: first component of posterior; second component of posterior]
Nonlinear PDE
• Reaction–diffusion equation on the unit square:
  $\nabla \cdot \big(\nu \nabla p\big) = 50\, p\, (1 - p)$
  [Schematic: unit square domain with boundary conditions p = 0 and p = 1 on opposite sides]
• Gaussian prior covariance for the log-diffusivity ($L_c = 0.25$, $\sigma = 1.25$):
  $\log(\nu / \nu_0) \sim GP(0, C), \qquad C(r_1, r_2) = \sigma^2 \exp\!\left(-\dfrac{\|r_1 - r_2\|^2}{2 L_c^2}\right)$
• 3481 spatial elements; prior parameterized with 16 Karhunen–Loève (K-L) modes (see the K-L sketch below); 20 observations at random locations
• Third-order map
• At solution: KL divergence = 0.0075, Var[T] = 0.0136
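A sketch of the Karhunen–Loève parameterization used here: discretize the squared-exponential covariance on a grid over the unit square, keep the 16 leading eigenpairs, and map 16 standard-normal weights to a log-diffusivity field. The grid resolution is an arbitrary choice.

```python
import numpy as np

Lc, sigma = 0.25, 1.25
ng = 30                                            # illustrative grid resolution
xs = np.linspace(0, 1, ng)
Xg, Yg = np.meshgrid(xs, xs)
pts = np.column_stack([Xg.ravel(), Yg.ravel()])    # grid points in the unit square

# Squared-exponential covariance C(r1, r2) = sigma^2 exp(-|r1 - r2|^2 / (2 Lc^2)).
d2 = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
C = sigma**2 * np.exp(-d2 / (2 * Lc**2))

# Truncated Karhunen-Loeve expansion: 16 leading modes.
eigvals, eigvecs = np.linalg.eigh(C)
idx = np.argsort(eigvals)[::-1][:16]
lam, phi = eigvals[idx], eigvecs[:, idx]

def log_diffusivity(weights):
    """Map the 16 K-L mode weights (the inference parameters) to a log-diffusivity field."""
    return phi @ (np.sqrt(lam) * weights)

field = log_diffusivity(np.random.default_rng(5).standard_normal(16))
print(field.shape)                                 # one field value per grid point
```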
Results
[Plot: $\det(df/dx)$ evaluated at prior samples $x_i$, against $x_i^T x_i$]
Marginal posterior of K-L mode weights
Results
Map captures posterior dependencies among K-L mode weights
Elliptic PDE with high-dim inputs
• Elliptic PDE in two dimensions, 61×61 spatial grid:
  $\nabla \cdot \big(\kappa \nabla p\big) = 0, \qquad \log(\kappa / \kappa_0) \sim GP(0, C)$
• Log-normal prior with an exponential covariance kernel and large variance: 58 K-L modes, $\sigma = 1.75$, $L_c = 2$
  $C(r_1, r_2) = \sigma^2 \exp\!\left(-\dfrac{\|r_1 - r_2\|}{L_c}\right)$
• 200 observation points
• Map up to 3rd order
Posterior median and 0.1/0.9 quantiles
Elliptic PDE in high dimensions
[Figures: true log-permeability; “gold-standard” MCMC ($10^7$ samples, 4.6 hours, yet effective sample size < 2500); map/optimization (28 minutes)]
Elliptic PDE in high dimensions
• Compare map inference to MCMC at equivalent wall-clock time
[Plot legend: map/optimization = asterisk; MCMC = diamond; long-chain MCMC = line]
Conclusions
• A new map-based approach to Bayesian inference
  – Find a function that pushes forward the prior measure to the posterior
  – Connections with optimal transport theory
  – Full posterior now obtained by solving an optimization problem
  – Clear convergence criterion
  – Evidence computed “for free”
  – Should be recursive and easily parallelizable
• Many open issues…
  – Favorable performance comparison with MCMC on ill-posed PDE-based problems
  – But: overall scaling with dimension and with the size/quality of the data
  – Efficient optimization approaches; better ways to parameterize the map
• Support from DOE/Office of Advanced Scientific Computing Research (ASCR)