On Bayesian Computation

Michael I. Jordan

with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang
Previous Work: Information Constraints on Inference
- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint
- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)
Ongoing Work: Computational Constraints on Inference
- Tradeoffs via convex relaxations
- Tradeoffs via concurrency control
- Bounds via optimization oracles
- Bounds via communication complexity
- Tradeoffs via subsampling
- All of this work has been frequentist; how about Bayesian inference?
Today’s talk: Bayesian Inference and Computation
- Integration rather than optimization
- Part I: Mixing times for MCMC
- Part II: Distributed MCMC
Part I: Mixing Times for MCMC
MCMC and Sparse Regression
- Statistical problem: predict/explain a response variable $Y$ based on a very large number of features $X_1, \ldots, X_d$.
- Formally:

  $$Y = X_1 \beta_1 + X_2 \beta_2 + \cdots + X_d \beta_d + \text{noise}, \quad \text{or}$$
  $$Y_{n \times 1} = X_{n \times d}\, \beta_{d \times 1} + \text{noise} \quad \text{(matrix form)},$$

  where the number of features $d \gg$ sample size $n$.
- Sparsity condition: number of non-zero $\beta_j$'s $\ll n$.
- Goal: variable selection, i.e., identify the influential features.
- Can this be done efficiently (in a computational sense)?
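As a concrete instance of this setup, here is a small synthetic sketch; the sizes and coefficient values are illustrative choices, not values from the talk.

```python
import numpy as np

# Toy instance of the sparse regression setup: d >> n, with only a few
# non-zero coefficients. All sizes/values here are illustrative.
rng = np.random.default_rng(0)
n, d, s_star = 50, 200, 3

X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:s_star] = [2.0, -1.5, 1.0]            # the few influential features
Y = X @ beta + 0.1 * rng.standard_normal(n)

# Variable selection asks: which of the d columns actually generated Y?
support = np.flatnonzero(beta)
```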
Statistical Methodology
- Frequentist approach: penalized optimization

  $$f(\beta) = \underbrace{\frac{1}{2n}\|Y - X\beta\|_2^2}_{\text{goodness of fit}} + \underbrace{P_\lambda(\beta)}_{\text{penalty term}}$$

- Bayesian approach: Monte Carlo sampling

  $$P(\beta \mid Y) \propto \underbrace{P(Y \mid \beta)}_{\text{likelihood}} \times \underbrace{P(\beta)}_{\text{prior probability}}$$
Computation and Bayesian inference
$$P(\beta \mid Y) = \underbrace{C(Y)}_{\text{normalizing constant}} \times \underbrace{P(Y \mid \beta)}_{\text{likelihood}} \times \underbrace{P(\beta)}_{\text{prior probability}}$$

- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC).
- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods.
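To see why the normalizing constant $C(Y)$ is the bottleneck, consider a one-dimensional toy model where it can be brute-forced on a grid; in $d$ dimensions the grid grows exponentially, which is exactly what MCMC sidesteps. The Gaussian likelihood and prior below are illustrative choices, not the talk's model.

```python
import numpy as np

# Brute-force the normalizing constant C(Y) on a grid for a 1-d
# parameter with a N(0,1) prior and N(theta,1) likelihood (toy choices).
rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=20)            # data from N(theta = 1, 1)

theta = np.linspace(-3.0, 3.0, 1201)
dx = theta[1] - theta[0]
log_prior = -0.5 * theta**2                  # N(0,1) prior, up to a constant
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in theta])

unnorm = np.exp(log_prior + log_lik - (log_prior + log_lik).max())
C = unnorm.sum() * dx                        # numerical stand-in for C(Y)
posterior = unnorm / C                       # now integrates to 1

post_mean = np.sum(theta * posterior) * dx   # a posterior expectation
```

With $d$ parameters a grid of this resolution has $1201^d$ points, so exact normalization is hopeless; MCMC only ever evaluates the unnormalized product likelihood × prior.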
Computational Complexity
- Central object of interest: the mixing time $t_{\mathrm{mix}}$ of the Markov chain
- Bound $t_{\mathrm{mix}}$ as a function of the problem parameters $(n, d)$:
  - rapidly mixing: $t_{\mathrm{mix}}$ grows at most polynomially
  - slowly mixing: $t_{\mathrm{mix}}$ grows exponentially
- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found.
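For intuition, the $\varepsilon$-mixing time can be computed exactly for a tiny chain by tracking total-variation distance to the stationary distribution; the two-state chain below is a toy for illustration, unrelated to the regression posterior.

```python
import numpy as np

# epsilon-mixing time of a toy 2-state chain: the first step t at which
# the TV distance between the t-step distribution and the stationary
# distribution drops below eps.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])        # stationary: pi @ P == pi

def mixing_time(P, pi, start, eps=0.25, max_steps=1000):
    mu = np.zeros(len(pi))
    mu[start] = 1.0                          # point mass at the start state
    for t in range(1, max_steps + 1):
        mu = mu @ P
        if 0.5 * np.abs(mu - pi).sum() < eps:  # total-variation distance
            return t
    return max_steps
```

The rapid/slow dichotomy asks how this quantity scales as the state space (here two states; for variable selection, $2^d$ models) grows with $(n, d)$.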
Model
$$\mathcal{M}_\gamma:\quad \text{Linear model:}\quad Y = X_\gamma \beta_\gamma + w, \quad w \sim \mathcal{N}(0, \phi^{-1} I_n)$$
$$\text{Precision prior:}\quad \pi(\phi) \propto \frac{1}{\phi}$$
$$\text{Regression prior:}\quad (\beta_\gamma \mid \gamma) \sim \mathcal{N}\bigl(0,\; g\,\phi^{-1} (X_\gamma^T X_\gamma)^{-1}\bigr)$$
$$\text{Sparsity prior:}\quad \pi(\gamma) \propto \Bigl(\frac{1}{p}\Bigr)^{\kappa|\gamma|} \mathbb{I}[|\gamma| \le s_0].$$
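Under this hierarchy, $\phi$ and $\beta_\gamma$ integrate out in closed form (standard Zellner $g$-prior algebra), giving a tractable per-model score. The sketch below assumes the usual formula $p(Y \mid \gamma) \propto (1+g)^{-|\gamma|/2}\bigl(\|Y\|^2 - \tfrac{g}{1+g} Y^T \Phi_\gamma Y\bigr)^{-n/2}$ with constants dropped; the function name and synthetic data are ours.

```python
import numpy as np

# Hedged sketch: unnormalized log posterior score of a model gamma under
# the hierarchy above, with phi and beta_gamma integrated out:
#   p(Y|gamma) ∝ (1+g)^(-|gamma|/2) (||Y||^2 - g/(1+g) Y' Phi_gamma Y)^(-n/2)
def log_model_score(Y, X, gamma, g, kappa):
    n, p = X.shape
    ss = Y @ Y
    if len(gamma) > 0:
        Xg = X[:, gamma]
        proj = Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ Y)   # Phi_gamma Y
        ss -= g / (1.0 + g) * (Y @ proj)
    log_marginal = -0.5 * len(gamma) * np.log(1.0 + g) - 0.5 * n * np.log(ss)
    log_prior = -kappa * len(gamma) * np.log(p)            # (1/p)^(kappa|gamma|)
    return log_marginal + log_prior

# Usage on synthetic data: the true model should outscore a wrong one.
rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.standard_normal((n, p))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)
true_better = (log_model_score(Y, X, [0, 1], g=p**2, kappa=1.0)
               > log_model_score(Y, X, [5], g=p**2, kappa=1.0))
```

The $(1+g)^{-|\gamma|/2}$ factor and the $p^{-\kappa|\gamma|}$ sparsity prior both penalize model size, which is what keeps the posterior concentrated on small models.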
Assumption A (Conditions on $\beta^*$)

The true regression vector has components $\beta^* = (\beta^*_S, \beta^*_{S^c})$ that satisfy the bounds

$$\text{Full } \beta^* \text{ condition:}\quad \Bigl\| \tfrac{1}{\sqrt{n}} X\beta^* \Bigr\|_2^2 \le g\,\sigma_0^2\, \frac{\log p}{n}$$
$$\text{Off-support } S^c \text{ condition:}\quad \Bigl\| \tfrac{1}{\sqrt{n}} X_{S^c}\beta^*_{S^c} \Bigr\|_2^2 \le L\,\sigma_0^2\, \frac{\log p}{n},$$

for some $L \ge 0$.
Assumption B (Conditions on the design matrix)

The design matrix has been normalized so that $\|X_j\|_2^2 = n$ for all $j = 1, \ldots, p$; moreover, letting $Z \sim \mathcal{N}(0, I_n)$, there exist constants $\nu > 0$ and $\widetilde{L} < \infty$ such that $\widetilde{L}\nu \ge 4$ and

$$\text{Lower restricted eigenvalue (RE($s$)):}\quad \min_{|\gamma| \le s}\; \lambda_{\min}\Bigl(\frac{1}{n} X_\gamma^T X_\gamma\Bigr) \ge \nu,$$
$$\text{Sparse projection condition (SI($s$)):}\quad \mathbb{E}_Z\Bigl[\max_{|\gamma| \le s}\; \max_{k \in [p]\setminus\gamma}\; \frac{1}{\sqrt{n}} \bigl|\langle (I - \Phi_\gamma) X_k,\, Z\rangle\bigr|\Bigr] \le \frac{1}{2}\sqrt{\widetilde{L}\,\nu \log p},$$

where $\Phi_\gamma$ denotes projection onto the span of $\{X_j,\; j \in \gamma\}$.

This is a mild assumption, needed in the information-theoretic analysis of variable selection.
Assumption C (Choices of prior hyperparameters)

The noise hyperparameter $g$ and sparsity penalty hyperparameter $\kappa > 0$ are chosen such that

$$g \asymp p^{2\alpha} \text{ for some } \alpha > 0, \quad \text{and} \quad \kappa + \alpha \ge C_1(L + \widetilde{L}) + 2 \text{ for some universal constant } C_1 > 0.$$
Assumption D (Sparsity control)

For a constant $C_0 > 8$, one of the two following conditions holds:

Version D($s^*$): We set $s_0 := p$ in the sparsity prior, and the true sparsity $s^*$ is bounded as

$$\max\{1, s^*\} \le \frac{1}{8 C_0 K}\Bigl\{ \frac{n}{\log p} - 16 L \sigma_0^2 \Bigr\}$$

for some constant $K \ge 4 + \alpha + cL$, where $c$ is a universal constant.

Version D($s_0$): The sparsity parameter $s_0$ satisfies the sandwich relation

$$\max\bigl\{1,\; \bigl(2\nu^{-2}\,\omega(X) + 1\bigr) s^*\bigr\} \le s_0 \le \frac{1}{8 C_0 K}\Bigl\{ \frac{n}{\log p} - 16 L \sigma_0^2 \Bigr\},$$

where $\omega(X) := \max_\gamma \bigl\| (X_\gamma^T X_\gamma)^{-1} X_\gamma^T X_{\gamma^* \setminus \gamma} \bigr\|_{\mathrm{op}}^2$.
Our Results
Theorem (Posterior concentration)

Given Assumption A, Assumption B with $s = K s^*$, Assumption C, and Assumption D($s^*$), if $C_\beta$ satisfies

$$C_\beta^2 \ge c_0\, \nu^{-2} \bigl(L + \widetilde{L} + \alpha + \kappa\bigr)\, \sigma_0^2\, \frac{\log p}{n},$$

then $\pi_n(\gamma^* \mid Y) \ge 1 - c_1 p^{-1}$ with high probability.

Compare to $\ell_1$-based approaches, which require an irrepresentable condition:

$$\max_{k \in [d],\; S:\, |S| = s^*} \|X_k^T X_S (X_S^T X_S)^{-1}\|_1 < 1.$$
Our Results
Theorem (Rapid mixing)
Suppose that Assumption A, Assumption B with s = s0,Assumption C, and Assumption D(s0) all hold. Then there areuniversal constants c1, c2 such that, for any ε ∈ (0, 1), the ε-mixingtime of the Metropolis-Hastings chain is upper bounded as
τε ≤ c1 ps20
(c2α (n + s0) log p + log(1/ε) + 2
)with probability at least 1− c3p
−c4 .
High Level Proof Idea
- A Metropolis–Hastings random walk on the $d$-dimensional hypercube $\{0, 1\}^d$
- Canonical path ensemble argument: for any model $\gamma \ne \gamma^*$, find a path from $\gamma$ to $\gamma^*$ along which acceptance ratios are high, where $\gamma^*$ is the true model
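The chain described above can be sketched in a few lines: propose flipping a single coordinate of $\gamma$ and accept with the usual Metropolis–Hastings ratio. The $g$-prior-style score below is a simplified stand-in for the Part I hierarchy (an assumption made to keep the sketch short), and the function names are ours.

```python
import numpy as np

# Metropolis-Hastings over {0,1}^d for variable selection, with a
# simplified g-prior model score standing in for the talk's hierarchy.
def log_post(Y, X, gamma, g=100.0, kappa=1.0):
    n, d = X.shape
    idx = np.flatnonzero(gamma)
    ss = Y @ Y
    if idx.size:
        Xg = X[:, idx]
        ss -= g / (1.0 + g) * (Y @ (Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ Y)))
    return (-0.5 * idx.size * np.log(1.0 + g)
            - 0.5 * n * np.log(ss)
            - kappa * idx.size * np.log(d))

def mh_select(Y, X, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    gamma = np.zeros(d, dtype=bool)          # start from the empty model
    cur = log_post(Y, X, gamma)
    for _ in range(n_iter):
        j = rng.integers(d)                  # single-coordinate flip proposal
        gamma[j] = not gamma[j]
        prop = log_post(Y, X, gamma)
        if np.log(rng.uniform()) < prop - cur:
            cur = prop                       # accept
        else:
            gamma[j] = not gamma[j]          # reject: undo the flip
    return gamma
```

On well-conditioned synthetic designs this chain typically locates the true support within a few thousand flips, which is the behavior the rapid-mixing theorem formalizes.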
Part II: Distributed MCMC
Traditional MCMC
- Serial, iterative algorithm for generating samples
- Slow for two reasons:
  (1) Large number of iterations required to converge
  (2) Each iteration depends on the entire dataset
- Most research on MCMC has targeted (1)
- Recent threads of work target (2)
Serial MCMC

[Figure: data flows through a single core to produce samples]

Data-parallel MCMC

[Figure: the data are partitioned across parallel cores, each producing its own "samples"]

Aggregate samples from across partitions, but how?

[Figure: the per-partition "samples" are combined in an aggregation step]
Factorization motivates a data-parallel approach

$$\underbrace{\pi(\theta \mid x)}_{\text{posterior}} \propto \underbrace{\pi(\theta)}_{\text{prior}}\; \underbrace{\pi(x \mid \theta)}_{\text{likelihood}} = \prod_{j=1}^{J} \underbrace{\pi(\theta)^{1/J}\, \pi(x^{(j)} \mid \theta)}_{\text{sub-posterior}}$$

- Partition the data as $x^{(1)}, \ldots, x^{(J)}$ across $J$ cores
- The $j$th core samples from a distribution proportional to the $j$th sub-posterior (a 'piece' of the full posterior)
- Aggregate the sub-posterior samples to form approximate full posterior samples
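The scheme above can be sketched for a Gaussian mean with known variance, where each sub-posterior (flattened prior $\pi(\theta)^{1/J}$ times the shard likelihood) is available in closed form, so the parallel samplers are exact. The model and all numbers are illustrative choices.

```python
import numpy as np

# Data-parallel sketch: J "cores" each sample their sub-posterior
# pi(theta)^(1/J) * pi(x_j | theta) for a Gaussian mean (toy model).
rng = np.random.default_rng(2)
J, n, sigma, tau = 4, 400, 1.0, 10.0
x = rng.normal(1.5, sigma, size=n)
shards = np.array_split(x, J)                # partition across "cores"

def subposterior_samples(xj, n_samp=5000):
    # Flattened prior: N(0, tau^2)^(1/J) is proportional to N(0, J*tau^2).
    prec = 1.0 / (J * tau**2) + len(xj) / sigma**2
    mean = (xj.sum() / sigma**2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_samp)

sub = np.stack([subposterior_samples(s) for s in shards])  # J x n_samp

# Naive aggregation: average the k-th draw across the J shards.
approx = sub.mean(axis=0)

# Exact full posterior mean, for reference.
full_prec = 1.0 / tau**2 + n / sigma**2
full_mean = (x.sum() / sigma**2) / full_prec
```

In this conjugate toy case the naive average already lands near the full posterior mean; the aggregation question becomes interesting precisely when sub-posteriors are non-Gaussian or differ in scale.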
Aggregation strategies for sub-posterior samples

- Sub-posterior density estimation (Neiswanger et al., UAI 2014)
- Weierstrass samplers (Wang & Dunson, 2013)
- Weighted averaging of sub-posterior samples
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)
Aggregate 'horizontally' across partitions

[Figure: per-partition "samples" from the parallel cores are combined by an aggregation step]
Naïve aggregation = average

$$\text{Aggregate}(\theta_1, \theta_2) = 0.5\,\theta_1 + 0.5\,\theta_2$$
Less naïve aggregation = weighted average

$$\text{Aggregate}(\theta_1, \theta_2) = 0.58\,\theta_1 + 0.42\,\theta_2$$
Consensus Monte Carlo (Scott et al., 2013)

$$\text{Aggregate}(\theta_1, \theta_2) = W_1\,\theta_1 + W_2\,\theta_2$$

- Weights are inverse covariance matrices
- Motivated by Gaussian assumptions
- Designed at Google for the MapReduce framework
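The inverse-covariance weighting above is exact when every sub-posterior is Gaussian. A minimal sketch, with synthetic sub-posterior draws standing in for real shard output:

```python
import numpy as np

# Consensus Monte Carlo combination: weight each sub-posterior draw by
# the inverse of that shard's sample covariance. The Gaussian draws
# below are synthetic stand-ins for real sub-posterior samples.
rng = np.random.default_rng(3)
J, n_samp, dim = 3, 2000, 2

mus = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
sub = [rng.multivariate_normal(mu, np.eye(dim) * (j + 1), size=n_samp)
       for j, mu in enumerate(mus)]          # shard j ~ N(mu_j, (j+1) I)

weights = [np.linalg.inv(np.cov(s.T)) for s in sub]   # W_j = Sigma_j^{-1}
W_total_inv = np.linalg.inv(sum(weights))

# Consensus draw k: theta_k = (sum_j W_j)^{-1} sum_j W_j theta_{j,k}
consensus = np.stack([
    W_total_inv @ sum(w @ s[k] for w, s in zip(weights, sub))
    for k in range(n_samp)
])
```

The combined draws concentrate around the precision-weighted mean of the shard means, i.e., exactly what a Gaussian product of the sub-posteriors would give.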
Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

$F$ = aggregation function
$q_F$ = approximate distribution

$$\underbrace{\widehat{\mathcal{L}}(F)}_{\text{objective}} = \underbrace{\mathbb{E}_{q_F}[\log \pi(X, \theta)]}_{\text{likelihood}} + \underbrace{\widehat{H}[q_F]}_{\text{relaxed entropy}}$$

No mean field assumption
Variational Consensus Monte Carlo

$$\text{Aggregate}(\theta_1, \theta_2) = W_1\,\theta_1 + W_2\,\theta_2$$

- Optimize over the weight matrices
- Restrict to valid solutions when parameter vectors are constrained
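A heavily simplified sketch of the idea of optimizing the aggregation function: treat the weights as free parameters and choose them so the aggregated draws score well under the unnormalized log posterior. The real method optimizes the relaxed variational Bayes objective including the entropy term; here we drop the entropy term and grid-search a single scalar weight, purely to illustrate "optimize the aggregation". All distributions and numbers are toy choices.

```python
import numpy as np

# Toy VCMC-flavored weight optimization: pick the aggregation weight w
# maximizing the average log-target of the aggregated draws (entropy
# term omitted; this is an illustrative simplification, not the method).
rng = np.random.default_rng(4)

def log_target(theta):            # toy unnormalized log posterior: N(0.7, 0.1^2)
    return -0.5 * ((theta - 0.7) / 0.1) ** 2

sub1 = rng.normal(0.4, 0.1, size=4000)   # illustrative sub-posterior draws
sub2 = rng.normal(1.0, 0.1, size=4000)

def score(w):                     # average log-target of aggregated draws
    return np.mean(log_target(w * sub1 + (1.0 - w) * sub2))

grid = np.linspace(0.0, 1.0, 101)
best_w = grid[np.argmax([score(w) for w in grid])]
```

Because the two sub-posteriors sit symmetrically around the target, the optimized weight lands near 0.5; with asymmetric shards the optimizer would tilt the weights accordingly, which is the behavior CMC's fixed Gaussian weighting can miss.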
Variational Consensus Monte Carlo

Theorem (Entropy relaxation)

Under mild structural assumptions, we can choose

$$\widehat{H}[q_F] = c_0 + \frac{1}{K} \sum_{k=1}^{K} h_k(F),$$

with each $h_k$ a concave function of $F$, such that

$$H[q_F] \ge \widehat{H}[q_F].$$

We therefore have $\mathcal{L}(F) \ge \widehat{\mathcal{L}}(F)$.
Variational Consensus Monte Carlo
Theorem (Concavity of the variational Bayes objective)

Under mild structural assumptions, the relaxed variational Bayes objective

$$\widehat{\mathcal{L}}(F) = \mathbb{E}_{q_F}[\log \pi(X, \theta)] + \widehat{H}[q_F]$$

is concave in $F$.
Empirical evaluation
- Compare three aggregation strategies:
  - Uniform average
  - Gaussian-motivated weighted average (CMC)
  - Optimized weighted average (VCMC)
- For each algorithm $A$, report the approximation error of some expectation $\mathbb{E}_\pi[f]$, relative to serial MCMC:

  $$\varepsilon_A(f) = \frac{\bigl|\mathbb{E}_A[f] - \mathbb{E}_{\mathrm{MCMC}}[f]\bigr|}{\bigl|\mathbb{E}_{\mathrm{MCMC}}[f]\bigr|}$$

- Preliminary speedup results
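The error metric in the list above, written out as a small helper (the function name is ours):

```python
import numpy as np

# Relative approximation error of an algorithm's estimate of E_pi[f]
# against the serial-MCMC estimate, elementwise for vector-valued f.
def relative_error(est_algo, est_mcmc):
    est_algo = np.asarray(est_algo, dtype=float)
    est_mcmc = np.asarray(est_mcmc, dtype=float)
    return np.abs(est_algo - est_mcmc) / np.abs(est_mcmc)
```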
Example 1: High-dimensional Bayesian probit regression

#data = 100,000, $d = 300$

[Figure: first moment estimation error, relative to serial MCMC (error truncated at 2.0)]
Example 2: High-dimensional covariance estimation

Normal-inverse-Wishart model; #data = 100,000, #dim = 100, hence 5,050 parameters

[Figures: (L) first moment estimation error, (R) eigenvalue estimation error]
Example 3: Mixture of eight 8-dimensional Gaussians

Error relative to serial MCMC, for cluster comembership probabilities of pairs of test data points

VCMC reduces CMC error at the cost of some speedup (roughly 2x)

[Figure: VCMC speedup is approximately linear]
Discussion
Contributions
- Convex optimization framework for Consensus Monte Carlo
- Structured aggregation accounting for constrained parameters
- Entropy relaxation
- Empirical evaluation