Scaling up Bayesian Inference

David Dunson

Departments of Statistical Science, Mathematics & ECE, Duke University

May 1, 2017

Outline

Motivation & background

EP-MCMC

aMCMC

Discussion

Complex & high-dimensional data

• Interest in developing new methods for analyzing & interpreting complex, high-dimensional data

• Such data arise routinely in broad fields of the sciences, engineering & even the arts & humanities

• Despite huge interest in big data, there are vast gaps that have fundamentally limited progress in many fields

‘Typical’ approaches to big data

• There is an increasingly immense literature focused on big data

• Most of the focus has been on optimization-style methods

• Rapidly obtaining a point estimate even when the sample size n & the overall ‘size’ of the data are immense

• Bandwagons: many people work on quite similar problems, while critical open problems remain untouched

My focus - probability models

• General probabilistic inference algorithms for complex data

• We would like to be able to handle arbitrarily complex probability models

• Algorithms scalable to huge data - potentially using many computers

• Accurate uncertainty quantification (UQ) is a critical issue

• Robustness of inferences also crucial

• Particular emphasis on scientific applications - limited labeled data

Bayes approaches

• Bayesian methods offer an attractive general approach for modeling complex data

• Choosing a prior π(θ) & likelihood L(Y^(n) | θ), the posterior is

\[
\pi_n(\theta \mid Y^{(n)}) \;=\; \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{\int \pi(\theta)\, L(Y^{(n)} \mid \theta)\, d\theta} \;=\; \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{L(Y^{(n)})}.
\]

• Often θ is moderate to high-dimensional & the integral in the denominator is intractable

• Accurate analytic approximations to the posterior have proven elusive outside of narrow settings

• Markov chain Monte Carlo (MCMC) & other posterior sampling algorithms remain the standard (a minimal example follows below)

• Scaling MCMC to big & complex settings is challenging
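
Since the normalizing constant L(Y^(n)) cancels in Metropolis-Hastings acceptance ratios, a sampler only ever needs the unnormalized posterior π(θ)L(Y^(n) | θ). A minimal random-walk Metropolis sketch in Python; the Gaussian toy model, prior & step size are illustrative assumptions, not from the talk:

    import numpy as np

    def log_unnorm_post(theta, y):
        """log pi(theta) + log L(y | theta), up to the intractable normalizing constant.
        Toy assumption: y_i ~ N(theta, 1) with a N(0, 10^2) prior on theta."""
        log_prior = -0.5 * theta**2 / 10.0**2
        log_lik = -0.5 * np.sum((y - theta) ** 2)
        return log_prior + log_lik

    def random_walk_metropolis(y, n_iter=5000, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        theta, draws = 0.0, np.empty(n_iter)
        for t in range(n_iter):
            prop = theta + step * rng.standard_normal()
            # acceptance ratio uses only the unnormalized posterior
            if np.log(rng.uniform()) < log_unnorm_post(prop, y) - log_unnorm_post(theta, y):
                theta = prop
            draws[t] = theta
        return draws

    y = np.random.default_rng(1).normal(1.0, 1.0, size=1000)
    draws = random_walk_metropolis(y)
    print(draws[1000:].mean(), draws[1000:].std())

Note that every iteration touches all n observations, which is exactly the bottleneck discussed next.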

MCMC & Computational bottlenecks

• MCMC constructs a Markov chain with stationary distribution π_n(θ | Y^(n))

• A transition kernel is carefully chosen & iterative sampling proceeds

• Time per iteration increases with the # of parameters/unknowns

• Mixing worsens as the dimension of the data increases

• Storing & basic processing of big data sets is problematic

• Usually multiple likelihood and/or gradient evaluations at each iteration

Solutions

• Embarrassingly parallel (EP) MCMC: run MCMC in parallel for different subsets of data & combine.

• Approximate MCMC (aMCMC): approximate expensive-to-evaluate transition kernels.

• Hybrid algorithms: run MCMC for a subset of the parameters & use a fast estimate for the others.

• Designer MCMC: define clever kernels that solve mixing problems in high dimensions.

• I'll focus on EP-MCMC & aMCMC in the remainder.

Embarrassingly parallel MCMC

• Divide the large sample size n data set into many smaller data sets stored on different machines

• Draw posterior samples for each subset posterior in parallel

• ‘Magically’ combine the results quickly & simply (see the sketch below)
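
A minimal sketch of the embarrassingly parallel workflow, assuming a placeholder run_mcmc(subset) sampler & a downstream combine rule (the combination step is the subject of the WASP/PIE slides that follow):

    import numpy as np
    from multiprocessing import Pool

    def run_mcmc(data_subset, n_draws=2000):
        """Placeholder subset sampler: returns draws from one subset posterior.
        In practice this is a full MCMC run on one machine/worker."""
        # toy stand-in: posterior of a normal mean under a flat prior
        m = data_subset.mean()
        s = data_subset.std(ddof=1) / np.sqrt(len(data_subset))
        return np.random.default_rng().normal(m, s, size=n_draws)

    def ep_mcmc(data, k=8):
        subsets = np.array_split(data, k)               # divide the data across k workers
        with Pool(processes=k) as pool:
            subset_draws = pool.map(run_mcmc, subsets)  # sample each subset posterior in parallel
        return subset_draws                             # 'magic' combination step comes next

    if __name__ == "__main__":
        data = np.random.default_rng(1).normal(0.0, 1.0, size=100_000)
        print([d.mean() for d in ep_mcmc(data)])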

Toy Example: Logistic Regression

[Figure: posterior draws for (β1, β2) from the toy logistic regression, comparing the full-data MCMC posterior, the individual subset posteriors & the WASP combination.]

\[
\mathrm{pr}(y_i = 1 \mid x_{i1}, \ldots, x_{ip}, \theta) \;=\; \frac{\exp\big(\sum_{j=1}^{p} x_{ij}\beta_j\big)}{1 + \exp\big(\sum_{j=1}^{p} x_{ij}\beta_j\big)}.
\]

• Subset posteriors: ‘noisy’ approximations of the full data posterior.

• ‘Averaging’ of the subset posteriors reduces this ‘noise’ & leads to an accurate posterior approximation.
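
For reference, the likelihood displayed above in code form; a sketch with simulated data & arbitrarily chosen coefficients:

    import numpy as np

    def logistic_prob(X, beta):
        """pr(y_i = 1 | x_i, beta) = exp(x_i' beta) / (1 + exp(x_i' beta))."""
        return 1.0 / (1.0 + np.exp(-(X @ beta)))   # same quantity, numerically stabler form

    def log_likelihood(beta, X, y):
        """Bernoulli log-likelihood targeted by the full-data & subset samplers."""
        p = logistic_prob(X, beta)
        return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 2))
    beta_true = np.array([-1.0, 1.0])
    y = rng.binomial(1, logistic_prob(X, beta_true))
    print(log_likelihood(beta_true, X, y))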

Stochastic Approximation

• Full data posterior density of inid (independent, non-identically distributed) data Y^(n)

\[
\pi_n(\theta \mid Y^{(n)}) \;=\; \frac{\prod_{i=1}^{n} p_i(y_i \mid \theta)\, \pi(\theta)}{\int_{\Theta} \prod_{i=1}^{n} p_i(y_i \mid \theta)\, \pi(\theta)\, d\theta}.
\]

• Divide the full data Y^(n) into k subsets of size m: Y^(n) = (Y_[1], …, Y_[j], …, Y_[k]).

• Subset posterior density for the j-th data subset

\[
\pi^{\gamma}_m(\theta \mid Y_{[j]}) \;=\; \frac{\prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^{\gamma}\, \pi(\theta)}{\int_{\Theta} \prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^{\gamma}\, \pi(\theta)\, d\theta}.
\]

• γ = O(k) - chosen to minimize the approximation error; raising the subset likelihood to the power γ ≈ k compensates for conditioning on only m = n/k observations (a subset sampler using this trick is sketched below)
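
A sketch of one subset sampler using this stochastic approximation, i.e. raising the subset likelihood to the power γ = k inside an otherwise ordinary Metropolis step; the Gaussian model, prior & step size are illustrative assumptions:

    import numpy as np

    def log_powered_subset_post(theta, y_subset, gamma):
        """gamma * subset log-likelihood + log prior (the unnormalized powered subset posterior).
        Toy assumption: y_i ~ N(theta, 1) with a N(0, 10^2) prior on theta."""
        log_lik = -0.5 * np.sum((y_subset - theta) ** 2)
        log_prior = -0.5 * theta**2 / 100.0
        return gamma * log_lik + log_prior

    def subset_mcmc(y_subset, gamma, n_iter=5000, step=0.05, seed=0):
        rng = np.random.default_rng(seed)
        theta, draws = 0.0, np.empty(n_iter)
        for t in range(n_iter):
            prop = theta + step * rng.standard_normal()
            log_ratio = (log_powered_subset_post(prop, y_subset, gamma)
                         - log_powered_subset_post(theta, y_subset, gamma))
            if np.log(rng.uniform()) < log_ratio:
                theta = prop
            draws[t] = theta
        return draws

    # k powered subset posteriors, each conditioning on m = n/k observations
    y = np.random.default_rng(2).normal(1.0, 1.0, size=20_000)
    k = 10
    subset_draws = [subset_mcmc(y_j, gamma=k, seed=j) for j, y_j in enumerate(np.array_split(y, k))]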

Barycenter in Metric Spaces

WAsserstein barycenter of Subset Posteriors (WASP)

Srivastava, Li & Dunson (2015)

• 2-Wasserstein distance between µ, ν ∈ P_2(Θ):

\[
W_2(\mu,\nu) \;=\; \inf\Big\{ \big( E[d^2(X,Y)] \big)^{1/2} : \mathrm{law}(X) = \mu, \; \mathrm{law}(Y) = \nu \Big\}.
\]

• Π^γ_m(· | Y_[j]) for j = 1, …, k are combined through the WASP

\[
\Pi^{\gamma}_n(\cdot \mid Y^{(n)}) \;=\; \operatorname*{argmin}_{\Pi \in P_2(\Theta)} \; \frac{1}{k} \sum_{j=1}^{k} W_2^2\big( \Pi, \Pi^{\gamma}_m(\cdot \mid Y_{[j]}) \big). \qquad \text{[Agueh \& Carlier (2011)]}
\]

• Plugging in Π̂^γ_m(· | Y_[j]) for j = 1, …, k, a linear program (LP) can be used for fast estimation of an atomic approximation

LP Estimation of WASP

• Minimizing the Wasserstein distance is the solution to a discrete optimal transport problem

• Let µ = ∑_{j=1}^{J1} a_j δ_{θ_{1j}}, ν = ∑_{l=1}^{J2} b_l δ_{θ_{2l}} & M_{12} ∈ ℜ^{J1×J2} = matrix of squared differences in the atoms {θ_{1j}}, {θ_{2l}}

• Optimal transport polytope: T(a,b) = set of nonnegative matrices w/ row sums a & column sums b

• Objective is to find T ∈ T(a,b) minimizing tr(T^T M_{12})

• For WASP, generalize to a multimargin optimal transport problem - entropy smoothing has been used previously

• We can avoid such smoothing & use sparse LP solvers - negligible computation cost compared to sampling (see the two-margin sketch below)
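
A minimal sketch of the two-margin problem as an LP with scipy's solver, assuming atomic measures with weights a, b & atoms theta1, theta2; the multimargin LP used for the WASP has the same structure with one marginal constraint block per subset:

    import numpy as np
    from scipy.optimize import linprog

    def wasserstein2_lp(theta1, a, theta2, b):
        """min over T >= 0 with row sums a, column sums b of <T, M12>, M12[j,l] = (theta1[j]-theta2[l])^2."""
        J1, J2 = len(a), len(b)
        M12 = (theta1[:, None] - theta2[None, :]) ** 2        # squared differences in the atoms
        A_eq = np.zeros((J1 + J2, J1 * J2))
        for j in range(J1):
            A_eq[j, j * J2:(j + 1) * J2] = 1.0                # sum_l T[j, l] = a[j]
        for l in range(J2):
            A_eq[J1 + l, l::J2] = 1.0                         # sum_j T[j, l] = b[l]
        res = linprog(M12.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                      bounds=(0, None), method="highs")
        T = res.x.reshape(J1, J2)
        return np.sqrt(np.sum(T * M12)), T                    # W2 distance & optimal transport plan

    # toy check on two small atomic approximations of subset posteriors
    rng = np.random.default_rng(0)
    t1, t2 = rng.normal(0.0, 1.0, 50), rng.normal(0.2, 1.0, 60)
    w2, _ = wasserstein2_lp(t1, np.full(50, 1 / 50), t2, np.full(60, 1 / 60))
    print(w2)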

WASP: Theorems

Theorem (Subset Posteriors). Under "usual" regularity conditions, there exists a constant C_1, independent of the subset posteriors, such that for large m,

\[
E_{P^{[j]}_{\theta_0}}\, W_2^2\big\{ \Pi^{\gamma}_m(\cdot \mid Y_{[j]}),\, \delta_{\theta_0}(\cdot) \big\} \;\le\; C_1 \left( \frac{\log^2 m}{m} \right)^{1/\alpha}, \qquad j = 1, \ldots, k.
\]

Theorem (WASP). Under "usual" regularity conditions and for large m,

\[
W_2\big\{ \Pi^{\gamma}_n(\cdot \mid Y^{(n)}),\, \delta_{\theta_0}(\cdot) \big\} \;=\; O_{P^{(n)}_{\theta_0}}\!\left( \sqrt{\frac{\log^{2/\alpha} m}{k\, m^{1/\alpha}}}\, \right).
\]

Simple & Fast Posterior Interval Estimation (PIE)

Li, Srivastava & Dunson (2015)

• Usually report point & interval estimates for different 1-d functionals - a multidimensional posterior is difficult to interpret

• WASP has an explicit relationship with the subset posteriors in 1-d

• Quantiles of the WASP are simple averages of the quantiles of the subset posteriors

• Leads to a super trivial algorithm - run MCMC for each subset & average quantiles (sketched below) - reminiscent of the bag of little bootstraps

• Strong theory showing accuracy of the resulting approximation

• Can be implemented in Stan, which allows powered likelihoods
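
A sketch of the PIE combination rule for a single 1-d functional; subset_draws stands in for the k sets of MCMC draws from the powered subset posteriors:

    import numpy as np

    def pie_quantiles(subset_draws, probs=np.linspace(0.005, 0.995, 199)):
        """PIE: estimate each quantile of the combined posterior by averaging
        the corresponding quantiles of the k subset posteriors."""
        Q = np.stack([np.quantile(d, probs) for d in subset_draws])   # k x len(probs)
        return probs, Q.mean(axis=0)

    # toy subset draws standing in for k subset-posterior MCMC runs
    rng = np.random.default_rng(0)
    subset_draws = [rng.normal(1.0, 0.1, size=2000) for _ in range(10)]
    probs, q = pie_quantiles(subset_draws)
    median = np.interp(0.5, probs, q)
    interval = (np.interp(0.025, probs, q), np.interp(0.975, probs, q))
    print(median, interval)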

Theory on PIE/1-d WASP

• We show the 1-d WASP Π̄_n(ξ | Y^(n)) is a highly accurate approximation to the exact posterior Π_n(ξ | Y^(n))

• As the subset sample size m increases, the W_2 distance between them decreases at a faster than parametric rate, o_p(n^{-1/2})

• The theorem allows k = O(n^c) and m = O(n^{1-c}) for any c ∈ (0,1), so m can increase very slowly relative to k (recall n = mk)

• Their biases, variances & quantiles only differ in high orders of the total sample size

• Conditions: standard, mild conditions on the likelihood + prior finite 2nd moment & uniform integrability of the subset posteriors

Results

• We have implemented these for a rich variety of data & models

• Logistic & linear random effects models, mixture models, matrix & tensor factorizations, Gaussian process regression

• Nonparametric models, dependence, hierarchical models, etc.

• We compare to long runs of MCMC (when feasible) & variational Bayes (VB)

• WASP/PIE is much faster than MCMC & highly accurate

• Carefully designed VB implementations often do very well

aMCMC

Johndrow, Mattingly, Mukherjee & Dunson (2015)

• A different way to speed up MCMC - replace expensive transition kernels with approximations

• For example, approximate a conditional distribution in a Gibbs sampler with a Gaussian, or use a subsample of the data

• Can potentially vastly speed up MCMC sampling in high-dimensional settings

• The original MCMC sampler converges to a stationary distribution corresponding to the exact posterior

• Not clear what happens when we start substituting in approximations - the chain may diverge, etc.

aMCMC Overview

• aMCMC is used routinely in an essentially ad hoc manner

• Our goal: obtain theoretical guarantees & use these to target the design of algorithms

• Define an ‘exact’ MCMC algorithm, which is computationally intractable but has good mixing

• The ‘exact’ chain converges to the stationary distribution corresponding to the exact posterior

• Approximate the kernel in the exact chain with a more computationally tractable alternative

Sketch of theory

• Define s_ε = τ_1(P)/τ_1(P_ε) = computational speed-up, where τ_1(P) = time for one step with transition kernel P

• Interest: optimizing the computational time-accuracy tradeoff for estimators of Πf = ∫_Θ f(θ) Π(dθ | x)

• We provide tight, finite-sample bounds on the L2 error

• aMCMC estimators win for low computational budgets but have asymptotic bias

• Often a larger approximation error → a larger s_ε, & rougher approximations are better when speed is super important (see the toy illustration below)
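
A toy illustration (not from the paper; all constants are made up) of the tradeoff those bounds formalize: an approximate chain with speed-up s_ε takes more steps per second but pays an asymptotic bias, so it wins only up to some budget:

    import numpy as np

    def l2_error(budget_sec, sec_per_step, asym_bias, sigma2=1.0):
        """Crude L2 error of an MCMC time-average under a wall-clock budget:
        sqrt(bias^2 + Monte Carlo variance), with variance ~ sigma2 / (# steps)."""
        n_steps = max(budget_sec / sec_per_step, 1.0)
        return np.sqrt(asym_bias**2 + sigma2 / n_steps)

    tau_exact, bias_eps, s_eps = 1e-2, 0.02, 20.0   # made-up step time, aMCMC bias & speed-up
    for budget in [1, 10, 100, 1_000, 10_000]:      # seconds of compute
        exact = l2_error(budget, tau_exact, asym_bias=0.0)
        approx = l2_error(budget, tau_exact / s_eps, asym_bias=bias_eps)
        print(f"budget {budget:>6}s   exact {exact:.4f}   aMCMC {approx:.4f}")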

Ex 1: Approximations using subsets

• Replace the full data likelihood with

\[
L_{\varepsilon}(x \mid \theta) \;=\; \Big( \prod_{i \in V} L(x_i \mid \theta) \Big)^{n/|V|},
\]

for a randomly chosen subset V ⊂ {1, …, n}.

• Applied to Pólya-Gamma data augmentation for logistic regression

• Different V at each iteration – a trivial modification to the Gibbs sampler (see the sketch below)

• Assumptions hold with high probability for subsets above a minimal size (wrt the distribution of subsets, data & kernel)
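
A sketch of the approximate likelihood evaluation, shown inside a generic Metropolis step with a flat prior rather than the Pólya-Gamma Gibbs sampler of the example; the logistic log-likelihood, proposal & tuning constants are illustrative:

    import numpy as np

    def log_lik_subsample(beta, X, y, V):
        """log L_eps(x | theta): subset Bernoulli-logit log-likelihood scaled by n / |V|."""
        eta = X[V] @ beta
        ll = np.sum(y[V] * eta - np.logaddexp(0.0, eta))
        return (len(y) / len(V)) * ll

    def approx_mh_step(beta, X, y, subset_size, rng, step=0.02):
        # flat prior on beta for simplicity (an illustrative assumption)
        V = rng.choice(len(y), size=subset_size, replace=False)   # fresh subset V each iteration
        prop = beta + step * rng.standard_normal(beta.shape)
        log_ratio = log_lik_subsample(prop, X, y, V) - log_lik_subsample(beta, X, y, V)
        return prop if np.log(rng.uniform()) < log_ratio else beta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50_000, 5))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.ones(5))))
    beta = np.zeros(5)
    for _ in range(200):
        beta = approx_mh_step(beta, X, y, subset_size=1_000, rng=rng)
    print(beta)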

Application to SUSY dataset

• n = 5,000,000 (0.5 million test), binary outcome & 18 continuous covariates

• Considered subset sizes ranging from |V| = 1,000 to 4,500,000

• Considered different losses as a function of |V|

• The rate at which the loss → 0 with ε is heavily dependent on the loss

• For a small computational budget & a focus on posterior mean estimation, small subsets are preferred

• As the budget increases & the loss focuses more on the tails (e.g., for interval estimation), the optimal |V| increases

Application 2: Mixture models & tensor factorizations

• We also considered a nonparametric Bayes model:

\[
\mathrm{pr}(y_{i1} = c_1, \ldots, y_{ip} = c_p) \;=\; \sum_{h=1}^{k} \lambda_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j},
\]

a very useful model for multivariate categorical data (see the sketch below)

• Dunson & Xing (2009) - a data augmentation Gibbs sampler

• Sampling the latent classes is computationally prohibitive for huge n

• Use an adaptive Gaussian approximation - avoid sampling the individual latent classes

• We have shown Assumptions 1-2 hold; the Assumption 2 result is more general than this setting

• Improved computational performance for large n
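
A sketch of what the factorization computes, assuming mixture weights lam (length k) & per-variable class profiles psi[j] (k x d_j): each cell probability of the p-way contingency table is a weighted sum over the latent classes:

    import numpy as np

    def cell_probability(cells, lam, psi):
        """pr(y_1 = c_1, ..., y_p = c_p) = sum_h lam_h * prod_j psi[j][h, c_j]."""
        prob_per_class = lam.copy()                           # lambda_h
        for j, c_j in enumerate(cells):
            prob_per_class = prob_per_class * psi[j][:, c_j]  # times psi^(j)_{h, c_j}
        return prob_per_class.sum()

    # toy example: p = 3 categorical variables with 4, 3 & 5 levels, k = 2 latent classes
    rng = np.random.default_rng(0)
    k, dims = 2, [4, 3, 5]
    lam = rng.dirichlet(np.ones(k))
    psi = [rng.dirichlet(np.ones(d), size=k) for d in dims]   # each is k x d_j, rows sum to 1
    print(cell_probability((1, 0, 3), lam, psi))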

Application 3: Low rank approximation to GP

j Gaussian process regression, yi = f (xi )+ηi , ηi ∼ N (0,σ2)

j f ∼GP prior with covariance τ2 exp(−φ||x1 −x2||2)

j Discrete-uniform on φ & gamma priors on τ−2,σ−2

j Marginal MCMC sampler updates φ,τ−2,σ−2

j We show Assumption 1 holds under mild regularity conditionson “truth”, Assumption 2 holds for partial rank-r eigenapproximation to Σ

j Less accurate approximations are clearly superior in practice for a small computational budget

aMCMC 25
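
As a deliberately simple illustration of the rank-r idea (a sketch under stated assumptions, not the implementation behind the results above), the numpy code below builds the squared-exponential covariance from the slide, keeps only its top r eigenpairs, and evaluates the resulting approximate Gaussian marginal likelihood via the Woodbury identity; the function names and toy data are invented for illustration.

import numpy as np

def full_cov(X, tau2, phi):
    # dense covariance Sigma_ij = tau^2 * exp(-phi * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return tau2 * np.exp(-phi * sq)

def log_marginal_rank_r(y, X, tau2, phi, sigma2, r):
    # Approximate log p(y | phi, tau2, sigma2), replacing Sigma with its top-r eigen
    # expansion U_r diag(lam_r) U_r^T; (sigma2 I + Sigma_r)^{-1} comes cheaply from
    # the Woodbury identity because U_r has orthonormal columns. For simplicity we
    # take a full eigh and truncate; a scalable version would compute only the
    # leading r eigenpairs.
    n = len(y)
    lam, U = np.linalg.eigh(full_cov(X, tau2, phi))   # eigenvalues in ascending order
    lam = np.maximum(lam, 1e-12)                      # guard against round-off negatives
    lam_r, U_r = lam[-r:], U[:, -r:]                  # keep the r largest eigenpairs
    Ury = U_r.T @ y
    quad = y @ y / sigma2 - (Ury ** 2 / (1.0 / lam_r + 1.0 / sigma2)).sum() / sigma2 ** 2
    logdet = (n - r) * np.log(sigma2) + np.log(lam_r + sigma2).sum()
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

# toy data: r = 200 (= n) recovers the exact marginal likelihood, smaller r the approximation
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(200)
for r in (5, 20, 200):
    print(r, log_marginal_rank_r(y, X, tau2=1.0, phi=10.0, sigma2=0.01, r=r))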


Page 113: Scaling up Bayesian Inference

Applications: General Conclusions

j Achieving uniform control of the approximation error ε requires approximations adaptive to the current state of the chain

j More accurate approximations are needed farther from the high-probability region of the posterior; fortunately the chain is rarely there

j Approximations to conditionals of vector parameters are highly sensitive to the 2nd moment

j Smaller condition numbers for the covariance matrix of vector parameters mean less accurate approximations can be used

aMCMC 26


Page 117: Scaling up Bayesian Inference

Outline

Motivation & background

EP-MCMC

aMCMC

Discussion

Discussion 27

Page 118: Scaling up Bayesian Inference

Discussion

j Proposed very general classes of scalable Bayes algorithms

j EP-MCMC & aMCMC - fast & scalable with guarantees

j Interest in improving theory - avoid reliance on asymptotics in EP-MCMC & weaken assumptions in aMCMC

j Useful to combine algorithms - e.g., run aMCMC for each subset

j Viewing existing algorithms through this theoretical lens suggests new & improved algorithms

j Also, very interested in hybrid frequentist-Bayes algorithms

Discussion 27


Page 124: Scaling up Bayesian Inference

Hybrid high-dimensional density estimation

Ye, Canale & Dunson (2016, AISTATS)

j y_i = (y_{i1}, . . . , y_{ip})^T ∼ f with p large & f an unknown density

j Potentially use Dirichlet process mixtures of factor models

j Approach doesn’t scale well at all with p

j Instead use hybrid of Gibbs sampling & fast multiscale SVD (see the toy sketch after this slide)

j Scalable, excellent mixing & empirical/predictive performance

Discussion 28
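
The sketch below is only a toy stand-in, not the Ye, Canale & Dunson algorithm: it substitutes an ordinary truncated SVD for the fast multiscale SVD and a small finite Gaussian mixture with a basic Gibbs sampler for the nonparametric mixture, purely to show the "reduce dimension first, then Gibbs-sample a mixture on the scores" pattern; all dimensions and priors are assumptions.

import numpy as np

rng = np.random.default_rng(1)

# toy high-dimensional data: n observations in p dimensions with low-rank, clustered structure
n, p, d, K = 500, 200, 3, 4
scores_true = rng.standard_normal((n, d)) + 3.0 * rng.integers(0, 2, (n, 1))
Y = scores_true @ rng.standard_normal((d, p)) + 0.1 * rng.standard_normal((n, p))

# step 1: plain truncated SVD standing in for the fast multiscale SVD
U, s, _ = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
Z = U[:, :d] * s[:d]                       # n x d scores

# step 2: Gibbs sampler for a K-component Gaussian mixture on the scores
# (unit covariance, N(0, 10 I) priors on means, Dirichlet(1) weights)
mu = rng.standard_normal((K, d))
w = np.full(K, 1.0 / K)
for it in range(200):
    # labels | means, weights
    logp = np.log(w) - 0.5 * ((Z[:, None, :] - mu[None]) ** 2).sum(-1)
    probs = np.exp(logp - logp.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    labels = (probs.cumsum(1) > rng.uniform(size=(n, 1))).argmax(1)
    # means, weights | labels (conjugate updates)
    for h in range(K):
        Zh = Z[labels == h]
        prec = Zh.shape[0] + 0.1           # posterior precision with prior variance 10
        mu[h] = Zh.sum(0) / prec + rng.standard_normal(d) / np.sqrt(prec)
    w = rng.dirichlet(1.0 + np.bincount(labels, minlength=K))

print("one posterior draw of the mixture weights:", np.round(w, 2))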


Page 129: Scaling up Bayesian Inference

What about mixing?

[Figure: autocorrelation function (ACF) vs. lag (0-40) for three samplers: DA, MH-MVN & CDA]

j In the above we have put aside the mixing issues that can arise in big samples

j Slow mixing → we need many more MCMC samples for the same MC error

j Common data augmentation algorithms for discrete data fail badly for large imbalanced data (Johndrow et al. 2016) - see the sketch after this slide

j But such problems can be fixed via calibration (Duan et al. 2016)

j Interesting area for further research

Discussion 29
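
To reproduce the failure mode referenced above (not the calibrated fix), here is a minimal numpy/scipy sketch of an Albert & Chib probit data-augmentation Gibbs sampler for an intercept-only model with a rare outcome; the sample sizes and priors are assumptions, and the chain should exhibit very high lag-1 autocorrelation, in line with Johndrow et al. (2016).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, n_ones = 10_000, 20                 # heavily imbalanced binary outcome
y = np.zeros(n)
y[:n_ones] = 1.0

beta, draws = 0.0, []                  # probit intercept, flat prior
for it in range(2000):
    # 1. latent z_i | beta, y_i: N(beta, 1) truncated to z_i > 0 iff y_i = 1,
    #    drawn by inverse-CDF sampling on the appropriate side of zero
    u = rng.uniform(size=n)
    lo = norm.cdf(-beta)               # P(z_i < 0 | beta)
    v = np.where(y == 1, lo + u * (1.0 - lo), u * lo)
    z = beta + norm.ppf(np.clip(v, 1e-12, 1.0 - 1e-12))
    # 2. beta | z: conjugate normal update, posterior N(mean(z), 1/n)
    beta = z.mean() + rng.standard_normal() / np.sqrt(n)
    draws.append(beta)

chain = np.array(draws[500:])          # discard burn-in
acf1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
print("lag-1 autocorrelation of the intercept chain:", round(acf1, 3))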


Page 134: Scaling up Bayesian Inference

Primary References

j Duan L, Johndrow J, Dunson DB (2017) Calibrated data augmentation for scalable Markov chain Monte Carlo. arXiv:1703.03123.

j Johndrow J, Mattingly J, Mukherjee S, Dunson DB (2015) Approximations of Markov chains and Bayesian inference. arXiv:1508.03387.

j Johndrow J, Smith A, Pillai N, Dunson DB (2016) Inefficiency of data augmentation for large sample imbalanced data. arXiv:1605.05798.

j Li C, Srivastava S, Dunson DB (2016) Simple, scalable and accurate posterior interval estimation. arXiv:1605.04029; Biometrika, in press.

Discussion 30