Variance Reduction and Quasi-Newton for Particle-Based Variational Inference

Michael H. Zhu 1, Chang Liu 2, Jun Zhu 3

1 Department of Computer Science, Stanford University, Stanford, CA, USA
2 Microsoft Research Asia, Beijing, 100080, China
3 Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch ML Center, Tsinghua University, Beijing, 100084, China
Correspondence to: Jun Zhu, Michael H. Zhu
Abstract

Particle-based Variational Inference methods (ParVIs), like Stein Variational Gradient Descent, are nonparametric variational inference methods that optimize a set of particles to best approximate a target distribution. ParVIs have been proposed as efficient approximate inference algorithms and as potential alternatives to MCMC methods. However, to our knowledge, the quality of the posterior approximation of particles from ParVIs has not been examined before for large-scale Bayesian inference problems. We conduct this analysis and evaluate the sample quality of particles produced by ParVIs, and we find that existing ParVI approaches using stochastic gradients converge insufficiently fast under sample quality metrics. We propose a novel variance reduction and quasi-Newton preconditioning framework for ParVIs by leveraging the Riemannian structure of the Wasserstein space and advanced Riemannian optimization algorithms. Experimental results demonstrate the accelerated convergence of variance reduction and quasi-Newton methods for ParVIs for accurate posterior inference in large-scale and ill-conditioned problems.
1. Introduction

A central problem in Bayesian inference is approximating an intractable posterior distribution p and estimating intractable expectations $\mathbb{E}_p[f(X)] = \int f(x)\,p(x)\,\mathrm{d}x$ with respect to p. MCMC methods (Brooks et al., 2011) are based on simulating a Markov chain with limiting distribution p, drawing samples $x_1, x_2, \ldots$ which represent p, and computing the sample average $\frac{1}{M}\sum_{i=1}^{M} f(x_i)$, which is an asymptotically exact estimator of $\mathbb{E}_p[f(X)]$ as $M \to \infty$. Variational inference (VI) methods (Wainwright et al., 2008; Blei et al., 2017) recast the inference problem as a parametric optimization problem and attempt to globally approximate the posterior distribution p with a tractable distribution from some variational family. MCMC methods are asymptotically exact but can be slow; VI methods can be fast but are generally biased.
Particle-based Variational Inference methods (ParVIs) are nonparametric variational inference methods that optimize a set of particles $\{x_1, x_2, \ldots, x_M\}$ to best represent p. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a leading instance of ParVIs that has received an increasing number of extensions (e.g., Zhuo et al. (2018); Chen et al. (2018a;b); Wang et al. (2019)) and applications (e.g., Feng et al. (2017); Pu et al. (2017); Liu et al. (2017); Yoon et al. (2018)). ParVIs have been proposed as efficient approximate inference algorithms potentially combining the advantages of MCMC and VI for large-scale Bayesian inference problems. However, to our knowledge, an important question has not yet been explored for large-scale Bayesian inference problems: for a given posterior distribution p, how well do the particles $\{x_1, x_2, \ldots, x_M\}$ produced by ParVIs represent p in practice, and how accurate is the estimator $\frac{1}{M}\sum_{i=1}^{M} f(x_i)$ of $\mathbb{E}_p[f(X)]$ in practice?
We explore this question in the context of Bayesian linear regression and logistic regression, two fundamental real-world inference tasks. We conduct a careful empirical inspection of the sample quality of particles produced by ParVIs under various metrics, including mean squared error for estimating posterior mean and covariance, maximum mean discrepancy (Gretton et al., 2012), and kernel Stein discrepancy (Chwialkowski et al., 2016; Liu et al., 2016). We find that existing ParVI approaches using stochastic gradients converge insufficiently fast under these sample quality metrics, especially in large-scale and ill-conditioned scenarios.

For accurate posterior inference, highly accurate solutions to the ParVI optimization problem are needed. In the context of large-scale optimization, a popular method is stochastic gradient descent (SGD) because of the fast per-iteration computation time. While SGD can reach an approximate solution relatively quickly, it has slow asymptotic convergence due to the high variance from the random sampling of data points and the resulting need for decaying step sizes. Variance reduction methods for stochastic gradient descent (Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014) have been proven, in theory and in practice, to accelerate convergence for strongly convex problems when highly accurate solutions are needed.

For ill-conditioned problems, however, the convergence speeds of first-order gradient methods can be slow. One general solution is quasi-Newton methods (Nocedal & Wright, 2006), like L-BFGS, which use the history of gradients to approximate the inverse Hessian of the objective function and scale each step by the approximate inverse Hessian to account for the curvature of the function. Extending traditional, full-batch L-BFGS methods to the stochastic setting (Byrd et al., 2016) is challenging since noisy Hessian approximations combined with high-variance stochastic gradients can be unstable. Combining variance reduction and stochastic quasi-Newton methods (Moritz et al., 2016) leads to stable Hessian approximations and low-variance stochastic gradients and has been shown to accelerate convergence when highly accurate solutions are needed.
We propose a novel variance reduction and quasi-Newton preconditioning framework for ParVIs. We follow the gradient flow perspective of ParVIs as optimization methods for the KL divergence on the Wasserstein space, a manifold of probability distributions (Chen et al., 2018a; Liu et al., 2019). We develop our framework for ParVIs by leveraging the Riemannian structure of the Wasserstein space and Riemannian variance reduction (Zhang et al., 2016; Zhou et al., 2019) and quasi-Newton methods (Roychowdhury & Parthasarathy, 2017; Kasai et al., 2018) that enjoy proven acceleration. Our approach is more principled than intuitively applying optimization techniques to the update rule of each particle, as the ParVI optimization problem is not defined on the support space. In handling the Riemannian structure of the Wasserstein space, we trade off the accuracy and computational cost of geometric approximation and obtain stable and practical algorithms. Moreover, as the Wasserstein optimization perspective is general for ParVIs, our framework is applicable to various ParVI instances. Under the same experimental setup, our proposed methods greatly improve the convergence rate.
2. Related work

Variance reduction. Dubey et al. (2016) develop variance reduction techniques for Stochastic Gradient Langevin Dynamics (SGLD), and Chatterji et al. (2018) prove sharp theoretical bounds showing that variance reduction methods for SGLD converge faster when highly accurate solutions are needed. Li et al. (2019) develop variance reduction for Hamiltonian Monte Carlo. Zhang et al. (2018b) propose variance reduction techniques for SPOS (Zhang et al., 2018a), an algorithm combining SGLD and SVGD. Their VR-SPOS algorithm, however, is limited to SPOS since they rely on a convergence analysis only applicable to SPOS.
ParVIs and second-order information. The Stein variational Newton method (Detommaso et al., 2018) uses an approximate Newton-like update based on a computable approximation to a functional Newton direction. Their method explicitly computes the Hessian of the log-density, so it is not a quasi-Newton method. Wang et al. (2019) present a generalization of SVGD with matrix-valued kernels and propose using preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. They propose a practical algorithm based on a weighted average of Hessians at anchor points, which is an intuitive approximation. As the Hessian is a local property, distant anchor points do not necessarily hold useful and consistent information for the Hessian at the current position. Our method is based on L-BFGS, where the assumption on the Hessian approximator is clear (i.e., the matrix satisfying the secant equation that is the closest to the previous approximator). L-BFGS variants also have convergence bound guarantees (e.g., Kasai et al. (2018)). Moreover, the two methods above are developed on the PH manifold (probability space that has an RKHS as its tangent space) (Liu, 2017), which may not be well-defined (Liu et al., 2019) and only benefits the particular ParVI of SVGD. Another higher-order method is the Riemannian SVGD method (Liu & Zhu, 2018), which generalizes SVGD to a Riemannian support space (i.e., sample/particle space). Riemannian SVGD also requires the exact Hessian, so it is not quasi-Newton. While Riemannian SVGD requires a properly conceived metric, our method directly utilizes the geometry of the Wasserstein objective and is more flexible.
3. Preliminaries

Our variance reduction and quasi-Newton framework for ParVIs is developed in the context of Riemannian geometry, where we utilize the Riemannian structure of the Wasserstein space. We first introduce these related concepts.

3.1. Riemannian Manifold

A manifold $\mathcal{M}$ is a topological space that locally behaves like a Euclidean space (i.e., it is locally homeomorphic to a Euclidean open subset; see, e.g., Do Carmo (1992); Abraham et al. (2012)). A manifold relaxes the linear structure and generalizes linear space to allow curvature, while still coming with handy structures. To begin with, a tangent vector at $x \in \mathcal{M}$ admits a general definition as a directional derivative operator at $x$, and all such vectors form a linear space regarded as the tangent space $T_x\mathcal{M}$ at $x$. When every tangent space is endowed with an inner product $\langle\cdot,\cdot\rangle_{T_x\mathcal{M}}$, the manifold is called a Riemannian manifold, which induces more structures. The gradient of a function $f$ at $x$ is the unique tangent vector such that for any $v \in T_x\mathcal{M}$, $\langle\operatorname{grad} f(x), v\rangle_{T_x\mathcal{M}}$ is the directional derivative of $f$ along $v$. As in the linear case, it is the steepest ascending direction for $f$ at $x$, thus being the foundation of Riemannian optimization methods. The length of a smooth curve $\gamma: [a,b] \to \mathcal{M}$ can be defined as the integral over velocity, $\int_a^b \|\dot\gamma_t\|_{T_{\gamma_t}\mathcal{M}}\,\mathrm{d}t$, where $\dot\gamma_t$ denotes the tangent vector along the curve at $\gamma_t$, and the curve with minimal length between any adjacent point pair on it is called a geodesic. The exponential map $\operatorname{Exp}_x(v)$ is used as the counterpart of vector addition to update points in optimization methods. It transports a point $x$ to another by walking along the geodesic tangent to $v \in T_x\mathcal{M}$ at $x$ for length $\|v\|_{T_x\mathcal{M}}$. The parallel transport $\Gamma_x^y(v)$ links tangent vectors at different points. It moves a tangent vector $v$ at $x$ to one at $y$ along the geodesic from $x$ to $y$, in a certain way that is regarded as parallel. In the Euclidean case these structures reduce to the familiar ones: $\operatorname{Exp}_x(v) = x + v$ and parallel transport is the identity map.
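To make these objects concrete in the setting used later, here is a minimal Python sketch (our illustration, not part of the paper) of Riemannian gradient descent with a pluggable exponential map; the Euclidean specialization $\operatorname{Exp}_x(v) = x + v$ with identity parallel transport is exactly the structure the particle updates below rely on.

import numpy as np

def riemannian_gradient_descent(x0, riem_grad, exp_map, step=0.1, n_steps=100):
    """Generic Riemannian gradient descent: x <- Exp_x(-step * grad f(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        v = -step * riem_grad(x)   # tangent vector at x (descent direction)
        x = exp_map(x, v)          # move along the geodesic tangent to v
    return x

# Euclidean specialization: Exp_x(v) = x + v, parallel transport is the identity.
def exp_euclidean(x, v):
    return x + v

# Example: minimize f(x) = 0.5 * ||x||^2, whose (Riemannian = Euclidean) gradient is x.
x_star = riemannian_gradient_descent(np.array([3.0, -2.0]), riem_grad=lambda x: x,
                                     exp_map=exp_euclidean)
print(x_star)  # close to the minimizer [0, 0]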
3.2. The Wasserstein Space

The (2-)Wasserstein space $\mathcal{P}_2$ is the set of distributions on a support space (i.e., sample/particle space) with finite second-order moments. It is very inclusive and cannot be expressed in parametric form. Nevertheless, the structure of its tangent space makes it convenient to express its elements by samples. Here we consider the Euclidean support space $\mathbb{R}^m$ and treat the corresponding $\mathcal{P}_2$ as an infinite-dimensional manifold. Let q be a point on $\mathcal{P}_2$, which is a distribution on $\mathbb{R}^m$, and let $\{x^{(i)}\}_{i=1}^M$ be a set of samples, also called particles, of q. Consider updating the particles with a vector field $V$ on $\mathbb{R}^m$ ($V(x) \in \mathbb{R}^m, \forall x \in \mathbb{R}^m$) for an infinitesimal $\varepsilon > 0$: $\{x^{(i)} + \varepsilon V(x^{(i)})\}_{i=1}^M$, and denote the distribution that this new set of particles obeys as $q_\varepsilon$. Taking the continuous limit $\varepsilon \to 0$ and repeatedly applying this procedure, the vector field $V$ induces a smooth curve of distributions $(q_t)_t$ on $\mathcal{P}_2$ around q. Such a vector field $V$ is not unique for inducing a given distribution curve around q, but all these vector fields form an equivalence class under the equivalence relation $U \simeq V$ if $\nabla\cdot(qU - qV) = 0$, where "$\nabla\cdot V$" denotes the divergence of the vector field $V$. In each equivalence class, the vector field with the minimum $L^2_q$-norm $\sqrt{\mathbb{E}_q[V \cdot V]}$ can be taken as the representor of the class, where "$\cdot$" denotes the conventional vector inner product. All such representors form the linear subspace $\overline{\{\nabla\varphi \mid \varphi \in C_c^\infty\}}^{L^2_q}$ of $L^2_q := \{V \mid \mathbb{E}_q[V \cdot V] < \infty\}$, which is regarded as the tangent space $T_q\mathcal{P}_2$ at q.
Algorithm 1 Stochastic Variance Reduced Gradient (SVRG) for ParVIs

Require: Initial particles $\{x_0^{(j)}\}_{j=1}^M$, target distribution $p_0(x)\prod_{n=1}^N p_n(x)$, update period $T_s$, learning rate $\varepsilon$.
Require: Vector field estimators $\hat U(\{x^{(j)}\}_j)^{(i)}$ and $\hat V_n(\{x^{(j)}\}_j)^{(i)}$, parallel transport estimator $\hat\Gamma_{\{x^{(j)}\}_j}^{\{y^{(j)}\}_j}(\{V^{(j)}\}_j)^{(i)}$.
1: Initialize $\tilde x^{(i)} \leftarrow x_0^{(i)}$ for $i = 1, \cdots, M$.
2: for $s = 1, 2, 3, \cdots$ do
3:   Let $x_0^{(i)} \leftarrow \tilde x^{(i)}$ for $i = 1, \cdots, M$.
4:   Let $\tilde V^{(i)} \leftarrow \sum_{n=1}^N \hat V_n(\{\tilde x^{(j)}\}_j)^{(i)}$ for $i = 1, \cdots, M$.
5:   for $k = 0, \cdots, T_s - 1$ do
6:     Uniformly randomly draw a data point $n_k \in \{1, \cdots, N\}$.
7:     Let $W_k^{(i)} \leftarrow \hat U(\{x_k^{(j)}\}_j)^{(i)} + N\hat V_{n_k}(\{x_k^{(j)}\}_j)^{(i)} - \hat\Gamma_{\{\tilde x^{(j)}\}_j}^{\{x_k^{(j)}\}_j}\big(\{N\hat V_{n_k}(\{\tilde x^{(j')}\}_{j'})^{(j)} - \tilde V^{(j)}\}_j\big)^{(i)}$ for $i = 1, \cdots, M$.
8:     Let $x_{k+1}^{(i)} \leftarrow x_k^{(i)} + \varepsilon W_k^{(i)}$ for $i = 1, \cdots, M$.
9:   end for
10:  Let $\tilde x^{(i)} \leftarrow x_{T_s}^{(i)}$ for $i = 1, \cdots, M$.
11: end for
12: return $\{\tilde x^{(i)}\}_{i=1}^M$.
The stochastic gradient has high variance since SGD or mini-batch SGD approximates the full gradient using a single example or a small mini-batch of examples.

We propose applying Riemannian SVRG (Zhang et al., 2016) to reduce the variance of SGD. The main idea of SVRG is to maintain a reference snapshot position and a corresponding reference snapshot full-gradient. In every iteration, the stochastic gradient at the snapshot position minus the snapshot full-gradient is used in the update rule as a variance reduction term. The reference snapshot position and full-gradient are periodically updated at the start of every outer loop and subsequently used in every iteration in the inner loop.
We now consider the derivation of SVRG for ParVIs. According to Riemannian SVRG (Zhang et al., 2016), at the start of every outer loop, the current position is recorded as a reference snapshot position $\tilde q$, and the corresponding full summation over the entire dataset is computed and stored: $\tilde V := V(\tilde q)$. In each subsequent iteration k, the stochastic gradient at the current position $q_k$ is combined with the stochastic gradient at the snapshot position $\tilde q$ and the stored full gradient $\tilde V$ to get the variance-reduced gradient. Concretely, in a usual update step k, for a uniformly randomly chosen data point $n_k \in \{1, \cdots, N\}$, the update direction is calculated by $W_k := U(q_k) + NV_{n_k}(q_k) - \Gamma_{\tilde q}^{q_k}\big(NV_{n_k}(\tilde q) - \tilde V\big)$, which is then used to update the position: $q_{k+1} := \operatorname{Exp}_{q_k}(\varepsilon W_k)$. Compared to SGD, we propose adding the term $-\Gamma_{\tilde q}^{q_k}\big(NV_{n_k}(\tilde q) - \tilde V\big)$ to the update rule, which leads to a reduction in variance.

Let $\{\tilde x^{(i)}\}_{i=1}^M$ and $\{x_k^{(i)}\}_{i=1}^M$ be the sets of samples (particles) of $\tilde q$ and $q_k$, respectively. Then in step k, with $n_k$ chosen, calculate $W_k^{(i)} := W_k(x_k^{(i)}) = U(q_k)(x_k^{(i)}) + NV_{n_k}(q_k)(x_k^{(i)}) - \Gamma_{\tilde q}^{q_k}\big(NV_{n_k}(\tilde q) - \tilde V\big)(x_k^{(i)})$ and update the particles: $x_{k+1}^{(i)} = x_k^{(i)} + \varepsilon W_k^{(i)}$.
To implement the algorithm, we estimate U and V by ParVIs, which provide various implementations of $\hat U(\{x^{(j)}\}_j)^{(i)}$ and $\hat V_n(\{x^{(j)}\}_j)^{(i)}$ that approximate $U(q)(x^{(i)})$ and $V_n(q)(x^{(i)})$, respectively ($\{x^{(j)}\}_j$ is a set of particles of q). Let r be another distribution with particles $\{y^{(j)}\}_{j=1}^M$; then the parallel transport $\Gamma_q^r(V)(y^{(i)})$ (here V is a general tangent vector at q acting as the operand of the parallel transport) can also be estimated by the particles, and we write the estimator as $\hat\Gamma_{\{x^{(j)}\}_j}^{\{y^{(j)}\}_j}(\{V^{(j)}\}_j)^{(i)}$, where $V^{(j)} := V(x^{(j)})$. The SVRG for ParVIs algorithm is presented in Algorithm 1, where the estimators $\hat U(\{x^{(j)}\}_j)^{(i)}$, $\hat V_n(\{x^{(j)}\}_j)^{(i)}$ for the vector field and $\hat\Gamma_{\{x^{(j)}\}_j}^{\{y^{(j)}\}_j}(\{V^{(j)}\}_j)^{(i)}$ for the parallel transport are detailed below.
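The following minimal Python sketch (our illustration) mirrors Algorithm 1; the callables U_hat, V_hat, and transport are hypothetical placeholders for the estimators just introduced, and the particle update is the Euclidean one $x \leftarrow x + \varepsilon W$.

import numpy as np

def svrg_parvi(X0, U_hat, V_hat, transport, N, T_s, eps, n_outer):
    """Minimal sketch of Algorithm 1 (SVRG for ParVIs).

    X0:          (M, d) array of initial particles.
    U_hat(X):    (M, d) estimate of the field U at the particles X.
    V_hat(X, n): (M, d) estimate of the per-datum field V_n at the particles X.
    transport(X_src, X_tgt, V): parallel-transport estimate of the per-particle
                 vectors V from the particle set X_src to X_tgt.
    """
    X_snap = X0.copy()
    for _ in range(n_outer):
        # Snapshot full summation over the dataset (reference full gradient).
        V_snap = sum(V_hat(X_snap, n) for n in range(N))
        X = X_snap.copy()
        for _ in range(T_s):
            n_k = np.random.randint(N)   # uniformly random data point
            correction = N * V_hat(X_snap, n_k) - V_snap
            W = U_hat(X) + N * V_hat(X, n_k) - transport(X_snap, X, correction)
            X = X + eps * W              # Euclidean update of the particles
        X_snap = X
    return X_snap

# Under the pairwise-close approximation of Section 4.3 the transport is the identity:
identity_transport = lambda X_src, X_tgt, V: V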
4.2. Estimators for the Vector Field

Since ParVI methods are derived for a particle-based numerical approximation of the Wasserstein gradient $\operatorname{grad}\operatorname{KL}_p(q)$ in the vector field form, we leverage these different ways of approximation to derive respective estimators for our vector fields U and $V_n$.

SVGD (Liu & Wang, 2016). According to Liu et al. (2019), SVGD approximates a vector field (an element of $T_q\mathcal{P}_2 \subset L^2_q$) by its projection onto the vector-valued reproducing kernel Hilbert space (RKHS) $\mathcal{H}^m$ of a kernel K. Adopting this notion, we get the vector field estimators based on SVGD:

$\hat V_n(\{x^{(j)}\}_j)^{(i)} = \max\cdot\operatorname*{argmax}_{W\in\mathcal{H}^m, \|W\|_{\mathcal{H}^m}=1} \langle V_n(q), W\rangle_{L^2_q} = \frac{1}{M}\sum_j \hat K_{ij}\nabla\log p_n(x^{(j)})$,

$\hat U(\{x^{(j)}\}_j)^{(i)} = \max\cdot\operatorname*{argmax}_{W\in\mathcal{H}^m, \|W\|_{\mathcal{H}^m}=1} \langle U(q), W\rangle_{L^2_q} = \frac{1}{M}\sum_j \big(\hat K_{ij}\nabla\log p_0(x^{(j)}) + \nabla_{x^{(j)}}\hat K_{ij}\big)$,

where $\hat K_{ij} := K(x^{(i)}, x^{(j)})$, and "$\max\cdot\operatorname{argmax}$" scalar-multiplies the maximizer with the maximum. The kernel-averaged gradient of the log density drives the particles towards high probability regions of p, while the other term is a repulsive force between the particles; these two forces balance each other so that the particles approximate p.
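As an illustration of these estimators, the following sketch implements $\hat U$ and $\hat V_n$ for SVGD with an RBF kernel (the kernel choice and the helper names are ours; grad_log_p0 and grad_log_pn are assumed to return the scores of $p_0$ and $p_n$ at each particle).

import numpy as np

def rbf_kernel_and_grad(X, bandwidth=1.0):
    """K[i, j] = K(x_i, x_j) and G[i, j, :] = grad_{x_j} K(x_i, x_j) for an RBF kernel."""
    diff = X[:, None, :] - X[None, :, :]             # (M, M, d), diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * bandwidth**2))
    G = K[:, :, None] * diff / bandwidth**2           # gradient w.r.t. the second argument x_j
    return K, G

def svgd_U_hat(X, grad_log_p0, bandwidth=1.0):
    """SVGD estimate of U: (1/M) sum_j [K_ij grad log p0(x_j) + grad_{x_j} K_ij]."""
    M = X.shape[0]
    K, G = rbf_kernel_and_grad(X, bandwidth)
    return (K @ grad_log_p0(X) + G.sum(axis=1)) / M

def svgd_V_hat(X, grad_log_pn, bandwidth=1.0):
    """SVGD estimate of V_n: (1/M) sum_j K_ij grad log p_n(x_j)."""
    M = X.shape[0]
    K, _ = rbf_kernel_and_grad(X, bandwidth)
    return K @ grad_log_pn(X) / M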
Blob (Chen et al., 2018a). The Blob method uses a variational formulation of the gradient, by reformulating $-\nabla\log q$ as $\nabla\big(-\frac{\delta}{\delta q}\mathbb{E}_q[\log q]\big)$, and approximates q with a smoothed density $\tilde q := \hat q * K$, where $\hat q$ denotes the empirical distribution of the particles and "$*$" denotes convolution. The estimators are $\hat V_n(\{x^{(j)}\})^{(i)} = \nabla\log p_n(x^{(i)})$ and $\hat U(\{x^{(j)}\})^{(i)} = \nabla\log p_0(x^{(i)}) - \frac{\sum_k \nabla_{x^{(i)}}\hat K_{ik}}{\sum_j \hat K_{ij}} - \sum_k \frac{\nabla_{x^{(i)}}\hat K_{ik}}{\sum_j \hat K_{jk}}$.
GFSD (Liu et al., 2019). GFSD directly approximates q in $-\nabla\log q$ with the smoothed density $\tilde q$. The estimators are $\hat V_n(\{x^{(j)}\})^{(i)} = \nabla\log p_n(x^{(i)})$ and $\hat U(\{x^{(j)}\})^{(i)} = \nabla\log p_0(x^{(i)}) - \frac{\sum_k \nabla_{x^{(i)}}\hat K_{ik}}{\sum_j \hat K_{ij}}$.
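A corresponding sketch of the GFSD estimators (our illustration; the Blob estimator above differs only by one additional repulsion term), again with an RBF kernel and assumed score functions grad_log_p0 and grad_log_pn.

import numpy as np

def gfsd_fields(X, grad_log_p0, grad_log_pn, bandwidth=1.0):
    """GFSD estimators: V_hat_n is the per-datum score at each particle;
    U_hat adds a single kernel-smoothed repulsion term."""
    diff = X[:, None, :] - X[None, :, :]                # (M, M, d), x_i - x_j
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * bandwidth**2))
    grad_i_K = -K[:, :, None] * diff / bandwidth**2     # grad_{x_i} K(x_i, x_j) for the RBF kernel
    repulsion = grad_i_K.sum(axis=1) / K.sum(axis=1, keepdims=True)
    U_hat = grad_log_p0(X) - repulsion
    V_hat_n = grad_log_pn(X)
    return U_hat, V_hat_n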
GFSF (Liu et al., 2019). GFSF identifies $-\nabla\log q$ as the solution of an optimization problem, and then defines an estimator as the solution of a modified problem obtained by taking q as $\hat q$ and using an RKHS as the optimization domain. The estimators are $\hat V_n(\{x^{(j)}\})^{(i)} = \nabla\log p_n(x^{(i)})$ and $\hat U(\{x^{(j)}\})^{(i)} = \nabla\log p_0(x^{(i)}) + \sum_k (\hat K^{-1})_{ik}\sum_j \nabla_{x^{(j)}}\hat K_{jk}$.
4.3. Estimators for the Parallel Transport

Schild's ladder estimator. The Schild's ladder method (Ehlers et al., 1972; Kheyfets et al., 2000) constructs a first-order approximation to the parallel transport using the exponential map and its inverse on the manifold:

$\Gamma_q^r(V) \approx \operatorname{Exp}_r^{-1}\Big(\operatorname{Exp}_q\Big(2\operatorname{Exp}_q^{-1}\Big(\operatorname{Exp}_{\operatorname{Exp}_q(V)}\big(\tfrac{1}{2}\operatorname{Exp}_{\operatorname{Exp}_q(V)}^{-1}(r)\big)\Big)\Big)\Big)$.

It is known (Villani (2008), Coro. 7.22; Ambrosio et al. (2008), Prop. 8.4.6; Erbar et al. (2010), Prop. 2.1) that $\operatorname{Exp}_q(V) = (\operatorname{id} + V)_\# q$ for absolutely continuous q, which means that if $\{x^{(i)}\}_i$ is a set of samples of q, then $\{x^{(i)} + V(x^{(i)})\}_i$ is a set of samples of $\operatorname{Exp}_q(V)$. The inverse exponential map $\operatorname{Exp}_q^{-1}(r)$ (with q absolutely continuous) can be expressed by the optimal transport map $T_q^r$ from q to r: $\operatorname{Exp}_q^{-1}(r) = T_q^r - \operatorname{id}$ (Ambrosio et al. (2008), Prop. 8.4.6). In practice, $T_q^r$ can be estimated by the discrete optimal transport map from the samples $\{x^{(i)}\}_i$ of q to the samples $\{y^{(i)}\}_i$ of r, which can be done by exact methods (e.g., Pele & Werman (2009)) or faster approximate methods like the Sinkhorn methods (Cuturi, 2013; Xie et al., 2018). Applying these operations on samples, we get an implementation of $\hat\Gamma_{\{x^{(j)}\}_j}^{\{y^{(j)}\}_j}(\{V^{(j)}\}_j)^{(i)}$.
Pairwise-close estimator. Liu et al. (2019) consider the case where $\{x^{(j)}\}_j$ and $\{y^{(j)}\}_j$ are pairwise close, i.e., $d(x^{(i)}, y^{(i)}) \ll \min\big\{\min_{j\neq i} d(x^{(i)}, x^{(j)}),\ \min_{j\neq i} d(y^{(i)}, y^{(j)})\big\}$. Under this condition, the discrete optimal transport map can be approximated by $T_q^r(x^{(i)}) \approx y^{(i)}$, and the above parallel transport estimator simplifies to

$\hat\Gamma_{\{x^{(j)}\}_j}^{\{y^{(j)}\}_j}(\{V^{(j)}\}_j)^{(i)} = V^{(i)}$.

In our experiments, we use the pairwise-close estimator, which we observed works well empirically. The pairwise-close version simplifies the algorithm and computation.
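The following sketch (our illustration) shows how the Schild's ladder estimator and its pairwise-close simplification could be implemented on particles, estimating each discrete optimal transport map between equal-size, uniformly weighted particle sets as an assignment problem via SciPy; the paper's experiments use only the identity (pairwise-close) version.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_map(X_src, X_tgt):
    """Discrete OT map between equal-size, uniformly weighted particle sets:
    returns T(X_src) as the matched points of X_tgt (solved as an assignment problem)."""
    cost = cdist(X_src, X_tgt, metric="sqeuclidean")
    row, col = linear_sum_assignment(cost)
    T = np.empty_like(X_tgt)
    T[row] = X_tgt[col]
    return T

def schilds_ladder_transport(X, Y, V):
    """Transport per-particle vectors V from particles X (of q) to particles Y (of r)
    by Schild's ladder, using Exp_q(V) = (id + V)#q and Exp_q^{-1}(r) = T_q^r - id."""
    A = X + V                                  # particles of Exp_q(V)
    Mid = A + 0.5 * (ot_map(A, Y) - A)         # midpoint of the geodesic from Exp_q(V) to r
    B = X + 2.0 * (ot_map(X, Mid) - X)         # particles of Exp_q(2 Exp_q^{-1}(midpoint))
    return ot_map(Y, B) - Y                    # Exp_r^{-1}(...) evaluated at the particles of r

def pairwise_close_transport(X, Y, V):
    """Pairwise-close simplification: the transported vectors are unchanged."""
    return V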
4.4. SPIDER for ParVIs

We propose applying Riemannian SPIDER (Stochastic Path-Integrated Differential Estimator) (Zhou et al., 2019) as another variance reduction method for ParVIs. SPIDER for ParVIs uses a recursive equation to estimate the full gradient along the trajectory and employs normalized gradient updates. In contrast to SVRG, SPIDER only relies on the previous position instead of a reference snapshot position for variance reduction. Instead of using a reference snapshot full-gradient, SPIDER uses the estimate of the full gradient at the previous position.

At the start of every outer loop, we first compute a full gradient $\{W_0^{(j)}\}_{j=1}^M$ over the entire dataset and then apply normalized gradient ascent. In each of the subsequent iterations $k \geq 1$, the stochastic gradient at the current particle position $\{x_k^{(j)}\}_j$ is combined with the stochastic gradient at the previous particle position $\{x_{k-1}^{(j)}\}_j$ and the previous estimate of the full gradient $\{W_{k-1}^{(j)}\}_j$ to get the new estimate of the full gradient $\{W_k^{(j)}\}_j$. The particle positions are updated with normalized gradient ascent using the formula $x_{k+1}^{(i)} \leftarrow x_k^{(i)} + \varepsilon W_k^{(i)}/\|W_k\|$, where $\|W_k\|^2 = \frac{1}{M}\sum_{j=1}^M \|W_k^{(j)}\|^2$ is the discretization of the $L^2_q$ norm. For space reasons, the SPIDER for ParVIs algorithm is presented in the supplement.
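Since the full algorithm is deferred to the supplement, the sketch below shows one outer loop of SPIDER for ParVIs as we read the description above, under the pairwise-close (identity transport) simplification; the exact form of the recursion is therefore an assumption on our part.

import numpy as np

def spider_parvi_epoch(X, U_hat, V_hat, N, T_s, eps):
    """One outer loop of SPIDER for ParVIs (sketch; identity parallel transport assumed).

    X: (M, d) particles; U_hat and V_hat as in the SVRG sketch above.
    """
    # Full gradient over the entire dataset at the start of the outer loop.
    W = U_hat(X) + sum(V_hat(X, n) for n in range(N))
    for _ in range(T_s):
        # Normalized gradient step; the norm discretizes the L^2_q norm.
        norm = np.sqrt(np.mean(np.sum(W**2, axis=1)))
        X_new = X + eps * W / norm
        # Recursive full-gradient estimate (assumed form of the SPIDER recursion).
        n_k = np.random.randint(N)
        stoch_new = U_hat(X_new) + N * V_hat(X_new, n_k)
        stoch_old = U_hat(X) + N * V_hat(X, n_k)
        W = W + stoch_new - stoch_old
        X = X_new
    return X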
4.5. Stochastic Quasi-Newton with Variance Reduction (SQN-VR) for ParVIs
Algorithm 2 Stochastic Quasi-Newton with Variance Reduction (SQN-VR) for ParVIs (simplified under the pairwise-close approximation)

Require: Initial particles $\{x_0^{(j)}\}_{j=1}^M$, target distribution $p_0(x)\prod_{n=1}^N p_n(x)$, number of epochs $S$, update period $T_s$, learning rates $\varepsilon_1, \varepsilon_2$, L-BFGS memory size.
Require: Vector field estimators $\hat U(\{x^{(j)}\}_j)^{(i)}$ and $\hat V_n(\{x^{(j)}\}_j)^{(i)}$.
1: Initialize $\tilde x_1^{(i)} \leftarrow x_0^{(i)}$ for $i = 1, \cdots, M$.
2: Let $\tilde V_1^{(i)} \leftarrow \sum_{n=1}^N \hat V_n(\{\tilde x_1^{(j)}\}_j)^{(i)}$ for $i = 1, \cdots, M$.
3: for $s = 1, 2, 3, \cdots, S$ do
4:   Let $x_0^{(i)} \leftarrow \tilde x_s^{(i)}$ for $i = 1, \cdots, M$.
5:   for $k = 0, \cdots, T_s - 1$ do
6:     Sample a data point $n_k \in \{1, \cdots, N\}$.
7:     Let $W_k^{(i)} = \hat U(\{x_k^{(j)}\}_j)^{(i)} + N\hat V_{n_k}(\{x_k^{(j)}\}_j)^{(i)} - \big(N\hat V_{n_k}(\{\tilde x_s^{(j')}\}_{j'})^{(i)} - \tilde V_s^{(i)}\big)$ for $i = 1, \cdots, M$.
8:     if $s > 2$ then
9:       Compute the quasi-Newton update $[Z_k^{(j)}]_{j=1}^M$ from $[W_k^{(j)}]_{j=1}^M$ by the L-BFGS two-loop recursion.
10:      Let $x_{k+1}^{(i)} \leftarrow x_k^{(i)} - \varepsilon_2 Z_k^{(i)}$ for $i = 1, \cdots, M$.
11:    else
12:      Let $x_{k+1}^{(i)} \leftarrow x_k^{(i)} + \varepsilon_1 W_k^{(i)}$ for $i = 1, \cdots, M$.
13:    end if
14:  end for
15:  Let $\tilde x_{s+1}^{(i)} \leftarrow x_{T_s}^{(i)}$ for $i = 1, \cdots, M$.
16:  Let $\tilde V_{s+1}^{(i)} \leftarrow \sum_{n=1}^N \hat V_n(\{\tilde x_{s+1}^{(j)}\}_j)^{(i)}$ for $i = 1, \cdots, M$.
17:  Let $S_{s+1}^{(i)} \leftarrow \tilde x_{s+1}^{(i)} - \tilde x_s^{(i)}$ for $i = 1, \cdots, M$.
18:  Let $Y_{s+1}^{(i)} \leftarrow \hat U(\{\tilde x_{s+1}^{(j)}\}_j)^{(i)} + \tilde V_{s+1}^{(i)} - \hat U(\{\tilde x_s^{(j)}\}_j)^{(i)} - \tilde V_s^{(i)}$ for $i = 1, \cdots, M$.
19:  Store the L-BFGS pair $\big([S_{s+1}^{(j)}]_{j=1}^M, [Y_{s+1}^{(j)}]_{j=1}^M\big)$, and discard the oldest pair if the memory size is exceeded.
20: end for
21: return $\{\tilde x_S^{(i)}\}_{i=1}^M$.
To address ill-conditioned Bayesian inference problems, we further incorporate quasi-Newton preconditioning techniques. Ill-conditioned problems are typically identified by an ill-conditioned Hessian of the objective, which makes the function landscape distorted along a certain direction. (Quasi-)Newton preconditioning works by stretching the optimization space to make the landscape more isotropic, resulting in a longer-sighted updating direction. In the context of Bayesian inference, we depict the ill-conditionedness accordingly by the Hessian of the KL divergence on $\mathcal{P}_2$, which is now a quadratic form in the tangent space that generalizes the matrix form to the infinite-dimensional manifold. According to Example 15.9 of Villani (2008), the Hessian operator takes the form

$\big(\operatorname{Hess}\operatorname{KL}_p(q)\big)[V] = \mathbb{E}_{q(x)}\big[\|\nabla V(x)\|_F^2 - V(x)^\top\big(\nabla\nabla^\top\log p(x)\big)V(x)\big] \qquad (1)$

for Euclidean support space, so its ill-conditionedness is related to that of the Hessian matrix $\nabla\nabla^\top\log p(x)$.
We propose applying Riemannian Stochastic Quasi-Newton with Variance Reduction (SQN-VR) (Kasai et al., 2018) to ParVIs. SQN-VR builds on SVRG by leveraging curvature information to speed up convergence on ill-conditioned problems. Like SVRG, SQN-VR computes the variance-reduced stochastic gradient at every iteration. In SQN-VR, an approximation to the inverse Hessian is computed and applied to the variance-reduced stochastic gradient to get the final update direction.

We present the SQN-VR for ParVIs algorithm in Algorithm 2 under the pairwise-close assumption. Similarly to SVRG, SQN-VR updates the snapshot position and the corresponding full-gradient once in every outer loop. In addition, the curvature pair of the QN method is updated once in every outer loop, using the difference between the current and previous snapshot positions and the difference between their corresponding full gradients computed for VR.

In the first two outer loops of SQN-VR, before two curvature pairs have been collected, the SQN-VR update rule at each iteration is the same as the SVRG update rule. After two curvature pairs are collected, we apply a quasi-Newton update at every iteration instead of directly applying the variance-reduced gradient. Specifically, we use the L-BFGS two-loop recursion (Nocedal & Wright, 2006; Kasai et al., 2018) with the previous L curvature pairs to apply the inverse Hessian approximation operator to the variance-reduced gradient to get our quasi-Newton update direction.
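The two-loop recursion itself is the textbook procedure (Nocedal & Wright, 2006); the sketch below applies it to the stacked particle vector field, with each per-particle vector flattened into one long vector so that a single inverse-Hessian approximation acts on all particles jointly (this flattening convention is our assumption).

import numpy as np

def lbfgs_two_loop(W, pairs):
    """Apply the L-BFGS inverse-Hessian approximation to W (Nocedal & Wright, 2006).

    W:     (M, d) variance-reduced vector field at the particles, treated as one
           flattened vector of length M*d.
    pairs: non-empty list of stored curvature pairs (S, Y), each of shape (M, d),
           ordered oldest first.
    Returns Z of shape (M, d) with flatten(Z) approximating H^{-1} flatten(W).
    """
    q = W.reshape(-1).copy()
    alphas, rhos = [], []
    for S, Y in reversed(pairs):                     # first loop: newest to oldest
        s, y = S.reshape(-1), Y.reshape(-1)
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    # Initial Hessian scaling gamma = s^T y / y^T y from the most recent pair.
    s_last, y_last = pairs[-1][0].reshape(-1), pairs[-1][1].reshape(-1)
    r = (np.dot(s_last, y_last) / np.dot(y_last, y_last)) * q
    for (S, Y), alpha, rho in zip(pairs, reversed(alphas), reversed(rhos)):
        s, y = S.reshape(-1), Y.reshape(-1)          # second loop: oldest to newest
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r.reshape(W.shape)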
5. Experimental results

We present experimental results on Bayesian linear regression and logistic regression. We first describe the experimental setup that we use. For the choice of ParVI, we use SVGD with the linear kernel $k(x, x') = \frac{1}{d+1}(x^\top x' + 1)$, where d is the dimension, and with mean centering of the particles, which has been proven to yield exact estimation of the mean and covariance for Gaussian target distributions (Liu & Wang, 2018). We use 100 particles and a batch size of 10 in all of our experiments. We initialize the particles from a standard Gaussian, corresponding to the prior.
We compare the following optimization algorithms: AdaGrad with momentum, SGD, SVRG, SPIDER, and SQN-VR. We note that AdaGrad is not a principled Riemannian optimization algorithm, but we include the Euclidean version of AdaGrad as an empirical algorithm because AdaGrad has been used in several SVGD papers. For every optimizer, we tune the learning rate by running a grid search over $\bigcup_{k=-1}^{2}\{\frac{10^k}{N}, \frac{3\times10^k}{N}\}$, where N is the number of data points. For AdaGrad, we additionally tune the learning rate in $\bigcup_{k=3}^{5}\{\frac{10^k}{N}, \frac{3\times10^k}{N}\}$, $\alpha \in \{0.9, 0.95, 0.99, 0.999\}$, and the fudge factor $\epsilon \in \bigcup_{k=4}^{8}\{10^{-k}\}$. For SGD, we decay the learning rate after each epoch according to the formula $\epsilon_t = a/(t+b)^\beta$, where the power $\beta \in \{0.55, 0.75, 0.95\}$ and the constants a and b are chosen so that the total learning rate decay over the total number of epochs is in $\{1, 3, 10, 30, 100, 300, 1000\}$. For SVRG and SPIDER, we use a constant learning rate for the first half of the run and decay the learning rate in the second half by a factor in $\{1, 3, 10, 30, 100, 300, 1000\}$. For SQN-VR, we use a constant learning rate in $\bigcup_{k=-5}^{0}\{10^k, 3\times10^k\}$ for the quasi-Newton updates and a memory size of 10. For all of the variance reduction methods, we update the full gradient over the entire dataset after each epoch, and we first run 10 epochs of SGD. The grid search for each optimizer consists of all combinations of learning rates and any additional optimizer-specific hyperparameters. To ensure a fair comparison in all of our results, the x-axis in our figures is the number of passes over the dataset, specifically the number of data point gradient evaluations for all of the particles divided by the dataset size N. For each dataset, we ensure that every algorithm is initialized with the same starting positions for the particles and uses the same sequence of training examples throughout.
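For concreteness, one way to realize the SGD decay schedule described above is to solve $\epsilon_t = a/(t+b)^\beta$ for a and b given an initial learning rate and a desired total decay factor over the run; the paper does not spell out this construction, so the convention below (t starting at 1, decay measured as $\epsilon_1/\epsilon_T$) is our assumption.

def sgd_decay_schedule(eps0, beta, total_decay, n_epochs):
    """Return [eps_1, ..., eps_T] with eps_t = a / (t + b)^beta, chosen so that
    eps_1 = eps0 and eps_1 / eps_T = total_decay (assumed convention;
    valid when total_decay ** (1 / beta) < n_epochs)."""
    if total_decay == 1:
        return [eps0] * n_epochs          # no decay
    # eps_1 / eps_T = ((T + b) / (1 + b))^beta = total_decay  =>  solve for b.
    c = total_decay ** (1.0 / beta)
    b = (n_epochs - c) / (c - 1.0)
    a = eps0 * (1.0 + b) ** beta
    return [a / (t + b) ** beta for t in range(1, n_epochs + 1)]

# Example: decay by a total factor of 30 over 100 epochs with beta = 0.75.
schedule = sgd_decay_schedule(eps0=1e-3, beta=0.75, total_decay=30, n_epochs=100)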
To evaluate how well the particles approximate the posterior, we consider several metrics related to sample quality. For each of the Bayesian linear regression and logistic regression problems, we first obtain a ground truth set of 40,000 MCMC samples from a long run of the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014). Specifically, we use the implementation of NUTS in PyStan (Carpenter et al., 2017) with a dense mass matrix, and we run 16 chains of NUTS with 500 burn-in iterations and 2,500 estimation iterations each. Given this reference set of samples, our first metric is Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between the 100 ParVI particles and the 40,000 MCMC samples. We use an RBF kernel for MMD with the kernel bandwidth equal to the median of the pairwise distances between the MCMC samples. From the ground-truth MCMC samples, we can calculate a ground truth posterior mean vector $\mu$ and covariance matrix $\Sigma$. For a ParVI algorithm, let $\hat\mu_t$ be the sample mean and $\hat\Sigma_t$ be the sample covariance of the set of particles at iteration t. We define the mean squared error (MSE) for a ParVI with respect to $\mu$ as $\frac{1}{d}\|\hat\mu_t - \mu\|_2^2$ and with respect to $\Sigma$ as $\frac{1}{d^2}\|\hat\Sigma_t - \Sigma\|_F^2$. Finally, our last metric is the kernel Stein discrepancy (KSD) (Chwialkowski et al., 2016; Liu et al., 2016) for the 100 ParVI particles with respect to the posterior distribution p specified by $\nabla\log p$. We evaluate KSD using the IMQ kernel proposed by Gorham & Mackey (2017), which has been proven to detect convergence and non-convergence of a sequence of samples for certain target distributions.
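As a concrete reference for the first metric, the sketch below computes the squared MMD with an RBF kernel and the median-heuristic bandwidth described above; whether the biased or unbiased estimator was used is not stated in the paper, so the simple V-statistic form here is our choice.

import numpy as np
from scipy.spatial.distance import cdist, pdist

def mmd_rbf(X, Y, bandwidth=None):
    """Squared MMD between particle set X (M, d) and reference samples Y (R, d)
    with an RBF kernel; bandwidth defaults to the median pairwise distance of Y.
    Note: for a very large reference set Y, the Y-Y term should be computed in
    chunks (or cached once), since the full kernel matrix can be prohibitively large."""
    if bandwidth is None:
        bandwidth = np.median(pdist(Y))
    gamma = 1.0 / (2.0 * bandwidth**2)
    k_xx = np.exp(-gamma * cdist(X, X, "sqeuclidean")).mean()
    k_yy = np.exp(-gamma * cdist(Y, Y, "sqeuclidean")).mean()
    k_xy = np.exp(-gamma * cdist(X, Y, "sqeuclidean")).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Example usage: mmd2 = mmd_rbf(particles, mcmc_samples)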
For each optimizer, we run a grid search over all of the optimizer hyperparameters and choose the hyperparameters that achieve the minimum MMD at the end of the run as the best-performing hyperparameters to show in our results.

For each of the datasets, we report the number of data points N, the dimensionality D, and the condition number of the posterior covariance matrix $\Sigma$, $\operatorname{cond}(\Sigma)$. Note that $\operatorname{cond}(\Sigma)$ is a computable, heuristic approximation of the conditioning of the Hessian of the KL on the Wasserstein space. Equation (1) gives an explicit relationship between the Hessian of the KL on the Wasserstein space and the Hessian of the log-density of the target posterior. For Bayesian linear regression, the posterior is Gaussian, so the Hessian of the log-posterior is the negative inverse posterior covariance matrix, which has the same condition number as the posterior covariance matrix that we report. For Bayesian logistic regression, the posterior can be approximated well by a Gaussian.
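For concreteness, the Gaussian case referenced above follows directly from Equation (1) (a short derivation, not an additional result of the paper): for a Gaussian posterior $p = \mathcal{N}(\mu, \Sigma)$, the log-density Hessian is the constant matrix $\nabla\nabla^\top\log p(x) = -\Sigma^{-1}$, so

$\big(\operatorname{Hess}\operatorname{KL}_p(q)\big)[V] = \mathbb{E}_{q(x)}\big[\|\nabla V(x)\|_F^2 + V(x)^\top\Sigma^{-1}V(x)\big],$

and the quadratic form is governed by $\Sigma^{-1}$, whose condition number equals $\operatorname{cond}(\Sigma)$.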
Bayesian linear regression. We first consider a Bayesian linear regression model where the prior on the regression coefficients is standard Gaussian. We run experiments on 8 UCI regression datasets (Dua & Graff, 2019). For space reasons, we show results for 3 of the datasets in Fig. 1 and present all of the results in the supplement.

In Fig. 1(a), we show results for the noise dataset, which is a small dataset with 1,503 examples, a low dimensionality of 6, and a low posterior covariance matrix condition number of 12. After running 100 epochs of each optimizer with extensive tuning, we see that the best-performing AdaGrad and SGD optimizers achieve an MMD of around $10^{-0.85}$. In contrast, all of the variance reduction algorithms achieve an MMD of $10^{-1.38}$ to $10^{-1.63}$.
[Figure 1: panels of log10(MMD), log10(MSE_mu), log10(MSE_cov), and log10(KSD) versus # grad / N for SQN-VR, SGD, SVRG, AdaGrad, and SPIDER.]
Figure 1. Experimental results for Bayesian linear regression, from top to bottom: (a) noise, (b) parkinson, (c) toms.
The variance reduction algorithms also achieve a much lower MSE than AdaGrad and SGD for estimating $\mu$ ($10^{-5.76}$ to $10^{-6.70}$ compared to $10^{-4.06}$ to $10^{-4.67}$) and $\Sigma$ ($10^{-8.66}$ to $10^{-9.43}$ compared to $10^{-5.95}$ to $10^{-6.79}$), resulting in much more accurate posterior mean and covariance estimates. The KSD metric provides further evidence that the variance reduction algorithms produce particles with much higher sample quality.

In Figs. 1(b) and 1(c), we consider two more challenging datasets with significantly higher posterior covariance matrix condition numbers. In Fig. 1(b), for the parkinson dataset with N = 5875, D = 21, and cond($\Sigma$) = 65,697, we see that SQN-VR performs the best after 100 epochs with an MMD of $10^{-1.56}$; AdaGrad, SGD, and SPIDER achieve an MMD of around $10^{-1.3}$, and SVRG achieves an MMD of around $10^{-1.06}$. Thus, we see that variance reduction alone might not improve over well-tuned SGD for ill-conditioned problems. If we compare AdaGrad, SGD, and SPIDER in terms of MSE for $\mu$ and $\Sigma$, we notice that SPIDER performs the best for estimating $\mu$ and the worst for estimating $\Sigma$, AdaGrad performs the best at estimating $\Sigma$ and the worst for estimating $\mu$, and SGD is in between. While these 3 methods produce particles with similar MMD, the distributions of the particles are very different, reflected in the differing estimates for $\mu$ and $\Sigma$. Interestingly, the KSD metric suggests that the particles from SGD and AdaGrad have worse quality than the particles from SPIDER and SVRG; this might be due to the less stable optimization procedure. In Fig. 1(c), we present results for the toms dataset, which has 28,179 data points, a high dimensionality of 97, and a high posterior covariance matrix condition number of 45,923. After 500 epochs, the best-performing SQN-VR achieves an MMD of $10^{-1.34}$, AdaGrad achieves an MMD of $10^{-0.52}$, and the other optimizers achieve an MMD no better than $10^{-0.18}$. For this ill-conditioned problem, we see that SQN-VR is essential for fast convergence and accurate posterior inference.
Bayesian logistic regression. We consider a Bayesian logistic regression model for binary classification where the prior on the regression coefficients is standard Gaussian. We run 8 Bayesian logistic regression experiments. For space reasons, we show results for MNIST and covtype in Fig. 2 and present all of the results in the supplement.

Our MNIST (LeCun et al., 1998) binary classification problem is classifying digits 7 vs. 9 after applying PCA to reduce the dimension of the image to 50, similar to Korattikara et al. (2014). The MNIST dataset has 12,214 training examples and a low posterior covariance matrix condition number of 58. In Fig. 2(a), we see that all of the variance reduction algorithms perform well, achieving an MMD of around $10^{-1.85}$. In contrast, the best-performing AdaGrad achieves an MMD of $10^{-0.81}$ and SGD achieves an MMD of $10^{-1.01}$.
[Figure 2: panels of log10(MMD), log10(MSE_mu), log10(MSE_cov), and log10(KSD) versus # grad / N for SQN-VR, SGD, SVRG, AdaGrad, and SPIDER.]
Figure 2. Experimental results for Bayesian logistic regression, from top to bottom: (a) mnist, (b) covtype.
The covtype dataset has 464,809 examples (using an 80% training split), 55 dimensions, and a high posterior covariance matrix condition number of 341,266. In Fig. 2(b), we see that SQN-VR and SPIDER perform the best, achieving an MMD of around $10^{-1.7}$, with SVRG following close behind with an MMD of around $10^{-1.64}$. Without variance reduction, the best-performing SGD achieves an MMD of $10^{-1.19}$ and AdaGrad achieves an MMD of $10^{-0.19}$. Looking at the other metrics, we observe that the particles from SGD approximate $\Sigma$ well but approximate $\mu$ poorly and are also worse in terms of KSD. Thus, we see that variance reduction techniques can greatly accelerate the convergence of ParVIs for real-world datasets of varying size.
6. Discussion

Our experimental results on Bayesian linear regression and logistic regression demonstrate that existing ParVI approaches using stochastic gradients converge insufficiently fast and that variance reduction and quasi-Newton methods can greatly accelerate the convergence of ParVIs for accurate posterior inference in large-scale and ill-conditioned problems. While using variance reduction techniques alone sped up convergence in many large-scale problems, combining variance reduction and quasi-Newton techniques led to significantly faster convergence in several cases and the best performance on every dataset we considered. Our algorithms are applicable to general ParVIs and are based on principled Riemannian optimization algorithms.

From the perspective of posterior inference, our new methods produced a set of particles with significantly better sample quality, as measured by MMD and KSD, and better estimates of posterior expectations, such as mean and covariance. Accurate posterior inference requires solving the ParVI optimization problem to a high degree of accuracy, so leveraging Riemannian optimization methods with fast convergence and high accuracy is very important. While we did a large grid search to tune the hyperparameters, additional tuning of the hyperparameters and other techniques, such as adaptive learning rates and mini-batch sizes, could further improve the performance of the optimization algorithms.
In our experiments, we assumed the pairwise-close condition, which we observed works well empirically. Under this assumption, our methods are simple, easy to use, fast in terms of running time, and work well in practice. In our experiments, we observed that the running times of our methods are generally comparable to or slightly faster than SGD and AdaGrad given the same number of gradient evaluations. The relative order of running times was generally SVRG ≤ SPIDER ≤ SQN-VR ≤ SGD ≤ AdaGrad. For example, on an Intel Xeon E5-2640v3, 100 epochs on the covtype dataset took 20 minutes for SVRG and SPIDER, 22 for SQN-VR, 24 for SGD, and 28 for AdaGrad.

We focused our experiments on Bayesian linear regression and logistic regression, running SVGD with a linear kernel, which works well for Gaussian-like posteriors. In this setting, we observed that ParVI methods can be highly accurate for estimating posterior expectations and producing a small set of particles which represent the posterior while being fast in terms of running time. As an example, on the challenging covtype dataset, our ParVI implementation took 22 minutes while 500 burn-in iterations of NUTS took 4.5 hours. Using a subset of 100 NUTS samples also gives a very poor representation of the posterior. Future work involves studying how well various ParVI methods approximate various posterior distributions under sample quality metrics to further improve ParVI methods for real-world Bayesian inference problems.
Acknowledgement

J. Zhu was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (No. 61620106010), Beijing Academy of Artificial Intelligence (BAAI), the Tsinghua-Huawei Joint Research Program, and a grant from the Tsinghua Institute for Guo Qiang.
References

Abraham, R., Marsden, J. E., and Ratiu, T. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, New York, 2012.

Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Chatterji, N., Flammarion, N., Ma, Y., Bartlett, P., and Jordan, M. On the theory of variance reduction for stochastic gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 764–773, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018.

Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018), Monterey, California USA, 2018a. Association for Uncertainty in Artificial Intelligence.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018b.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 2606–2615, New York, New York USA, 2016. IMLS.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, Lake Tahoe, Nevada USA, 2013. NIPS Foundation.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, Montréal, Canada, 2018. NIPS Foundation.

Do Carmo, M. P. Riemannian Geometry. Birkhäuser, 1992.

Dua, D. and Graff, C. UCI Machine Learning Repository, 2019. URL http://archive.ics.uci.edu/ml.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154–1162, 2016.

Ehlers, J., Pirani, F., and Schild, A. The geometry of free fall and light propagation. In General Relativity (papers in honour of J. L. Synge), pp. 63–84, 1972.

Erbar, M. et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pp. 1–23. Institut Henri Poincaré, 2010.

Feng, Y., Wang, D., and Liu, Q. Learning to draw samples with amortized Stein variational gradient descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2017), Sydney, Australia, 2017. Association for Uncertainty in Artificial Intelligence.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. arXiv preprint arXiv:1703.01717, 2017.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kasai, H., Sato, H., and Mishra, B. Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis. In International Conference on Artificial Intelligence and Statistics, pp. 269–278, 2018.

Kheyfets, A., Miller, W. A., and Newton, G. A. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891–2898, 2000.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, pp. 181–189, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Z., Zhang, T., Cheng, S., Zhu, J., and Li, J. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. Machine Learning, 108(8-9):1701–1727, 2019.

Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In The 32nd AAAI Conference on Artificial Intelligence, pp. 3627–3634, New Orleans, Louisiana USA, 2018. AAAI Press.

Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., and Carin, L. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, pp. 4082–4092, Long Beach, California USA, 2019. IMLS.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3118–3126, Long Beach, California USA, 2017. NIPS Foundation.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, Barcelona, Spain, 2016. NIPS Foundation.

Liu, Q. and Wang, D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems 31, pp. 8854–8863, 2018.

Liu, Q., Lee, J. D., and Jordan, M. I. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, New York USA, 2016. IMLS.

Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2017), Sydney, Australia, 2017. Association for Uncertainty in Artificial Intelligence.

Moritz, P., Nishihara, R., and Jordan, M. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pp. 249–258, 2016.

Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006.

Otto, F. The geometry of dissipative evolution equations: the porous medium equation. 2001.

Pele, O. and Werman, M. Fast and robust earth mover's distances. In Proceedings of the 12th International Conference on Computer Vision (ICCV-09), volume 9, pp. 460–467, Kyoto, Japan, 2009. IEEE.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4239–4248, Long Beach, California USA, 2017. NIPS Foundation.

Roux, N. L., Schmidt, M., and Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.

Roychowdhury, A. and Parthasarathy, S. Accelerated stochastic quasi-Newton optimization on Riemann manifolds. arXiv preprint arXiv:1704.01700, 2017.

Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wang, D., Tang, Z., Bajaj, C., and Liu, Q. Stein variational gradient descent with matrix-valued kernels. In Advances in Neural Information Processing Systems, pp. 7834–7844, 2019.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 7343–7353, Montréal, Canada, 2018. NIPS Foundation.
Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592–4600, Barcelona, Spain, 2016. NIPS Foundation.

Zhang, J., Zhang, R., and Chen, C. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv preprint arXiv:1809.01293, 2018a.

Zhang, J., Zhao, Y., and Chen, C. Variance reduction in stochastic particle-optimization sampling. arXiv preprint arXiv:1811.08052, 2018b.

Zhou, P., Yuan, X., Yan, S., and Feng, J. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In Proceedings of the 35th International Conference on Machine Learning, pp. 6018–6027, 2018.