Deep learning with differential Gaussian process flows
Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, Samuel Kaski
Department of Computer Science, Aalto University
Helsinki Institute for Information Technology HIIT
Abstract
We propose a novel deep learning paradigm of differential flows that learn stochastic differential equation transformations of inputs prior to a standard classification or regression function. The key property of differential Gaussian processes is the warping of inputs through infinitely deep, but infinitesimal, differential fields that generalise discrete layers into a dynamical system. We demonstrate excellent results as compared to deep Gaussian processes and Bayesian neural networks.
1 INTRODUCTION
Gaussian processes are a family of flexible kernel function distributions (Rasmussen and Williams, 2006). The capacity of kernel models is inherently determined by the function space induced by the choice of the kernel, where standard stationary kernels lead to models that underperform in practice. Shallow, or single, Gaussian processes are often suboptimal since flexible kernels that would account for the non-stationary and long-range connections of the data are difficult to design and infer. Such models have been proposed by introducing non-stationary kernels (Tolvanen et al., 2014; Heinonen et al., 2016), kernel compositions (Duvenaud et al., 2011; Sun et al., 2018), spectral kernels (Wilson et al., 2013; Remes et al., 2017), or by applying input warpings (Snoek et al., 2014) or output warpings (Snelson et al., 2004; Lázaro-Gredilla, 2012). Recently, Wilson et al. (2016) proposed to transform the inputs with a neural network prior to a Gaussian process model. The new neural input representation can extract high-level patterns and features; however, it employs rich neural networks that require careful design and optimization.
Deep Gaussian processes elevate the performance of Gaussian processes by mapping the inputs through multiple Gaussian process 'layers' (Damianou and Lawrence, 2013; Salimbeni and Deisenroth, 2017), or as a network of GP nodes (Duvenaud et al., 2011; Wilson et al., 2012; Sun et al., 2018). However, deep GPs result in degenerate models if the individual GPs are not invertible, which limits their capacity (Duvenaud et al., 2014).
In this paper we propose a novel paradigm of learning continuous-time transformations or flows of the data instead of learning a discrete sequence of layers. We apply stochastic differential equation systems in the original data space to transform the inputs before a classification or regression layer. The transformation flow consists of an infinite path of infinitesimal steps. This approach turns the focus from learning iterative function mappings to learning input representations in the original feature space, avoiding learning new feature spaces. A TensorFlow compatible implementation will be made available upon acceptance.

Our experiments show excellent prediction performance on a number of benchmark datasets on classification and regression. The performance of the proposed model is comparable to that of other Bayesian approaches, including deep Gaussian processes.
2 BACKGROUND
We begin by summarising useful background on Gaussian processes and continuous-time dynamical models.
2.1 Gaussian processes
Gaussian processes (GPs) are a family of Bayesian models that characterise distributions of functions (Rasmussen and Williams, 2006). A zero-mean Gaussian process prior on a function f(x) over vector inputs x ∈ R^D,

$$f(\mathbf{x}) \sim \mathcal{GP}(0, K(\mathbf{x}, \mathbf{x}')), \tag{1}$$
defines a prior distribution over function values f(x) whose mean and covariances are

$$\mathbb{E}[f(\mathbf{x})] = 0 \tag{2}$$
$$\operatorname{cov}[f(\mathbf{x}), f(\mathbf{x}')] = K(\mathbf{x}, \mathbf{x}'). \tag{3}$$

A GP prior defines that for any collection of N inputs X = (x_1, ..., x_N)^T, the corresponding function values f = (f(x_1), ..., f(x_N))^T ∈ R^N follow a multivariate normal distribution

$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), \tag{4}$$

where K = (K(x_i, x_j))_{i,j=1}^N ∈ R^{N×N} is the kernel matrix. The key property of GPs is that the output predictions f(x) and f(x') correlate depending on how similar their inputs x and x' are, as measured by the kernel K(x, x') ∈ R.
We consider sparse Gaussian process functions by augmenting the Gaussian process with a small number M of inducing 'landmark' variables u = f(z) (Snelson and Ghahramani, 2006). We condition the GP prior with the inducing variables u = (u_1, ..., u_M)^T ∈ R^M and Z = (z_1, ..., z_M)^T to obtain the GP posterior predictions at the data points

$$\mathbf{f} \mid \mathbf{u}; \mathbf{Z} \sim \mathcal{N}(\mathbf{Q}\mathbf{u}, \mathbf{K}_{XX} - \mathbf{Q}\mathbf{K}_{ZZ}\mathbf{Q}^T) \tag{5}$$
$$\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_{ZZ}), \tag{6}$$

where Q = K_{XZ} K_{ZZ}^{-1}, and where K_{XX} ∈ R^{N×N} is the kernel between observed image pairs X × X, the kernel K_{XZ} ∈ R^{N×M} is between observed images X and inducing images Z, and the kernel K_{ZZ} ∈ R^{M×M} is between inducing images Z × Z. The inference problem of sparse Gaussian processes is to learn the kernel parameters θ (such as the lengthscale) and the conditioning inducing variables u, Z.
2.2 Stochastic differential equations
Stochastic differential equations (SDEs) are an effective formalism for modelling continuous-time systems with underlying stochastic dynamics, with a wide range of applications (Friedrich et al., 2011). We consider multivariate continuous-time systems governed by a Markov process x_t described by the SDE dynamics

$$d\mathbf{x}_t = \boldsymbol{\mu}(\mathbf{x}_t)\,dt + \sqrt{\boldsymbol{\Sigma}(\mathbf{x}_t)}\,dW_t, \tag{7}$$

where x_t ∈ R^D is the state vector of a D-dimensional dynamical system at continuous time t ∈ R, µ(x_t) ∈ R^D is a deterministic state evolution vector field, and √Σ(x_t) ∈ R^{D×D} is the diffusion matrix field of the stochastic multivariate Wiener process W_t ∈ R^D. Here √Σ(x_t) is the square-root matrix of a covariance matrix Σ(x_t), where we assume that Σ(x_t) = √Σ(x_t) √Σ(x_t) holds. A Wiener process has zero initial state W_0 = 0 and independent Gaussian increments W_{t+s} − W_t ∼ N(0, sI_D) over time (see Figure 1).

Figure 1: An example vector field defined by the inducing vectors (a) results in the ODE flow solutions (b) of a 2D system. Including the colored Wiener diffusion (c) leads to SDE trajectory distributions (d).
The SDE system (7) transforms states x_t forward in continuous time by the deterministic drift function µ : R^D → R^D, while the diffusion Σ : R^D → R^{D×D} is the scale of the random Brownian motion W_t that scatters the state x_t with random fluctuations. The state solutions of an SDE are given by the stochastic Itô integral (Oksendal, 2014)

$$\mathbf{x}_t = \mathbf{x}_0 + \int_0^t \boldsymbol{\mu}(\mathbf{x}_\tau)\,d\tau + \int_0^t \sqrt{\boldsymbol{\Sigma}(\mathbf{x}_\tau)}\,dW_\tau, \tag{8}$$

where we integrate the system state forward from an initial state x_0 for time t, and where τ is an auxiliary time variable. SDEs produce continuous but non-smooth trajectories x_{0:t} over time due to the non-differentiable Brownian motion. As a consequence the SDE system does not have a time derivative dx_t/dt, but the stochastic Itô integral (8) can still be defined.

The only non-deterministic part of the solution (8) is the Brownian motion W_τ, whose random realisations generate path realisations x_{0:t} that induce state distributions

$$\mathbf{x}_t \sim p_t(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \mathbf{x}_0) \tag{9}$$

at any instant t, given the drift µ and diffusion Σ from the initial state x_0.
Figure 2: (a) Sparse GP, (b) deep GP, (c) differentially deep GP. The sparse Gaussian process decouples the observations through global inducing variables u_g (a). A deep Gaussian process is a hierarchical model with a nested composition of Gaussian processes, introducing layer dependency (b). In our formulation deepness is introduced as a temporal dependency across states x_i(t) (indicated by the dashed line) with a GP prior over their differential function values f_i (c).
The state distribution is the solution to the Fokker–Planck–Kolmogorov partial differential equation, which is intractable for general non-linear drift and diffusion.
In practice the Euler–Maruyama (EM) numerical solver can be used to simulate trajectory samples from the state distribution (Yildiz et al., 2018) (see Figure 1d). We assume a fixed time discretisation t_1, ..., t_N with ∆t = t_N / N being the time window (Higham, 2001). The EM method at t_k is

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \boldsymbol{\mu}(\mathbf{x}_k)\Delta t + \sqrt{\boldsymbol{\Sigma}(\mathbf{x}_k)}\,\Delta W_k, \tag{10}$$

where ∆W_k = W_{k+1} − W_k ∼ N(0, ∆t I_D). The EM increments ∆x_k = x_{k+1} − x_k correspond to samples from a Gaussian

$$\Delta\mathbf{x}_k \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}_k)\Delta t, \boldsymbol{\Sigma}(\mathbf{x}_k)\Delta t). \tag{11}$$

The full path of N steps is then determined by the N realisations of the Wiener process, each of which is D-dimensional. More efficient higher-order approximations have also been developed (Kloeden and Platen, 1992; Lamba et al., 2006).
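To make the recursion (10) explicit, here is a small NumPy sketch of an Euler–Maruyama simulator; the Ornstein–Uhlenbeck-style drift and the constant diffusion are toy stand-ins chosen only to illustrate the scheme.

```python
import numpy as np

def euler_maruyama(x0, drift, diff_sqrt, T=1.0, n_steps=20, rng=None):
    """Simulate one sample path x_0 -> x_T with n_steps EM increments, Eq. (10)."""
    rng = rng if rng is not None else np.random.default_rng()
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)   # Delta W_k ~ N(0, dt I)
        x = x + drift(x) * dt + diff_sqrt(x) @ dW          # Eq. (10)
        path.append(x.copy())
    return np.stack(path)

# Toy usage in D = 2 dimensions.
drift = lambda x: -x                                       # pull towards the origin
diff_sqrt = lambda x: 0.3 * np.eye(2)                      # constant sqrt-diffusion
path = euler_maruyama(np.array([1.0, -1.0]), drift, diff_sqrt)
print(path.shape)                                          # (21, 2)
```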
SDE systems are often constructed by manually defining drift and diffusion functions to model specific systems in finance, biology, physics or other domains (Friedrich et al., 2011). Recently, several works have proposed learning arbitrary drift and diffusion functions from data (Papaspiliopoulos et al., 2012; García et al., 2017; Yildiz et al., 2018).
3 DEEP DIFFERENTIAL GAUSSIAN PROCESS
In this paper we propose a paradigm of continuous-time deep learning, where the inputs x_i are not treated as constant but are instead driven by an SDE system. We propose a continuous-time deep Gaussian process model through infinite, infinitesimal differential compositions, denoted as DiffGP. In DiffGP, a Gaussian process warps or flows an input x through an SDE system until a predefined time T, resulting in x(T), which is subsequently classified or regressed with a separate function. We apply the process to both training and test inputs. We impose GP priors on both the stochastic differential fields and the predictor function (see Figure 2). A key parameter of the differential GP model is the amount of simulation time T, which defines the length of the flow and the capacity of the system, analogously to the number of layers in standard deep GPs or deep neural networks.
We assume a dataset of N inputs X = (x_1, ..., x_N)^T ∈ R^{N×D} of D-dimensional vectors x_i ∈ R^D, and associated scalar outputs y = (y_1, ..., y_N)^T ∈ R^N that can be continuous for a regression problem or categorical for classification, respectively. We redefine the inputs as temporal functions x : T → R^D over time, such that state paths x_t over time t ∈ T = R_+ emerge, where the observed inputs x_i := x_{i,0} correspond to the initial states at time 0. We classify or regress the final data points X_T = (x_{1,T}, ..., x_{N,T})^T after T time of an SDE flow with a predictor Gaussian process

$$g(\mathbf{x}_T) \sim \mathcal{GP}(0, K(\mathbf{x}_T, \mathbf{x}'_T)) \tag{12}$$

to classify or regress the outputs y. The framework reduces to a conventional Gaussian process with zero flow time T = 0 (see Figure 2).
The prediction depends on the structure of the final dataset X_T, determined by the SDE flow dx_t from the original data X. We consider SDE flows of the type

$$d\mathbf{x}_t = \boldsymbol{\mu}(\mathbf{x}_t)\,dt + \sqrt{\boldsymbol{\Sigma}(\mathbf{x}_t)}\,dW_t, \tag{13}$$

where

$$\boldsymbol{\mu}(\mathbf{x}) = \mathbf{K}_{\mathbf{x}Z_f}\mathbf{K}_{Z_f Z_f}^{-1}\operatorname{vec}(\mathbf{U}_f) \tag{14}$$
$$\boldsymbol{\Sigma}(\mathbf{x}) = \mathbf{K}_{\mathbf{x}\mathbf{x}} - \mathbf{K}_{\mathbf{x}Z_f}\mathbf{K}_{Z_f Z_f}^{-1}\mathbf{K}_{Z_f\mathbf{x}} \tag{15}$$
are the mean and covariance of the vector-valued Gaussian process conditioned on the inducing variables U_f = (u_1^f, ..., u_M^f)^T, which define the function values f(z) at the inducing states Z_f = (z_1^f, ..., z_M^f). These choices of drift and diffusion correspond to an underlying time-invariant GP

$$\mathbf{f} \sim \mathcal{GP}(\mathbf{0}, \mathbf{K}(\mathbf{x}, \mathbf{x}')) \tag{16}$$
$$\mathbf{f} \mid \mathbf{U}_f, Z_f, \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \boldsymbol{\Sigma}(\mathbf{x})), \tag{17}$$

where K(x, x') ∈ R^{D×D} is a matrix-valued kernel of the vector field f(x) ∈ R^D, and K_{Z_f Z_f} = (K(z_i^f, z_j^f))_{i,j=1}^M ∈ R^{MD×MD} is the block matrix of matrix-valued kernels (similarly for K_{x Z_f}).
The vector field f(x) is now a GP with a deterministic conditional mean µ and covariance Σ at every location x, given the inducing variables. We encode the underlying GP field mean and covariance uncertainty into the drift and diffusion of the SDE flow (13). The Wiener process W_t of an SDE samples a new fluctuation from the covariance Σ around the mean µ at every instant t. An affine transformation of the GP field (17),

$$\boldsymbol{\mu}(\mathbf{x})\Delta t + (\mathbf{f}(\mathbf{x}) - \boldsymbol{\mu}(\mathbf{x}))\sqrt{\Delta t} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x})\Delta t, \boldsymbol{\Sigma}(\mathbf{x})\Delta t), \tag{18}$$

shows that sampling from the GP vector field with the temporal discretisation of (18) matches the SDE Euler–Maruyama increment distribution ∆x_k of (11). The state distribution p_T(x; µ, Σ, x_0) can then be represented as p(x_T | U_f) = ∫ p(x_T | f) p(f | U_f) df, where p(x_T | f) is a Dirac distribution of the end point of a single Euler–Maruyama simulated path, and where the vector field p(f | U_f) is marginalized along the Euler–Maruyama path.
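As a quick sanity check of the affine identity (18), the sketch below compares, for one fixed state, an increment formed by drawing the field value f(x) ~ N(µ, Σ) and rescaling it as in (18) against an Euler–Maruyama increment drawn directly from (11); the 2D drift and square-root diffusion values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])                        # drift mu(x) at a fixed state x
L = np.array([[0.6, 0.0], [0.2, 0.4]])            # sqrt-diffusion, Sigma = L L^T
dt, n = 0.05, 200_000

# Route 1: draw f(x) ~ N(mu, Sigma), then rescale as in Eq. (18).
f = mu[:, None] + L @ rng.normal(size=(2, n))
inc_gp = mu[:, None] * dt + (f - mu[:, None]) * np.sqrt(dt)

# Route 2: the Euler-Maruyama increment N(mu dt, Sigma dt) of Eq. (11).
inc_em = mu[:, None] * dt + np.sqrt(dt) * (L @ rng.normal(size=(2, n)))

print(inc_gp.mean(axis=1), inc_em.mean(axis=1))   # both approx. mu * dt
print(np.cov(inc_gp))                             # approx. Sigma * dt
print(np.cov(inc_em))                             # approx. Sigma * dt
```

The two estimators agree up to Monte Carlo error, which is exactly the equivalence used above to simulate the flow with samples from the GP field.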
Our model corresponds closely to the doubly-stochastic deep GP, where the Wiener process is replaced by random draws from the GP posterior ε_l · Σ_l(f^{l−1}) per layer l (Salimbeni and Deisenroth, 2017). In our approach the continuous time t corresponds to continuously indexed states, effectively allowing infinitely many layers that are infinitesimal.
3.1 Spatio-temporal fields
Earlier we assumed a global, time-independent vector field f(x_t), which in standard models would correspond to a single 'layer' applied recurrently over time t. To extend the model capacity, we consider spatio-temporal vector fields f_t(x) := f(x, t) that themselves evolve as a function of time, effectively applying a smoothly changing vector field 'layer' at every instant t. We select a separable spatio-temporal kernel K((x, t), (x', t')) = K(x, x') k(t, t') that leads to an efficient Kronecker-factorised (Stegle et al., 2011) spatio-temporal SDE flow

$$\mathbf{f}_t \mid Z_f^s, Z_f^t, \mathbf{U}_f, \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_t(\mathbf{x}), \boldsymbol{\Sigma}_t(\mathbf{x})) \tag{19}$$
$$\boldsymbol{\mu}_t(\mathbf{x}) = \mathbf{C}_{\mathbf{x}Z_f}\mathbf{C}_{Z_f Z_f}^{-1}\operatorname{vec}(\mathbf{U}_f) \tag{20}$$
$$\boldsymbol{\Sigma}_t(\mathbf{x}) = \mathbf{C}_{\mathbf{x}\mathbf{x}} - \mathbf{C}_{\mathbf{x}Z_f}\mathbf{C}_{Z_f Z_f}^{-1}\mathbf{C}_{Z_f\mathbf{x}}, \tag{21}$$

where C_xx = K_xx k_tt, C_xZ = K_{x Z_f^s} ⊗ K_{t Z_f^t} and C_{Z_f Z_f} = K_{Z_f^s Z_f^s} ⊗ K_{Z_f^t Z_f^t}, and where the spatial inducing states are denoted by Z_f^s and the temporal inducing times by Z_f^t. In practice we usually place only a few (e.g. 3) temporal inducing times equidistantly on the range [0, T]. This allows the vector field itself to curve smoothly throughout the SDE. We only have a single inducing matrix U_f for both the spatial and temporal dimensions.
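The Kronecker factorisation above can be illustrated with a few lines of NumPy: the joint covariance over all spatial and temporal inducing points is the Kronecker product of a spatial and a temporal kernel matrix. The RBF kernels and the inducing locations below are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

Zs = np.random.randn(10, 2)                  # 10 spatial inducing states in 2D
Zt = np.linspace(0.0, 5.0, 3)[:, None]       # 3 temporal inducing times on [0, T]

K_space = rbf(Zs, Zs)                        # (10, 10) spatial kernel
K_time = rbf(Zt, Zt, lengthscale=2.0)        # (3, 3) temporal kernel
C_zz = np.kron(K_space, K_time)              # (30, 30) factorised C_{Z_f Z_f}
print(C_zz.shape)
```

The Kronecker structure also keeps the required inverse cheap, since (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}.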
3.2 Stochastic variational inference
The differential Gaussian process is a combination of a conventional prediction GP g(·) and an SDE flow GP f(·), fully parameterised by Z, U as well as the kernel parameters θ. We turn to variational inference to estimate posterior approximations q(U_f) and q(u_g) for both models.
We start by augmenting the predictor function g with M inducing locations Z_g = (z_{g1}, ..., z_{gM}) with associated inducing function values g(z) = u collected in a vector u_g = (u_{g1}, ..., u_{gM})^T ∈ R^M. We aim to learn the distribution of the inducing values u while learning point estimates of the inducing locations Z, which we hence omit from the notation below. The prediction conditional distribution is (Titsias, 2009)

$$p(\mathbf{g} \mid \mathbf{u}_g, \mathbf{X}_T) = \mathcal{N}(\mathbf{g} \mid \mathbf{Q}_T\mathbf{u}_g, \mathbf{K}_{X_T X_T} - \mathbf{Q}_T\mathbf{K}_{Z_g Z_g}\mathbf{Q}_T^T) \tag{22}$$
$$p(\mathbf{u}_g) = \mathcal{N}(\mathbf{u}_g \mid \mathbf{0}, \mathbf{K}_{Z_g Z_g}), \tag{23}$$

where we denote Q_T = K_{X_T Z_g} K_{Z_g Z_g}^{-1}.
The joint density of a single path and prediction of the augmented system is

$$p(\mathbf{y}, \mathbf{g}, \mathbf{u}_g, \mathbf{X}_T, \mathbf{f}, \mathbf{U}_f \mid \mathbf{X}_0) = \underbrace{p(\mathbf{y} \mid \mathbf{g})}_{\text{likelihood}}\;\underbrace{p(\mathbf{g} \mid \mathbf{u}_g, \mathbf{X}_T)\,p(\mathbf{u}_g)}_{\text{GP prior of } g(\mathbf{x})}\;\underbrace{p(\mathbf{X}_T \mid \mathbf{f}; \mathbf{X}_0)}_{\text{SDE}}\;\underbrace{p(\mathbf{f} \mid \mathbf{U}_f)\,p(\mathbf{U}_f)}_{\text{GP prior of } f(\mathbf{x})}. \tag{24}$$

The joint distribution contains the likelihood term, the two GP priors, and the SDE term p(X_T | f; X_0) representing the Euler–Maruyama paths of the dataset. Henceforth, we also omit the explicit conditioning of the state distributions on the initial states or the observed dataset, p(X_T | f) := p(X_T | f; X_0). The inducing vector field prior follows

$$p(\mathbf{U}_f) = \prod_{d=1}^D \mathcal{N}(\mathbf{u}_{fd} \mid \mathbf{0}, \mathbf{K}_{Z_{fd} Z_{fd}}), \tag{25}$$

where u_{fd} = (u_1^f(d), ..., u_M^f(d))^T and Z_{fd} = (z_1^f(d), ..., z_M^f(d))^T.
Figure 3: (a) Illustration of samples from a 2D deep Gaussian process prior. The DGP prior exhibits a pathology wherein representations in deeper layers concentrate on low-rank manifolds. (b) Samples from a differentially deep Gaussian process prior result in rank-preserving representations. (c) The continuous-time nature of the warping trajectories results from smooth drift and structured diffusion (d).
We consider optimizing the marginal log likelihood

$$\log p(\mathbf{y}) = \log \mathbb{E}_{p(\mathbf{g} \mid \mathbf{X}_T)\,p(\mathbf{X}_T)}\,p(\mathbf{y} \mid \mathbf{g}), \tag{26}$$

where p(g | X_T) is a Gaussian process predictive distribution, and the state distribution p(X_T) marginalizes the trajectories,

$$p(\mathbf{X}_T) = \iint p(\mathbf{X}_T \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{U}_f)\,p(\mathbf{U}_f)\,d\mathbf{f}\,d\mathbf{U}_f, \tag{27}$$

with no tractable solution.
We follow the stochastic variational inference (SVI) of Hensman et al. (2015), where standard variational inference (Blei et al., 2016) is applied to find a lower bound of the marginal log likelihood, in other words the model evidence. In particular, a variational lower bound for the evidence (26) without the state distributions has already been considered by Hensman et al. (2015), which tackles both the problem of cubic complexity O(N^3) and the marginalization of non-Gaussian likelihoods. We propose to include the state distributions by simulating Monte Carlo state trajectories.
We propose a complete variational posterior approximation over both f and g,

$$q(\mathbf{g}, \mathbf{u}_g, \mathbf{X}_T, \mathbf{f}, \mathbf{U}_f) = p(\mathbf{g} \mid \mathbf{u}_g, \mathbf{X}_T)\,q(\mathbf{u}_g)\,p(\mathbf{X}_T \mid \mathbf{f})\,p(\mathbf{f} \mid \mathbf{U}_f)\,q(\mathbf{U}_f) \tag{28}$$
$$q(\mathbf{u}_g) = \mathcal{N}(\mathbf{u}_g \mid \mathbf{m}_g, \mathbf{S}_g) \tag{29}$$
$$q(\mathbf{U}_f) = \prod_{d=1}^D \mathcal{N}(\mathbf{u}_{fd} \mid \mathbf{m}_{fd}, \mathbf{S}_{fd}), \tag{30}$$

where M_f = (m_{f1}, ..., m_{fD}) and S_f = (S_{f1}, ..., S_{fD}) collect the dimension-wise inducing parameters. We continue by marginalizing out the inducing variables u_g and U_f from the above joint distribution, arriving at the joint variational posterior

$$q(\mathbf{g}, \mathbf{X}_T, \mathbf{f}) = q(\mathbf{g} \mid \mathbf{X}_T)\,p(\mathbf{X}_T \mid \mathbf{f})\,q(\mathbf{f}), \tag{31}$$
where

$$q(\mathbf{g} \mid \mathbf{X}_T) = \int p(\mathbf{g} \mid \mathbf{u}_g, \mathbf{X}_T)\,q(\mathbf{u}_g)\,d\mathbf{u}_g \tag{32}$$
$$= \mathcal{N}(\mathbf{g} \mid \mathbf{Q}_T\mathbf{m}_g, \mathbf{K}_{X_T X_T} + \mathbf{Q}_T(\mathbf{S}_g - \mathbf{K}_{Z_g Z_g})\mathbf{Q}_T^T) \tag{33}$$
$$q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{U}_f)\,q(\mathbf{U}_f)\,d\mathbf{U}_f = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q) \tag{34}$$
$$\boldsymbol{\mu}_q = \mathbf{Q}_f\operatorname{vec}(\mathbf{M}_f) \tag{35}$$
$$\boldsymbol{\Sigma}_q = \mathbf{K}_{XX} + \mathbf{Q}_f(\mathbf{S}_f - \mathbf{K}_{Z_f Z_f})\mathbf{Q}_f^T, \tag{36}$$

where Q_f = K_{X Z_f} K_{Z_f Z_f}^{-1}. We plug the derived variational posterior drift µ_q and diffusion Σ_q estimates into the final variational SDE flow

$$d\mathbf{x}_t = \boldsymbol{\mu}_q(\mathbf{x}_t)\,dt + \sqrt{\boldsymbol{\Sigma}_q(\mathbf{x}_t)}\,dW_t, \tag{37}$$

which conveniently encodes the variational approximation of the vector field f.
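For concreteness, the following NumPy sketch evaluates the variational field moments (35)-(36) at a single state under a diagonal treatment of Σ_q (the experiments in Section 4 likewise use a diagonal approximation of Σ_q). The shared scalar RBF kernel and the toy variational parameters M_f, S_f are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def variational_field(x, Zf, Mf, Sf, jitter=1e-6):
    """Marginal mean (35) and per-dimension variance (36) of q(f(x)),
    with one shared scalar kernel across the D output dimensions."""
    Kzz = rbf(Zf, Zf) + jitter * np.eye(len(Zf))
    Q = rbf(x[None, :], Zf) @ np.linalg.inv(Kzz)           # (1, M)
    mu_q = (Q @ Mf)[0]                                     # (D,) drift mu_q(x)
    kxx = rbf(x[None, :], x[None, :])[0, 0]
    var_q = np.array([kxx + (Q @ (Sf[d] - Kzz) @ Q.T)[0, 0]
                      for d in range(Mf.shape[1])])        # diagonal of Sigma_q(x)
    return mu_q, np.maximum(var_q, 1e-12)

# Toy usage: M = 8 inducing states, D = 2 output dimensions.
rng = np.random.default_rng(1)
Zf = rng.normal(size=(8, 2))
Mf = 0.1 * rng.normal(size=(8, 2))                         # variational means
Sf = np.stack([0.01 * np.eye(8) for _ in range(2)])        # variational covariances
mu_q, var_q = variational_field(np.array([0.3, -0.2]), Zf, Mf, Sf)
print(mu_q, var_q)
```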
Now the lower bound for our differential deep GP model can be written as (a detailed derivation is provided in the appendix)

$$\log p(\mathbf{y}) \geq \sum_{i=1}^N \underbrace{\frac{1}{S}\sum_{s=1}^S \mathbb{E}_{q(g_i \mid \mathbf{x}_{i,T}^{(s)})}\log p(y_i \mid g_i)}_{\text{variational expected likelihood}} - \underbrace{\operatorname{KL}[q(\mathbf{u}_g)\,\|\,p(\mathbf{u}_g)]}_{\text{prior divergence of } g(\mathbf{x})} - \underbrace{\operatorname{KL}[q(\mathbf{U}_f)\,\|\,p(\mathbf{U}_f)]}_{\text{prior divergence of } f(\mathbf{x})}, \tag{38}$$

which factorises over both the data and the SDE paths, with unbiased samples x_{i,T}^{(s)} ∼ p_T(x; µ_q, Σ_q, x_i) obtained by numerically solving the variational SDE (37) with the Euler–Maruyama method.
For likelihoods such as the Gaussian for regression problems, we can further marginalize g from the lower bound, as shown by Hensman et al. (2013). For other, intractable likelihoods, numerical integration techniques such as Gauss–Hermite quadrature can be used (Hensman et al., 2015).
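A schematic Monte Carlo estimator of the bound (38) for the Gaussian-likelihood case is sketched below. Here flow_sample, predict_gp, kl_ug and kl_Uf are hypothetical callables standing in for the variational SDE simulator (37), the predictive marginals (33), and the two closed-form Gaussian KL terms; they are not the paper's API.

```python
import numpy as np

def elbo_estimate(X, y, flow_sample, predict_gp, kl_ug, kl_Uf,
                  noise_var=0.1, n_paths=5):
    """Stochastic lower bound (38): average over S sampled SDE end points,
    minus the two inducing-prior KL divergences."""
    expected_loglik = 0.0
    for _ in range(n_paths):
        X_T = flow_sample(X)                 # one Euler-Maruyama path per input
        mean, var = predict_gp(X_T)          # marginal moments of q(g | X_T)
        # Closed-form E_q[log N(y | g, noise_var)] for a Gaussian likelihood.
        expected_loglik += np.sum(
            -0.5 * np.log(2.0 * np.pi * noise_var)
            - 0.5 * ((y - mean) ** 2 + var) / noise_var
        )
    return expected_loglik / n_paths - kl_ug() - kl_Uf()
```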
3.3 Rank pathologies in deep models
A deep Gaussian process f_L(··· f_2(f_1(x))) is a composition of L Gaussian process layers f_l(x) (Damianou and Lawrence, 2013). These models typically lead to degenerate covariances, where each layer in the composition reduces the rank or degrees of freedom of the system (Duvenaud et al., 2014). In practice the rank reduces via successive layers mapping inputs to identical values (see Figure 3a), effectively merging inputs and resulting in a reduced-rank covariance matrix with repeated rows and columns. To counter this pathology, Salimbeni and Deisenroth (2017) proposed pseudo-monotonic deep GPs by using an identity mean function in all intermediate GP layers.
Unlike the earlier approaches, our model does not seem to suffer from this degeneracy. The DiffGP model warps the input space without seeking low-volume representations. In particular, the SDE diffusion scatters the trajectories, preventing both narrow manifolds and input merging. In practice, this results in a rank-preserving model (see Figure 3b-d).
4 EXPERIMENTS
We optimize the inducing vectors, inducing locations, kernel lengthscales and signal variance of both the SDE function f of equation (13) and the predictor function g(x_T). We also optimize the noise variance in problems with Gaussian likelihoods. The number of inducing points M is chosen manually, where more inducing points tighten the variational approximation at the cost of additional computation. We train the model end-to-end, and all parameters are jointly optimised against the evidence lower bound (38). The gradients of the lower bound back-propagate through the prediction function g(x_T) and through the SDE system from x(T) back to the initial values x(0). Gradients of an SDE system approximated by an EM method can be obtained with the automatic differentiation of TensorFlow (Abadi et al., 2016). The gradients of continuous-time systems follow from forward or reverse mode sensitivity equations (Kokotovic and Heller, 1967; Raue et al., 2013; Fröhlich et al., 2017; Yildiz et al., 2018). We perform stochastic optimization with mini-batches and the Adam optimizer (Kingma and Ba, 2014) with a step size of 0.01. For numerical solutions of the SDE, we use the Euler–Maruyama solver with 20 time steps. Initializing the parameters of g(·) with values learned through a sparse GP (SGP) results in early convergence; we initialize DiffGP training with SGP results and a very weak warping field U_f ≈ 0 and kernel variance σ²_f ≈ 0.01. We use a diagonal approximation of Σ_q. Our implementation uses GPflow (Matthews et al., 2017), a Gaussian process framework built on TensorFlow.
4.1 Step function estimation
We begin by highlighting how the DiffGP estimates a signal with multiple highly non-stationary step functions. Figure 4 shows the univariate signal observations (top), the learned SDE flow (middle), and the resulting regression function on the end points X(t) (bottom). The DiffGP separates the regions around the step functions such that the final regression function g with a standard stationary Gaussian kernel can fit the transformed data X(t). The model has then learned the non-stationarities of the system, with the uncertainty in the signals being modelled by the inherent uncertainties arising from the diffusion.
| Method | boston | energy | concrete | wine red | kin8mn | power | naval | protein |
|---|---|---|---|---|---|---|---|---|
| N | 506 | 768 | 1,030 | 1,599 | 8,192 | 9,568 | 11,934 | 45,730 |
| D | 13 | 8 | 8 | 22 | 8 | 4 | 26 | 9 |
| Linear | 4.24(0.16) | 2.88(0.05) | 10.54(0.13) | 0.65(0.01) | 0.20(0.00) | 4.51(0.03) | 0.01(0.00) | 5.21(0.02) |
| BNN (L=2) | 3.01(0.18) | 1.80(0.05) | 5.67(0.09) | 0.64(0.01) | 0.10(0.00) | 4.12(0.03) | 0.01(0.00) | 4.73(0.01) |
| Sparse GP (M=100) | 2.87(0.15) | 0.78(0.02) | 5.97(0.11) | 0.63(0.01) | 0.09(0.00) | 3.91(0.03) | 0.00(0.00) | 4.43(0.03) |
| Sparse GP (M=500) | 2.73(0.12) | 0.47(0.02) | 5.53(0.12) | 0.62(0.01) | 0.08(0.00) | 3.79(0.03) | 0.00(0.00) | 4.10(0.03) |
| Deep GP (M=100, L=2) | 2.90(0.17) | 0.47(0.01) | 5.61(0.10) | 0.63(0.01) | 0.06(0.00) | 3.79(0.03) | 0.00(0.00) | 4.00(0.03) |
| Deep GP (M=100, L=3) | 2.93(0.16) | 0.48(0.01) | 5.64(0.10) | 0.63(0.01) | 0.06(0.00) | 3.73(0.04) | 0.00(0.00) | 3.81(0.04) |
| Deep GP (M=100, L=4) | 2.90(0.15) | 0.48(0.01) | 5.68(0.10) | 0.63(0.01) | 0.06(0.00) | 3.71(0.04) | 0.00(0.00) | 3.74(0.04) |
| Deep GP (M=100, L=5) | 2.92(0.17) | 0.47(0.01) | 5.65(0.10) | 0.63(0.01) | 0.06(0.00) | 3.68(0.03) | 0.00(0.00) | 3.72(0.04) |
| DiffGP (M=100, T=1.0) | 2.80(0.13) | 0.49(0.02) | 5.32(0.10) | 0.63(0.01) | 0.06(0.00) | 3.76(0.03) | 0.00(0.00) | 4.04(0.04) |
| DiffGP (M=100, T=2.0) | 2.68(0.10) | 0.48(0.02) | 4.96(0.09) | 0.63(0.01) | 0.06(0.00) | 3.72(0.03) | 0.00(0.00) | 4.00(0.04) |
| DiffGP (M=100, T=3.0) | 2.69(0.14) | 0.47(0.02) | 4.76(0.12) | 0.63(0.01) | 0.06(0.00) | 3.68(0.03) | 0.00(0.00) | 3.92(0.04) |
| DiffGP (M=100, T=4.0) | 2.67(0.13) | 0.49(0.02) | 4.65(0.12) | 0.63(0.01) | 0.06(0.00) | 3.66(0.03) | 0.00(0.00) | 3.89(0.04) |
| DiffGP (M=100, T=5.0) | 2.58(0.12) | 0.50(0.02) | 4.56(0.12) | 0.63(0.01) | 0.06(0.00) | 3.65(0.03) | 0.00(0.00) | 3.87(0.04) |

Table 1: Test RMSE values on 8 benchmark datasets (baselines reproduced from Salimbeni and Deisenroth, 2017). Random 90% / 10% training and test splits, repeated 20 times.
4.2 UCI regression benchmarks
We compare our model on 8 regression benchmarks with the previously reported state-of-the-art results in (Salimbeni and Deisenroth, 2017). We test all the datasets with different flow time values from 1 to 5. We use the RBF kernel with ARD and 100 inducing points for both the differential Gaussian process and the regression Gaussian process. Each experiment is repeated 20 times with random 90% / 10% training and test splits. At test time, we compute the predictive mean and predictive variance for each sample generated from (37), and average the summary statistics (RMSE and log likelihood) over these samples. The mean and standard error of the RMSE values are reported in Table 1.
On the Boston, Concrete and Power datasets, where deep models show improvement over shallow models, our model outperforms the previous best results of DGPs. There is a small improvement from having a non-linear model on the Kin8mn dataset, and our results match those of DGPs. Energy and Wine are small datasets where single Gaussian processes perform the best. As expected, both DiffGP and DGP recover the shallow model, indicating no over-fitting. The regression task on the Protein dataset is aimed at predicting the RMSD (root mean squared deviation) between modeled and native protein structures using 9 different properties of the modeled structures (Rana et al., 2015). We suspect DGP performs better than DiffGP on this particular task because of its capability to model long-range correlations.
4.3 UCI classification benchmarks
We perform binary classification experiments on the large-scale HIGGS and SUSY datasets, with data sizes in the order of millions. We use the AUC as the performance measure and compare the results with previously reported results using DGPs (Salimbeni and Deisenroth, 2017) and DNNs (Baldi et al., 2014). The classification task involves identifying processes that produce a Higgs boson or super-symmetric particles using data from Monte Carlo simulations. Previously, deep learning methods based on neural networks have shown promising results on these tasks (Baldi et al., 2014). On the HIGGS dataset, the proposed DiffGP model shows state-of-the-art results (0.878), equal to or even better than the earlier reported results using DGPs (0.877) and DNNs (0.876). On the SUSY dataset, we reach the performance of a 4-hidden-layer DGP (0.841) with the non-temporal DiffGP (0.842). Considering the consistent improvement in the performance of DGP models with additional layers, we tried increasing the capacity of the DiffGP model using the temporal extension proposed in Section 3.1. In particular, we used 100 spatial inducing vectors along with 3 temporal inducing vectors. The temporal DiffGP model gives an AUC of 0.878 on HIGGS and 0.846 on SUSY, matching the best reported results of DGPs (see the appendix for a detailed comparison).
4.4 Importance of flow time
In this experiment we study the SDE flow time parameter on the Concrete dataset. Increasing the integration time provides more warping flexibility to the SDE component.
Figure 4: Step function estimation: the observed input space (a) is transformed through stochastic continuous-time mappings (b) into a warped space (c). The stationary Gaussian process in the warped space gives a smooth predictive distribution corresponding to highly non-stationary predictions in the original observed space.
Figure 5: Concrete dataset: increasing the flow time variable T improves the training and test errors (a, c) and likelihoods (b, d). The horizontal line indicates GP and DGP2 performance. The model convergence indicates the improved capacity upon increased flow time (e).
That is, with an increase in the flow time, the SDE system can move observations further away from the initial state, however at the cost of exposing the state to more diffusion, which acts as regularization. Hence, increasing time can lead to an increase in model capacity without over-fitting. We empirically support this claim in the current experiment by fitting a regression model multiple times while maintaining the same experimental setup, except for the flow time. Figure 5 shows the variation in RMSE, log likelihood and the lower bound on the marginal likelihood across different flow times. It can be seen that the improvement in performance almost saturates near time T = 10.
5 DISCUSSION
We have proposed a novel continuous-time deep learning approach with Gaussian processes. The proposed differentially deep composition is a continuous-time approach wherein the input locations of a Gaussian process are warped through stochastic and smooth differential equations. This results in a principled Bayesian approach with a smooth non-linear warping; the uncertainty through diffusion acts as a key regularizer.
We empirically show excellent results in various regression and classification tasks. The DGP with the model specification proposed by Salimbeni and Deisenroth (2017) uses a total of O(LDM) inducing parameters for the regression results, where L is the number of layers, D is the input dimension and M is the number of inducing points for each latent GP. In contrast, with a smaller number of inducing parameters, O(DM), we arrive at similar or even better results.
The continuous-time deep model admits 'decision-making paths', where we can explicitly follow the transformation applied to a data point x_i. Analyzing these paths could lead to a more interpretable model. However, modeling in the input space without intermediate low-dimensional latent representations presents scalability issues. We leave scaling the approach to high dimensions as future work, while we also intend to explore new optimisation modes, such as SG-MCMC (Ma et al., 2015) or Stein inference (Liu and Wang, 2016), in the future.
Acknowledgments
We acknowledge the computational resources provided by Aalto Science-IT. This work has been supported by the Academy of Finland grants no. 299915, 319264, 313195 and 294238.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

D. Blei, A. Kucukelbir, and J. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859-877, 2016.

Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207-215, 2013.

David Duvenaud, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, pages 202-210, 2014.

David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226-234, 2011.

Rudolf Friedrich, Joachim Peinke, Muhammad Sahimi, and M. Reza Rahimi Tabar. Approaching complexity by stochastic methods: From biological systems to turbulence. Physics Reports, 506(5):87-162, 2011.

Fabian Fröhlich, Barbara Kaltenbacher, Fabian J. Theis, and Jan Hasenauer. Scalable parameter estimation for genome-scale biochemical reaction networks. PLOS Computational Biology, 13(1):1-18, 2017. doi: 10.1371/journal.pcbi.1005331.

C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez. Nonparametric estimation of stochastic differential equations with sparse Gaussian processes. Physical Review E, 96(2):022104, 2017.

M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In AISTATS, volume 51, pages 732-740, 2016.

J. Hensman, N. Fusi, and N. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282-290. AUAI Press, 2013.

J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, pages 351-360, 2015.

Desmond Higham. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43:525-546, 2001.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations, 2014.

P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Applications of Mathematics. Springer-Verlag, 1992. ISBN 9783540540625.

P. Kokotovic and J. Heller. Direct and adjoint sensitivity equations for parameter optimization. IEEE Transactions on Automatic Control, 12(5):609-610, 1967.

H. Lamba, Jonathan C. Mattingly, and Andrew M. Stuart. An adaptive Euler-Maruyama scheme for SDEs: convergence and stability. IMA Journal of Numerical Analysis, 27:479-506, 2006.

Miguel Lázaro-Gredilla. Bayesian warped Gaussian processes. In Advances in Neural Information Processing Systems, pages 1619-1627, 2012.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917-2925, 2015.

Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1-6, 2017. URL http://jmlr.org/papers/v18/16-537.html.

B. Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 6th edition, 2014.

Omiros Papaspiliopoulos, Yvo Pokern, Gareth O. Roberts, and Andrew M. Stuart. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3):511-531, 2012.

Prashant Singh Rana, Harish Sharma, Mahua Bhattacharya, and Anupam Shukla. Quality assessment of modeled protein structure using physicochemical properties. Journal of Bioinformatics and Computational Biology, 13(02):1550005, 2015.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Andreas Raue, Marcel Schilling, Julie Bachmann, Andrew Matteson, Max Schelker, Daniel Kaschek, Sabine Hug, Clemens Kreutz, Brian D. Harms, Fabian J. Theis, Ursula Klingmüller, and Jens Timmer. Lessons learned from quantitative dynamical modeling in systems biology. PLOS ONE, 8(9):1-17, 2013.

S. Remes, M. Heinonen, and S. Kaski. Non-stationary spectral kernels. In Advances in Neural Information Processing Systems, 2017.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4591-4602, 2017.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264, 2006.

Edward Snelson, Zoubin Ghahramani, and Carl E. Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems, pages 337-344, 2004.

Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674-1682, 2014.

Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, and Karsten M. Borgwardt. Efficient inference in matrix-variate Gaussian models with iid observation noise. In Advances in Neural Information Processing Systems, pages 630-638, 2011.

S. Sun, G. Zhang, C. Wang, W. Zeng, J. Li, and R. Grosse. Differentiable compositional kernel learning for Gaussian processes. In International Conference on Machine Learning, 2018.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567-574, 2009.

Ville Tolvanen, Pasi Jylänki, and Aki Vehtari. Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on, pages 1-6. IEEE, 2014.

A. Wilson, E. Gilboa, A. Nehorai, and J. Cunningham. Fast multidimensional pattern extrapolation with Gaussian processes. In Artificial Intelligence and Statistics, 2013.

Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning, pages 1139-1146. Omnipress, 2012.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370-378, 2016.

Cagatay Yildiz, Markus Heinonen, Jukka Intosalmi, Henrik Mannerström, and Harri Lähdesmäki. Learning stochastic differential equations with Gaussian processes without gradient matching. In Machine Learning in Signal Processing, 2018.