
Vecchia-Laplace Approximations for Generalized Gaussian Processes

Matthias Katzfuss

Department of Statistics
Texas A&M University

Joint work with Daniel Zilber

September 28, 2018

Matthias Katzfuss (Texas A&M) Vecchia-Laplace Approximations September 28, 2018 1 / 27

Outline

1 Gaussian processes

2 The general Vecchia framework
   Overview and basic idea
   Gaussian noise

3 Vecchia-Laplace approximations for generalized GPs

4 Conclusions


Gaussian processes


Function estimation

Consider a function, observed incompletely and with noise/error



Gaussian processes (GPs): Probabilistic function estimators


GPs provide an optimal function estimate under the assumption of an infinite-dimensional normal distribution and quantify uncertainty in the form of a joint probability distribution



Challenge: GPs are not scalable

For n data points, need to work with n × n covariance matrix
→ Direct inference has O(n^3) time and O(n^2) memory complexity

[Figure: computation time (hours) versus n (thousands), comparing cubic and linear scaling]

Want methods/approximations that scale linearly in n


Existing approaches for computational feasibility

Let Σ be the n × n data covariance matrix.

Existing approaches include:

• Sparse Σ (e.g., Furrer et al., 2006; Kaufman et al., 2008)

• Sparse Σ^{-1} (e.g., Rue and Held, 2005; Lindgren et al., 2011; Nychka et al., 2015)

• Low-rank Σ (e.g., Higdon, 1998; Wikle and Cressie, 1999; Quinonero-Candela and Rasmussen, 2005; Banerjee et al., 2008; Cressie and Johannesson, 2008)

• Sparse Cholesky factor of Σ^{-1}: Vecchia


The general Vecchia framework



The general Vecchia framework Overview and basic idea




Vecchia approximation

Assume $x = (x_1, \ldots, x_N)$ is multivariate Gaussian. Density function can be factorized as

$$f(x) = \prod_{i=1}^{N} f(x_i \mid x_{1:i-1}),$$

where $x_{1:i-1} := (x_1, \ldots, x_{i-1})$.

This factorization motivates the Vecchia (1988) approximation:

$$\hat{f}(x) = \prod_{i=1}^{N} f(x_i \mid x_{g(i)}),$$

where $g(i) \subset \{1, \ldots, i-1\}$ is the conditioning set of size $|g(i)| \le m$.

If screening effect holds, can get good approximation for $m \ll N$, which can lead to enormous computational savings.
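
The product of conditionals above can be sketched in a few lines. This is a minimal illustration, not the GPvecchia implementation: the exponential covariance, the 1-D grid, the zero mean, and all function names (`exp_cov`, `vecchia_loglik`, `exact_loglik`) are assumptions made for the example, and g(i) is taken to be the m nearest previously ordered points.

```python
import numpy as np

def exp_cov(a, b, range_=0.2):
    """Exponential covariance between 1-D location arrays (assumed for this sketch)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / range_)

def gauss_logpdf(x, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def vecchia_loglik(x, locs, m, cov=exp_cov):
    """Sum over i of log f(x_i | x_g(i)) for a zero-mean GP.

    g(i) holds the (at most) m nearest locations among those ordered before i.
    """
    total = 0.0
    for i in range(len(x)):
        mean, var = 0.0, cov(locs[i:i + 1], locs[i:i + 1])[0, 0]
        if i > 0 and m > 0:
            prev = np.arange(i)
            g = prev[np.argsort(np.abs(locs[prev] - locs[i]))[:m]]
            K_gg = cov(locs[g], locs[g])
            k_ig = cov(locs[i:i + 1], locs[g])[0]
            w = np.linalg.solve(K_gg, k_ig)   # kriging weights
            mean = w @ x[g]                    # conditional mean
            var = var - w @ k_ig               # conditional variance
        total += gauss_logpdf(x[i], mean, var)
    return total

def exact_loglik(x, locs, cov=exp_cov):
    """Exact zero-mean Gaussian log-likelihood via a dense Cholesky factor."""
    L = np.linalg.cholesky(cov(locs, locs))
    z = np.linalg.solve(L, x)
    return -0.5 * (z @ z) - np.log(np.diag(L)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
locs = np.linspace(0.0, 1.0, 100)              # ordered 1-D grid
x = np.linalg.cholesky(exp_cov(locs, locs)) @ rng.standard_normal(100)

for m in (0, 1, 5, len(x) - 1):
    print(m, vecchia_loglik(x, locs, m), exact_loglik(x, locs))
```

With m = N − 1 the factorization is exact. In this particular toy setting (exponential covariance on an ordered 1-D grid) the process is Markov, so m = 1 already reproduces the exact value, which is an extreme case of the screening effect. Each term costs O(m^3), so the loop is linear in N for fixed m.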










General Vecchia framework (Katzfuss and Guinness, 2017)

If screening is weak for data/responses, might be able to include unobserved (latent) variables in x for stronger screening effect.

Three choices for general Vecchia:

1. Which variables to include in x?

2. How to order the variables in x?

3. How to choose the conditioning sets g(i)?

Choices can result in tremendous differences in terms of approximation accuracy and computational speed.



Special cases

• Standard/response Vecchia: Noisy responses in x (e.g., Vecchia, 1988; Stein et al., 2004)

• Nearest-neighbor GP: Latent GP in x (Datta et al., 2016)

• Multi-resolution approximation (Katzfuss, 2017): Ordering/conditioning based on iterative domain partitioning

• Special cases: Modified predictive process (Finley et al., 2009), FSA-block (Snelson and Ghahramani, 2007; Sang et al., 2011)



Ordering and conditioning

[Figure: orderings of locations; panels: Coordinate, Max-min distance, MRA]

• Max-min distance ordering can be much more accurate than coordinate ordering (Guinness, 2018)

• Conditioning usually on m nearest previously ordered locations, but more complicated schemes possible
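
A greedy version of max-min distance ordering, followed by nearest-previous-neighbor conditioning, can be sketched as follows. This is a simple O(n^2) illustration under assumptions: the function names `maxmin_order` and `nearest_previous` are hypothetical, the starting point is chosen arbitrarily as index 0, and the faster algorithms discussed in Guinness (2018) are not reproduced here.

```python
import numpy as np

def maxmin_order(locs, first=0):
    """Greedy max-min ordering.

    Each next location maximizes its distance to the nearest location
    already in the ordering. `first` (the starting index) is an assumption.
    """
    order = [first]
    min_dist = np.linalg.norm(locs - locs[first], axis=1)
    min_dist[first] = -np.inf                  # never pick a point twice
    for _ in range(len(locs) - 1):
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(locs - locs[nxt], axis=1))
        min_dist[nxt] = -np.inf
    return np.array(order)

def nearest_previous(locs, order, m):
    """Conditioning sets g(i): positions (in the ordering) of up to m nearest earlier points."""
    pts = locs[order]
    sets = [np.array([], dtype=int)]           # first point conditions on nothing
    for i in range(1, len(pts)):
        d = np.linalg.norm(pts[:i] - pts[i], axis=1)
        sets.append(np.argsort(d)[:m])
    return sets

# 5 x 5 grid on the unit square
g = np.linspace(0.0, 1.0, 5)
locs = np.array([(a, b) for a in g for b in g])
order = maxmin_order(locs)
cond = nearest_previous(locs, order, m=4)
print(order[:5])
print(cond[5])
```

Early points in a max-min ordering are spread over the whole domain, so later points can condition on both distant and nearby neighbors.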


The general Vecchia framework Gaussian noise




Response Vecchia (Vecchia, 1988)

Vecchia approximation is applied to data/responses directly

Exact GP vs. response Vecchia with m = 4

Works well for data without noise

Works very poorly if data are noisy








General Vecchia (Katzfuss and Guinness, 2017)

x consists of noisy data and (unknown/latent) GP realizations


Works well even if data are noisy (m = 4)



Specific methods

• Sparse general Vecchia (SGV) approximations to the likelihood: x contains noisy data and latent GP realizations (Katzfuss and Guinness, 2017)

• SGV predictions: x contains noisy data, latent GP realizations, and GP at prediction locations (Katzfuss et al., 2018)

• Multi-level Vecchia: Automatically tailors different approximations to different scales (Zhang & Katzfuss, in prep.)

Crucial details:

• Include variables and order them such that strong screening effect holds

• Integrate out latent/nuisance variables

• Study sparsity and guarantee linear scaling using directed acyclic graph (DAG) results



Vecchia-Laplace approximations for generalized GPs




Non-Gaussian spatial data: Generalized GP (GGP)

Conditional on GP, data are independent from exponential family: binary, categorical, counts, . . .

Example: Binary classification using logistic GGP:

• Take GP function

• Transform into probability using logistic link, then draw from Bernoulli distribution
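
The two steps of the logistic example can be simulated directly. The sketch below is illustrative only: the 1-D grid, the exponential covariance with range 0.2, and the small jitter added for numerical stability are all assumptions, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: draw a GP function y at n locations (exponential covariance assumed)
s = np.linspace(0.0, 1.0, 100)
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)
y = np.linalg.cholesky(K + 1e-10 * np.eye(len(s))) @ rng.standard_normal(len(s))

# Step 2: logistic link gives probabilities, then Bernoulli draws give binary data
p = 1.0 / (1.0 + np.exp(-y))
z = rng.binomial(1, p)

print(z[:10], np.round(p[:10], 2))
```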










Log-Gaussian Cox process (LGCP) as a GGP

• Poisson process with random intensity function $\lambda(\cdot) = e^{y(\cdot)}$, where $y(\cdot) \sim GP(\mu, C)$

• Approximation as GGP with Poisson likelihood:
  • Partition domain into grid cells $A_1, \ldots, A_n$ with center points $a_1, \ldots, a_n$
  • Data $z_i = z(A_i)$: number of observed points in $A_i$
  • Then, $z_1, \ldots, z_n \mid y(\cdot) \overset{\text{ind.}}{\sim} \mathrm{Poisson}(\mu(A_i))$, where $\mu(A_i) \approx |A_i| \, e^{y(a_i)}$
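
The gridded Poisson approximation above can be sketched as a small simulation. Everything specific here is an assumption for illustration: the unit-square domain, the 20 × 20 grid, the exponential covariance, the constant GP mean `gp_mean`, and the jitter term.

```python
import numpy as np

rng = np.random.default_rng(2)

side, k = 1.0, 20                      # domain [0, side]^2 split into k x k cells (assumed)
h = side / k
c1 = (np.arange(k) + 0.5) * h          # cell-center coordinates along one axis
a = np.array([(u, v) for u in c1 for v in c1])   # center points a_1, ..., a_n
area = h * h                           # |A_i|, identical for all cells

# Latent GP y at the cell centers (exponential covariance assumed)
d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
C = np.exp(-d / 0.2)
gp_mean = np.log(500.0)                # constant mean on the log-intensity scale (assumed)
y = gp_mean + np.linalg.cholesky(C + 1e-8 * np.eye(k * k)) @ rng.standard_normal(k * k)

# Cell counts: z_i | y ~ Poisson(|A_i| * exp(y(a_i)))
cell_mean = area * np.exp(y)
z = rng.poisson(cell_mean)

print(z.reshape(k, k)[:3, :3], z.sum())
```

A finer grid makes the approximation of the cell mean by |A_i| e^{y(a_i)} more accurate but increases n, which is where a linear-complexity method becomes useful.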

[Figure: LGCP example; panels: Latent, Observed, Down-Sampled]



Laplace for non-Gaussian data

For generalized GP, posterior is intractable → 2nd-order Taylor expansion at the mode (Laplace approximation).

Laplace algorithm: Iterative GP prediction using Gaussian pseudo-data

But still O(n^3) → infeasible for large n
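
One way to read "iterative GP prediction using Gaussian pseudo-data" is as a Newton iteration in which each step is a Gaussian GP regression. The sketch below works this out for the logistic case with dense linear algebra (so it is still O(n^3)). It is a minimal illustration under assumptions: zero prior mean, an exponential covariance, undamped Newton steps started at zero, and hypothetical names such as `laplace_logistic`.

```python
import numpy as np

def laplace_logistic(z, K, max_iter=100, tol=1e-8):
    """Laplace approximation for a zero-mean GP with Bernoulli-logistic data.

    Each iteration forms pseudo-data t = y + D * grad with pseudo-noise
    variances D = 1 / W, then sets y to the GP posterior mean given t.
    """
    y = np.zeros(len(z))
    converged = False
    for n_iter in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-y))
        grad = z - p                      # d/dy log p(z | y)
        W = p * (1.0 - p)                 # -d^2/dy^2 log p(z | y)
        D = 1.0 / W                       # pseudo-noise variances
        t = y + D * grad                  # Gaussian pseudo-data
        y_new = K @ np.linalg.solve(K + np.diag(D), t)
        converged = np.max(np.abs(y_new - y)) < tol
        y = y_new
        if converged:
            break
    # Gaussian approximation of the posterior covariance at the mode
    p = 1.0 / (1.0 + np.exp(-y))
    D = 1.0 / (p * (1.0 - p))
    post_cov = K - K @ np.linalg.solve(K + np.diag(D), K)
    return y, post_cov, n_iter, converged

rng = np.random.default_rng(3)
s = np.linspace(0.0, 1.0, 60)
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)   # assumed covariance
y_true = np.linalg.cholesky(K) @ rng.standard_normal(len(s))
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-y_true)))

y_hat, post_cov, n_iter, converged = laplace_logistic(z, K)
print(n_iter, converged)
```

At the mode the gradient of the log posterior vanishes, which for this model means y = K (z − p). The dense solve in each iteration is the step that a Vecchia approximation applied to the pseudo-data would replace.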










Vecchia-Laplace (Katzfuss & Zilber, in prep.)

Based on pseudo-data t, apply Vecchia to $x = (t_1, \ldots, t_n, y_1, \ldots, y_n)$

Comparison of MSE relative to Laplace:



[Figure: MSE relative to Laplace versus conditioning set size (0 to 30); panels: ν = 0.5 (Logistic), ν = 0.5 (Poisson), ν = 0.5 (Gamma); methods: VL, Laplace, Low Rank]



Comparison for simulated LGCP

Matern covariance with range 2.5 and smoothness ν on domain D = [0, 50]^2, discretized to n = 2,500 = 50 × 50 grid

[Figure: RMSE versus m (0 to 60); panels: ν = 0.5, ν = 1.5; algorithms: Laplace, LowRank, VL]



[Figure: KL divergence versus m (0 to 60), vertical axis labeled LS; panels: ν = 0.5, ν = 1.5; algorithms: Laplace, LowRank, VL]


Vecchia-Laplace algorithm

• Complexity is linear in n

• Typically converges in < 10 iterations

• Unknown hyperparameters: Approximate the integrated likelihood

• Predictions at unobserved locations


Conclusions


• General Vecchia framework for GP approximations:
  • Accurate
  • Choice of x, ordering, and conditioning all important
  • Can guarantee linear scalability
  • Extension to GGPs and LGCPs

• R package: https://github.com/katzfuss-group/GPvecchia

• Papers:
  • Katzfuss, M. and Guinness, J. (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv:1708.06302.
  • Katzfuss, M., Guinness, J., and Gong, W. (2018). Vecchia approximations of Gaussian-process predictions. arXiv:1805.03309.
  • Zilber, D. and Katzfuss, M. (in prep.) A Vecchia-Laplace approximation for generalized Gaussian processes.

• Partially supported by NSF Grants DMS–1654083 and DMS–1521676

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(4):825–848.

Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(1):209–226.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53(8):2873–2884.

Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523.

Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics.

Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.

Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.

Katzfuss, M. and Guinness, J. (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv:1708.06302.

Katzfuss, M., Guinness, J., and Gong, W. (2018). Vecchia approximations of Gaussian-process predictions. arXiv:1805.03309.

Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555.

Lindgren, F., Rue, H., and Lindstrom, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73(4):423–498.

Nychka, D. W., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. R. (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. Journal of Computational and Graphical Statistics, 24(2):579–599.

Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press.

Sang, H., Jun, M., and Huang, J. Z. (2011). Covariance approximation for large multivariate spatial datasets with an application to multiple climate model errors. Annals of Applied Statistics, 5(4):2519–2548.

Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics 11 (AISTATS).

Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B, 66(2):275–296.

Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50(2):297–312.

Wikle, C. K. and Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86(4):815–829.

