
Gaussian Processes for Nonlinear Regression and Nonlinear Dimensionality Reduction

Piyush Rai, IIT Kanpur

Probabilistic Machine Learning (CS772A)

Feb 10, 2016


Gaussian Process

A Gaussian Process (GP) is a distribution over functions

A random draw from a GP thus gives a function f

f ∼ GP(µ, κ)

where µ is the mean function and κ is the covariance/kernel function (the covariance function controls f's shape/smoothness)

Note: µ and κ can be chosen or learned from data

A GP can thus be used as a nonparametric prior distribution over such functions


Gaussian Process

A function f is said to be drawn from GP(µ, κ) if

$$
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \mu(x_1) \\ \mu(x_2) \\ \vdots \\ \mu(x_N) \end{bmatrix},\;
\begin{bmatrix}
\kappa(x_1, x_1) & \dots & \kappa(x_1, x_N) \\
\kappa(x_2, x_1) & \dots & \kappa(x_2, x_N) \\
\vdots & \ddots & \vdots \\
\kappa(x_N, x_1) & \dots & \kappa(x_N, x_N)
\end{bmatrix}
\right)
$$

Thus, if f is drawn from a GP then the joint distribution of f's evaluations at a finite set of points $\{x_1, x_2, \dots, x_N\}$ is a multivariate normal


Gaussian Process

Let’s define

$$
\mathbf{f} = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_N) \end{bmatrix}, \quad
\boldsymbol{\mu} = \begin{bmatrix} \mu(x_1) \\ \mu(x_2) \\ \vdots \\ \mu(x_N) \end{bmatrix}, \quad
\mathbf{K} = \begin{bmatrix}
\kappa(x_1, x_1) & \dots & \kappa(x_1, x_N) \\
\kappa(x_2, x_1) & \dots & \kappa(x_2, x_N) \\
\vdots & \ddots & \vdots \\
\kappa(x_N, x_1) & \dots & \kappa(x_N, x_N)
\end{bmatrix}
$$

Note: K is also called the kernel matrix, with $K_{nm} = \kappa(x_n, x_m)$

Thus we have $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$

Often, we assume the mean function to be zero. Thus $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$

Covariance/kernel function κ measures similarity between two inputs

$\kappa(x_n, x_m) = \exp\left(-\frac{\|x_n - x_m\|^2}{\gamma}\right)$: RBF kernel

$\kappa(x_n, x_m) = v_0 \exp\left\{-\left(\frac{|x_n - x_m|}{r}\right)^{\alpha}\right\} + v_1 + v_2\,\delta_{nm}$

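To make the finite-dimensional view above concrete, here is a small NumPy sketch (not from the slides): it builds the kernel matrix K on a grid of inputs using the RBF kernel defined above and draws functions f ∼ N(0, K). The bandwidth γ = 1.0, the input grid, and the diagonal jitter are illustrative choices.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # kappa(x, x') = exp(-||x - x'||^2 / gamma), the RBF kernel from this slide
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / gamma)

# Evaluate the GP prior at a grid of 1-D inputs
X = np.linspace(-5, 5, 100)[:, None]
K = rbf_kernel(X, X)

# A draw from the GP, restricted to these inputs, is a draw from N(0, K)
# (a small jitter on the diagonal keeps K numerically positive definite)
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)), size=3)

Each row of f_samples is one random function evaluated on the grid; smoother or wigglier draws come from changing γ.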

Kernel Functions

Covariance/kernel function κ measures similarity between two inputs

Corresponds to implicitly mapping data to a higher-dimensional space via a feature mapping φ (x → φ(x)) and computing the dot product in that space

$$\kappa(x_n, x_m) = \phi(x_n)^\top \phi(x_m)$$

Popularly known as the kernel trick (used in kernel methods for nonlinear regression/classification/clustering/dimensionality reduction, etc.)

Allows extending linear models to nonlinear problems

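As a small check of this identity (an illustrative sketch, not part of the slides): for the degree-2 polynomial kernel κ(x, x') = (xᵀx')² on 2-D inputs, an explicit feature map φ can be written down, and the kernel value equals the dot product of the mapped vectors. The names poly2_kernel and poly2_features are made up for this example.

import numpy as np

def poly2_kernel(x, z):
    # kappa(x, z) = (x^T z)^2
    return float(x @ z) ** 2

def poly2_features(x):
    # Explicit feature map for the degree-2 polynomial kernel on 2-D inputs:
    # phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2], so phi(x)^T phi(z) = (x^T z)^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
assert np.isclose(poly2_kernel(x, z), poly2_features(x) @ poly2_features(z))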

Today’s Plan

Gaussian Processes for two problems

Nonlinear Regression: Gaussian Process Regression

Nonlinear Dimensionality Reduction: Gaussian Process Latent Variable Models (GPLVM)


Gaussian Process Regression


Gaussian Process Regression

Training data $\mathcal{D}$: $\{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^D$, $y_n \in \mathbb{R}$

Assume the responses to be a noisy function of the inputs

yn = f (xn) + εn = fn + εn

Don’t a priori know the form of f (linear/polynomial/something else?)

Want to learn f with error bars

We'll put a GP prior on f and use Bayes' rule to get the posterior on f

$$p(f|\mathcal{D}) = \frac{p(f)\, p(\mathcal{D}|f)}{p(\mathcal{D})}$$


Gaussian Process Regression

Training data: $\{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^D$, $y_n \in \mathbb{R}$

Assume the responses to be a noisy function of the inputs

yn = f (xn) + εn = fn + εn

Assume a zero-mean Gaussian error: εn ∼ N (εn|0, σ2)

Thus the likelihood model

p(yn|fn) = N (yn|fn, σ2)

For N i.i.d. responses, the joint likelihood can be written as

p(y |f) = N (y |f, σ2IN)

We will assume a zero mean Gaussian Process prior on f , which means:

p(f) = N (f|0,K)


Gaussian Process Regression

The likelihood model: $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{f}, \sigma^2 \mathbf{I}_N)$

The prior distribution: $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}\,|\,\mathbf{0}, \mathbf{K})$

Note: We don't actually need to compute the posterior p(f|y) here

The marginal distribution of the training data responses y

$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f})\, d\mathbf{f} = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, \mathbf{K} + \sigma^2 \mathbf{I}_N) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, \mathbf{C}_N)$$

What will be the prediction y∗ for a new test example x∗?

Well, we know that the marginal distribution of y∗ will be

p(y∗) = N (y∗|0, κ(x∗, x∗) + σ2)

But what we actually want is the predictive distribution p(y∗|y)

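A quick way to see what this marginal says (a sketch with illustrative values, not from the slides): responses drawn from N(0, K + σ²I_N) look like a smooth random function corrupted by observation noise.

import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(-5, 5, 50)[:, None]
sigma2 = 0.1                                      # noise variance (illustrative)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 1.0)                       # RBF kernel matrix, gamma = 1
C_N = K + sigma2 * np.eye(len(X))                 # C_N = K + sigma^2 I_N

# y ~ N(0, C_N): a draw of noisy training responses under the GP regression model
y = rng.multivariate_normal(np.zeros(len(X)), C_N)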

Making Predictions

Let’s consider the joint distr. of N training responses y and test response y∗

$$
p\left(\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \,\middle|\, \begin{bmatrix} \mathbf{0} \\ 0 \end{bmatrix}, \mathbf{C}_{N+1}\right)
$$

where the (N + 1) × (N + 1) matrix $\mathbf{C}_{N+1}$ is given by

$$
\mathbf{C}_{N+1} = \begin{bmatrix} \mathbf{C}_N & \mathbf{k}_* \\ \mathbf{k}_*^\top & c \end{bmatrix}
$$

and $\mathbf{k}_* = [k(x_*, x_1), \dots, k(x_*, x_N)]^\top$, $c = k(x_*, x_*) + \sigma^2$


Making Predictions

Given the jointly Gaussian distribution

$$
p\left(\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \,\middle|\, \begin{bmatrix} \mathbf{0} \\ 0 \end{bmatrix}, \begin{bmatrix} \mathbf{C}_N & \mathbf{k}_* \\ \mathbf{k}_*^\top & c \end{bmatrix}\right)
$$

The predictive distribution will be

$$p(y_*|\mathbf{y}) = \mathcal{N}(y_*\,|\,\mu_*, \sigma^2_*)$$
$$\mu_* = \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{y}$$
$$\sigma^2_* = k(x_*, x_*) + \sigma^2 - \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{k}_*$$

Follows readily from a property of Gaussians (lecture 2 and PRML 2.94-2.96)

Note: Instead of explicitly inverting, often a Cholesky decomposition $\mathbf{C}_N = \mathbf{L}\mathbf{L}^\top$ is used (for better numerical stability)

Test time cost is O(N): linear in the number of training examples (just like kernel SVM or nearest neighbor methods)

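A minimal NumPy/SciPy sketch of these prediction equations, using the Cholesky factorization C_N = LLᵀ mentioned above instead of an explicit inverse. The function name gp_predict and the kernel/noise arguments are illustrative, not the slides' code.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X_train, y_train, X_test, kernel, sigma2):
    # C_N = K + sigma^2 I_N, factorized as L L^T for numerical stability
    K = kernel(X_train, X_train)
    C_N = K + sigma2 * np.eye(len(X_train))
    L, lower = cho_factor(C_N, lower=True)

    K_star = kernel(X_train, X_test)               # columns are k_* for each test point
    k_star_star = np.diag(kernel(X_test, X_test))  # kappa(x_*, x_*)

    alpha = cho_solve((L, lower), y_train)         # alpha = C_N^{-1} y
    mu_star = K_star.T @ alpha                     # mu_* = k_*^T C_N^{-1} y
    V = cho_solve((L, lower), K_star)              # C_N^{-1} k_* for each test point
    var_star = k_star_star + sigma2 - np.einsum('ij,ij->j', K_star, V)
    return mu_star, var_star                       # sigma^2_* per test point

With the rbf_kernel sketch from earlier, gp_predict(X, y, X_new, rbf_kernel, 0.1) would return the predictive mean and variance at each test input.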

GP Regression: Pictorially

A GP with squared-exponential kernel function

Shaded area denotes twice the standard deviation at each input

Picture courtesy: https://pythonhosted.org/infpy/gps.html

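A figure of this kind could be reproduced from the prediction sketch above; the snippet below assumes the illustrative X, y, sigma2, rbf_kernel and gp_predict names from the earlier sketches and simply shades the mean plus/minus two standard deviations.

import numpy as np
import matplotlib.pyplot as plt

X_test = np.linspace(-5, 5, 200)[:, None]
mu, var = gp_predict(X, y, X_test, rbf_kernel, sigma2)
std = np.sqrt(var)

plt.plot(X, y, 'kx', label='training data')
plt.plot(X_test, mu, 'b-', label='predictive mean')
plt.fill_between(X_test.ravel(), mu - 2 * std, mu + 2 * std, alpha=0.3, label='mean +/- 2 std')
plt.legend()
plt.show()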

Interpreting GP predictions..

Let’s look at the predictions made by GP regression

$$p(y_*|\mathbf{y}) = \mathcal{N}(y_*\,|\,\mu_*, \sigma^2_*), \qquad \mu_* = \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{y}, \qquad \sigma^2_* = k(x_*, x_*) + \sigma^2 - \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{k}_*$$

Two interpretations for the mean prediction µ∗

An SVM-like interpretation

$$\mu_* = \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{y} = \mathbf{k}_*^\top \boldsymbol{\alpha} = \sum_{n=1}^{N} k(x_*, x_n)\, \alpha_n$$

where $\boldsymbol{\alpha} = \mathbf{C}_N^{-1}\mathbf{y}$ is akin to the weights of support vectors

A nearest neighbors interpretation

$$\mu_* = \mathbf{k}_*^\top \mathbf{C}_N^{-1} \mathbf{y} = \mathbf{w}^\top \mathbf{y} = \sum_{n=1}^{N} w_n\, y_n$$

where $\mathbf{w} = \mathbf{C}_N^{-1}\mathbf{k}_*$ is akin to the weights of the neighbors

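Both readings give the same number, just accumulated in a different order; a small self-contained sketch (all names and values illustrative) makes this concrete:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))                 # 10 training inputs
y = np.sin(X).ravel() + 0.1 * rng.normal(size=10)    # noisy responses
x_star = np.array([[0.3]])                           # one test input

def rbf(A, B, gamma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / gamma)

C_N = rbf(X, X) + 0.01 * np.eye(len(X))              # K + sigma^2 I_N
k_star = rbf(X, x_star).ravel()
C_N_inv = np.linalg.inv(C_N)                         # explicit inverse, only for illustration

alpha = C_N_inv @ y        # "support vector"-style weights, one per training point
w = C_N_inv @ k_star       # "nearest neighbour"-style weights on the y_n
assert np.isclose(k_star @ alpha, w @ y)             # the same mu_*, computed two ways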

Inferring Hyperparameters

There are two sets of hyperparameters in GP regression models

Variance of the Gaussian noise σ2

Hyperparameters θ of the covariance function κ, e.g.,

$$\kappa(x_n, x_m) = \exp\left(-\frac{\|x_n - x_m\|^2}{\gamma}\right) \quad \text{(RBF kernel)}$$

$$\kappa(x_n, x_m) = \exp\left(-\sum_{d=1}^{D} \frac{(x_{nd} - x_{md})^2}{\gamma_d}\right) \quad \text{(ARD kernel)}$$

These can be learned from data by maximizing the marginal likelihood

p(y |σ2, θ) = N (y |0, σ2IN + Kθ)

Can maximize the (log) marginal likelihood w.r.t. σ² and the kernel hyperparameters θ and get point estimates of the hyperparameters

$$\log p(\mathbf{y}|\sigma^2, \theta) = -\frac{1}{2}\log|\sigma^2\mathbf{I}_N + \mathbf{K}_\theta| - \frac{1}{2}\mathbf{y}^\top(\sigma^2\mathbf{I}_N + \mathbf{K}_\theta)^{-1}\mathbf{y} + \text{const}$$

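This objective is easy to write down directly; a sketch (illustrative, using NumPy's slogdet for the log-determinant):

import numpy as np

def log_marginal_likelihood(y, K_theta, sigma2):
    # log p(y | sigma^2, theta) = -1/2 log|sigma^2 I + K_theta|
    #                             - 1/2 y^T (sigma^2 I + K_theta)^{-1} y + const
    N = len(y)
    K_y = sigma2 * np.eye(N) + K_theta
    sign, logdet = np.linalg.slogdet(K_y)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(K_y, y) - 0.5 * N * np.log(2 * np.pi)

Evaluating this for different σ² and kernel hyperparameters (and maximizing it) gives the point estimates mentioned above.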

Inferring Hyperparameters

The (log) marginal likelihood

$$\log p(\mathbf{y}|\sigma^2, \theta) = -\frac{1}{2}\log|\sigma^2\mathbf{I}_N + \mathbf{K}_\theta| - \frac{1}{2}\mathbf{y}^\top(\sigma^2\mathbf{I}_N + \mathbf{K}_\theta)^{-1}\mathbf{y} + \text{const}$$

Defining $\mathbf{K}_y = \sigma^2\mathbf{I}_N + \mathbf{K}_\theta$ and taking the derivative w.r.t. kernel hyperparams θ

$$\frac{\partial}{\partial \theta_j}\log p(\mathbf{y}|\sigma^2, \theta) = -\frac{1}{2}\mathrm{tr}\left(\mathbf{K}_y^{-1}\frac{\partial \mathbf{K}_y}{\partial \theta_j}\right) + \frac{1}{2}\mathbf{y}^\top\mathbf{K}_y^{-1}\frac{\partial \mathbf{K}_y}{\partial \theta_j}\mathbf{K}_y^{-1}\mathbf{y} = \frac{1}{2}\mathrm{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1})\frac{\partial \mathbf{K}_y}{\partial \theta_j}\right)$$

where $\theta_j$ is the $j$-th hyperparameter of the kernel, and $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$

No closed-form solution for $\theta_j$. Gradient-based methods can be used.

Note: Computing $\mathbf{K}_y^{-1}$ itself takes O(N³) time (faster approximations exist though). Then each gradient computation takes O(N²) time

The form of $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$ depends on the covariance/kernel function κ

The noise variance σ² can also be estimated likewise

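For the RBF kernel above, ∂K_y/∂γ happens to have a simple closed form (the derivative of exp(−d²/γ) w.r.t. γ is exp(−d²/γ)·d²/γ²), so the trace expression can be coded directly. This is a sketch under that particular kernel choice, not a general-purpose implementation:

import numpy as np

def rbf_gradient_wrt_gamma(X, y, sigma2, gamma):
    # Squared distances, kernel matrix, and K_y = sigma^2 I + K
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / gamma)
    K_y = sigma2 * np.eye(len(X)) + K

    dK = K * D2 / gamma**2                # dK_y/dgamma for the RBF kernel
    K_y_inv = np.linalg.inv(K_y)          # O(N^3), as noted above
    alpha = K_y_inv @ y                   # alpha = K_y^{-1} y

    # (1/2) tr( (alpha alpha^T - K_y^{-1}) dK_y/dgamma )
    return 0.5 * np.trace((np.outer(alpha, alpha) - K_y_inv) @ dK)

Handing this gradient (and an analogous one for σ²) to any gradient-based optimizer gives the point estimates discussed here.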

Gaussian Processes with GLMs

GP regression is only one example of supervised learning with GP

GP can be combined with other types of likelihood functions to handle other types of responses (e.g., binary, categorical, counts, etc.) by replacing the Gaussian likelihood for the responses with a generalized linear model

Inference, however, becomes more tricky because such likelihoods may no longer be conjugate to the GP prior. Approximate inference is needed in such cases.

We will revisit one such example (GP for binary classification) later during the semester


GP vs (Kernel) SVM

The objective function of a soft-margin SVM looks like

$$\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}(1 - y_n f_n)_+$$

where $f_n = \mathbf{w}^\top x_n$ and $y_n$ is the true label for $x_n$

Kernel SVM: $f_n = \sum_{m=1}^{N} \alpha_m k(x_n, x_m)$. Denote $\mathbf{f} = [f_1, \dots, f_N]^\top$

We can write $\|\mathbf{w}\|_2^2 = \boldsymbol{\alpha}^\top\mathbf{K}\boldsymbol{\alpha} = \mathbf{f}^\top\mathbf{K}^{-1}\mathbf{f}$, and the kernel SVM objective becomes

$$\frac{1}{2}\mathbf{f}^\top\mathbf{K}^{-1}\mathbf{f} + C\sum_{n=1}^{N}(1 - y_n f_n)_+$$

The negative log-posterior $-\log\,[p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f})]$ of a GP model can be written as

$$\frac{1}{2}\mathbf{f}^\top\mathbf{K}^{-1}\mathbf{f} - \sum_{n=1}^{N}\log p(y_n|f_n) + \text{const}$$

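The step ‖w‖² = αᵀKα used in this rewriting can be verified numerically whenever an explicit feature map is available; the sketch below reuses the degree-2 polynomial map from the kernel-trick example earlier (all names and values illustrative):

import numpy as np

def phi(x):
    # Explicit feature map for kappa(x, z) = (x^T z)^2 on 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # 5 points in 2-D
alpha = rng.normal(size=5)                   # arbitrary dual weights

Phi = np.array([phi(x) for x in X])          # rows are phi(x_n)
K = (X @ X.T) ** 2                           # kernel matrix, K_nm = (x_n^T x_m)^2

w = Phi.T @ alpha                            # w = sum_n alpha_n phi(x_n)
assert np.isclose(w @ w, alpha @ K @ alpha)  # ||w||^2 = alpha^T K alpha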

GP vs (Kernel) SVM

Thus GPs can be interpreted as a Bayesian analogue of kernel SVMs

Both GP and SVM need to deal with (storing/inverting) large kernel matrices

Various approximations proposed to address this issue (applicable to both)

Ability to learn the kernel hyperparameters in GP is very useful, e.g.,

Learning the kernel bandwidth for Gaussian kernels

$$k(x_n, x_m) = \exp\left(-\frac{\|x_n - x_m\|^2}{\gamma}\right)$$

Doing feature selection (via Automatic Relevance Determination)

$$k(x_n, x_m) = \exp\left(-\sum_{d=1}^{D}\frac{(x_{nd} - x_{md})^2}{\gamma_d}\right)$$

Learning compositions of kernels for more flexible modeling

$$\mathbf{K} = \mathbf{K}_{\theta_1} + \mathbf{K}_{\theta_2} + \dots$$

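Composing kernels additively just means summing the corresponding kernel matrices (the sum of valid covariance matrices is again a valid covariance matrix); a small sketch with an RBF plus a linear kernel, hyperparameter values illustrative:

import numpy as np

def rbf_kernel(X1, X2, gamma=2.0):
    return np.exp(-((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1) / gamma)

def linear_kernel(X1, X2, v=0.5):
    return v * (X1 @ X2.T)

X = np.random.default_rng(2).normal(size=(20, 3))
# K = K_theta1 + K_theta2: gamma and v could then be learned jointly
# by maximizing the marginal likelihood, as described above
K_combined = rbf_kernel(X, X) + linear_kernel(X, X)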

Nonlinear Dimensionality Reduction using Gaussian Process (GPLVM)


Why Nonlinear Dimensionality Reduction?

Embeddings learned by PCA (left: original data, right: PCA)

Why doesn't PCA work in such cases?

Uses Euclidean distances; learns linear projections

Embeddings learned by nonlinear dim. red. (left: LLE, right: ISOMAP)


Recap: Probabilistic PCA

Given: N × D data matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]^\top$, with $\mathbf{x}_n \in \mathbb{R}^D$

Goal: Find a lower-dim. rep., an N × K matrix $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_N]^\top$, with $\mathbf{z}_n \in \mathbb{R}^K$

Assume the following generative model for each observation xn

xn = Wzn + εn with W ∈ RD×K , εn ∼ N (0, σ2)

The conditional distribution

p(xn|zn,W, σ2) = N (Wzn, σ2ID)

Assume a Gaussian prior on zn: p(zn) = N (0, IK )

The marginal distribution of xn (after integrating out latent variables zn)

$$p(\mathbf{x}_n|\mathbf{W}, \sigma^2) = \mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}_D)$$

$$p(\mathbf{X}|\mathbf{W}, \sigma^2) = \prod_{n=1}^{N} p(\mathbf{x}_n|\mathbf{W}, \sigma^2)$$

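A small generative sketch of this model (dimensions, sample size and noise level are illustrative): drawing zₙ and εₙ and forming xₙ = Wzₙ + εₙ, the sample covariance of the xₙ approaches the marginal covariance WWᵀ + σ²I_D.

import numpy as np

rng = np.random.default_rng(0)
D, K, N, sigma2 = 5, 2, 100_000, 0.1
W = rng.normal(size=(D, K))

Z = rng.normal(size=(N, K))                     # z_n ~ N(0, I_K)
E = np.sqrt(sigma2) * rng.normal(size=(N, D))   # eps_n ~ N(0, sigma^2 I_D)
X = Z @ W.T + E                                 # x_n = W z_n + eps_n

emp_cov = (X.T @ X) / N                         # sample covariance (mean is zero)
model_cov = W @ W.T + sigma2 * np.eye(D)        # W W^T + sigma^2 I_D
print(np.max(np.abs(emp_cov - model_cov)))      # small for large N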

Gaussian Process Latent Variable Model (GPLVM)

Consider the same model

xn = Wzn + εn with W ∈ RD×K, εn ∼ N (0, σ2ID)

Assume a prior p(W) = ∏_{d=1}^{D} N (wd|0, IK), where wd is the d-th row of W

Suppose we integrate out W instead of zn (treating the zn's as "parameters")

p(X|Z, σ2) = ∏_{d=1}^{D} N (X:,d|0, ZZ> + σ2IN) = (2π)^{−DN/2} |Kz|^{−D/2} exp(−(1/2) tr(Kz^{−1} XX>))

where Kz = ZZ> + σ2IN and X:,d is the d-th column of the N × D data matrix X

Note that we can think of each X:,d as being modeled by a GP regression model with the latent points Z as inputs:

X:,d ∼ N (0, ZZ> + σ2IN)

There are a total of D such GPs (one for each column of X)
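As a quick sanity check of the identity above, the sketch below (toy sizes, seed, and variable names are assumptions for illustration) evaluates p(X|Z, σ2) both as a product of D column-wise N-dimensional Gaussians and via the determinant/trace form; the two values should agree up to numerical error.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, D, K = 30, 4, 2          # toy sizes (assumed for illustration)
sigma2 = 0.5

Z = rng.normal(size=(N, K))
X = rng.normal(size=(N, D))  # any data; we only compare the two formulas

Kz = Z @ Z.T + sigma2 * np.eye(N)   # linear kernel over Z, plus noise

# Form 1: product over the D columns of X, each an N-dimensional Gaussian
form1 = multivariate_normal.logpdf(X.T, mean=np.zeros(N), cov=Kz).sum()

# Form 2: -DN/2 log(2*pi) - D/2 log|Kz| - 1/2 tr(Kz^{-1} X X^T)
sign, logdet = np.linalg.slogdet(Kz)
form2 = (-0.5 * D * N * np.log(2 * np.pi) - 0.5 * D * logdet
         - 0.5 * np.trace(np.linalg.solve(Kz, X @ X.T)))

print(form1, form2)   # should match up to numerical error
```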


GPLVM

p(X|Z, σ2) is now a product of D GPs (one per column of data matrix X)

p(X|Z, σ2) = ∏_{d=1}^{D} N (X:,d|0, ZZ> + σ2IN) = (2π)^{−DN/2} |Kz|^{−D/2} exp(−(1/2) tr(Kz^{−1} XX>))

Using Kz = ZZ> + σ2I and doing MLE will give the same solution for Z as linear PCA (note that ZZ> is a linear kernel over Z, the low-dim rep. of the data)

But using Kz = K + σ2I (with K being some appropriately defined kernel matrix over Z) gives nonlinear dimensionality reduction
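The only thing that changes between the linear and nonlinear cases is how Kz is constructed from Z. A minimal sketch of the two choices follows (the RBF kernel and its lengthscale are just one illustrative choice of "appropriately defined kernel", not something the slides prescribe):

```python
import numpy as np

def linear_Kz(Z, sigma2):
    # Kz = Z Z^T + sigma2 I  -> MLE for Z recovers linear PCA
    return Z @ Z.T + sigma2 * np.eye(len(Z))

def rbf_Kz(Z, sigma2, lengthscale=1.0):
    # Kz = K + sigma2 I, with K an RBF kernel over the latent points Z
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2) + sigma2 * np.eye(len(Z))
```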


MLE for GPLVM

Log-likelihood is given by

L = −(D/2) log |Kz| − (1/2) tr(Kz^{−1} XX>)

where Kz = K + σ2I and K denotes the kernel matrix of our low-dim rep. Z

The goal is to estimate the N × K matrix Z

Can't find a closed-form estimate of Z; need to use gradient-based methods, with the gradient given by

∂L/∂Znk = (∂L/∂Kz) (∂Kz/∂Znk)

where ∂L/∂Kz = Kz^{−1} XX> Kz^{−1} − D Kz^{−1} and ∂Kz/∂Znk will depend on the kernel function used (note: the hyperparameters of the kernel can also be learned, just as we did in the GP regression case)

Can also impose a prior on Z and do MAP (or fully Bayesian) estimation
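For a concrete illustration of the gradient-based estimation, the sketch below maximizes L over Z assuming the linear kernel Kz = ZZ> + σ2I, for which the gradient with respect to Z can be written in closed form as (Kz^{−1}XX>Kz^{−1} − DKz^{−1})Z; the toy data, sizes, fixed σ2 (not learned here), and random initialization are assumptions for this example, and a general kernel would need its own ∂Kz/∂Znk.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, D, K = 50, 5, 2          # toy sizes (assumed for illustration)
sigma2 = 0.1                # noise variance, held fixed in this sketch

# Toy data with genuine low-dimensional structure
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.1 * rng.normal(size=(N, D))
S = X @ X.T

def neg_L_and_grad(z_flat):
    Z = z_flat.reshape(N, K)
    Kz = Z @ Z.T + sigma2 * np.eye(N)        # linear kernel over Z
    Kz_inv = np.linalg.inv(Kz)
    _, logdet = np.linalg.slogdet(Kz)
    L = -0.5 * D * logdet - 0.5 * np.trace(Kz_inv @ S)
    # dL/dZ for the linear kernel: (Kz^{-1} X X^T Kz^{-1} - D Kz^{-1}) Z
    grad_Z = (Kz_inv @ S @ Kz_inv - D * Kz_inv) @ Z
    return -L, -grad_Z.ravel()               # minimize the negative log-likelihood

Z0 = rng.normal(size=(N, K)).ravel()         # random initialization
res = minimize(neg_L_and_grad, Z0, jac=True, method="L-BFGS-B")
Z_hat = res.x.reshape(N, K)
print("final log-likelihood:", -res.fun)
```

With the linear kernel this recovers a PCA-like solution (up to rotation/scaling of the latent space), matching the earlier remark; swapping in a nonlinear kernel changes only Kz and its derivative ∂Kz/∂Znk.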


Resources on Gaussian Processes

Book: Gaussian Processes for Machine Learning (freely available online)

MATLAB Packages: Useful to play with, build applications, and extend existing models and inference algorithms for GPs (both regression and classification)

GPML: http://www.gaussianprocess.org/gpml/code/matlab/doc/

GPStuff: http://research.cs.aalto.fi/pml/software/gpstuff/

GPLVM: https://github.com/lawrennd/gplvm
