CS 59000 Statistical Machine Learning, Lecture 15



Yuan (Alan) Qi, Purdue CS

Oct. 21, 2008

Outline

• Review of Gaussian Processes (GPs)
• From linear regression to GP
• GP for regression
• Learning hyperparameters
• Automatic Relevance Determination
• GP for classification

Gaussian Processes

How do kernels arise naturally in a Bayesian setting?

Instead of assigning a prior to the parameters w, we assign a prior directly to the function values y. In theory this is a prior over an infinite-dimensional space of functions; in practice we only ever work with a finite space: the function values at the finite set of training and test points.

Linear Regression Revisited

Let y(x) = w^T φ(x), with a Gaussian prior on the weights: p(w) = N(w | 0, α^{-1} I).

We have y = Φw, where Φ is the design matrix with elements Φ_{nk} = φ_k(x_n), so the vector of function values y is itself Gaussian.

From Prior on Parameters to Prior on Function Values

The prior on the function values: p(y) = N(y | 0, K), where the Gram matrix K has elements K_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m).
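Since y = Φw is a linear transformation of a zero-mean Gaussian, the mean and covariance of this prior follow directly; a short worked derivation, consistent with the definitions above:

\[ E[y] = \Phi\, E[w] = 0, \qquad \mathrm{cov}[y] = E[y y^\top] = \Phi\, E[w w^\top]\, \Phi^\top = \frac{1}{\alpha} \Phi \Phi^\top = K \]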

Stochastic Process

A stochastic process is specified by giving the joint distribution over any finite set of function values, in a consistent manner. (Loosely speaking, consistency means that marginalizing a joint distribution down to a subset of the variables gives the same distribution as the one defined directly on that subset.)
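For a Gaussian process this consistency holds automatically, because marginals of a Gaussian are Gaussian; concretely, for any partition of the function values:

\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim \mathcal{N}\!\left( 0, \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} \right) \;\Longrightarrow\; y_1 \sim \mathcal{N}(0, K_{11}) \]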

Gaussian Processes

The joint distribution of any finite set of function values is a multivariate Gaussian distribution.

Without any prior knowledge, we often set the mean to zero. The GP is then fully specified by its covariance: E[y(x_n) y(x_m)] = k(x_n, x_m).

Impact of Kernel Function

Covariance matrix: given by the kernel function evaluated at each pair of data points.
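One concrete family is the exponentiated-quadratic kernel with constant and linear terms, used in the PRML chapter this lecture appears to follow; varying its parameters changes the smoothness, offset, and trend of sampled functions:

\[ k(x_n, x_m) = \theta_0 \exp\!\left( -\frac{\theta_1}{2} \lVert x_n - x_m \rVert^2 \right) + \theta_2 + \theta_3\, x_n^\top x_m \]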

Application: economics & finance

Gaussian Process for Regression

Observations are noisy, t_n = y_n + ε_n, with Gaussian noise of precision β.

Likelihood: p(t | y) = N(t | y, β^{-1} I_N)

Prior: p(y) = N(y | 0, K)

Marginal distribution: p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C), where C = K + β^{-1} I_N
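The two covariances simply add because the function values y and the noise are independent Gaussian sources; element-wise:

\[ C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm} \]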

Samples of Data Points

Predictive Distribution

p(t_{N+1} | t_N) is a Gaussian distribution with mean and variance:

m(x_{N+1}) = k^T C_N^{-1} t

σ²(x_{N+1}) = c − k^T C_N^{-1} k

where the vector k has elements k_n = k(x_n, x_{N+1}) and c = k(x_{N+1}, x_{N+1}) + β^{-1}.
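These are the standard conditional-Gaussian formulas applied to the partitioned covariance of the joint distribution over (t_1, ..., t_{N+1}):

\[ C_{N+1} = \begin{pmatrix} C_N & k \\ k^\top & c \end{pmatrix} \;\Longrightarrow\; p(t_{N+1} \mid t_N) = \mathcal{N}\!\left( t_{N+1} \,\middle|\, k^\top C_N^{-1} t,\; c - k^\top C_N^{-1} k \right) \]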

Predictive Mean

The predictive mean can be written as m(x_{N+1}) = Σ_n a_n k(x_n, x_{N+1}), where a_n is the nth component of C_N^{-1} t. We see the same form as in kernel ridge regression and kernel PCA.

GP Regression

Discussion: what is the difference between GP regression and Bayesian regression with Gaussian basis functions?

Computational Complexity

GP prediction for a new data point:

• GP: O(N³), where N is the number of data points
• Basis-function model: O(M³), where M is the dimension of the feature expansion
• When N is large, GP prediction is computationally expensive
• Sparsification: make predictions based on only a few data points (essentially, make N small)
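A minimal NumPy sketch (not from the lecture) of GP-regression prediction; the kernel parameters and noise precision beta are illustrative choices. The Cholesky factorization of C_N is the O(N³) step that sparsification tries to avoid:

    import numpy as np

    def rbf_kernel(X1, X2, length_scale=0.3, variance=1.0):
        # Exponentiated-quadratic kernel matrix between two point sets.
        sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
        return variance * np.exp(-0.5 * sq / length_scale**2)

    def gp_predict(X, t, X_star, beta=25.0):
        # Predictive mean m = k^T C_N^{-1} t and variance s2 = c - k^T C_N^{-1} k.
        N = X.shape[0]
        C_N = rbf_kernel(X, X) + np.eye(N) / beta          # C_N = K + (1/beta) I
        k = rbf_kernel(X, X_star)                          # k_n = k(x_n, x_star)
        c = rbf_kernel(X_star, X_star).diagonal() + 1.0 / beta
        L = np.linalg.cholesky(C_N)                        # the O(N^3) step
        a = np.linalg.solve(L.T, np.linalg.solve(L, t))    # a = C_N^{-1} t
        v = np.linalg.solve(L, k)
        return k.T @ a, c - np.sum(v**2, axis=0)

    # Toy usage: recover sin(2*pi*x) from 30 noisy samples.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, (30, 1))
    t = np.sin(2.0 * np.pi * X[:, 0]) + rng.normal(0.0, 0.2, 30)
    X_star = np.linspace(0.0, 1.0, 5)[:, None]
    mean, var = gp_predict(X, t, X_star)
    print(mean, var)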

Learning Hyperparameters

Empirical Bayes methods: maximize the marginal likelihood p(t | θ) with respect to the kernel hyperparameters θ.
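The objective and its gradient, standard results for a zero-mean GP whose covariance C_N depends on hyperparameters θ_i:

\[ \ln p(t \mid \theta) = -\frac{1}{2} \ln |C_N| - \frac{1}{2}\, t^\top C_N^{-1} t - \frac{N}{2} \ln(2\pi) \]

\[ \frac{\partial}{\partial \theta_i} \ln p(t \mid \theta) = -\frac{1}{2} \mathrm{Tr}\!\left( C_N^{-1} \frac{\partial C_N}{\partial \theta_i} \right) + \frac{1}{2}\, t^\top C_N^{-1} \frac{\partial C_N}{\partial \theta_i} C_N^{-1} t \]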

Automatic Relevance Determination

Consider a two-dimensional problem, with a separate precision hyperparameter η_i for each input dimension:
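The ARD form of the kernel, as in PRML, which this lecture appears to follow:

\[ k(x, x') = \theta_0 \exp\!\left( -\frac{1}{2} \sum_i \eta_i (x_i - x'_i)^2 \right) \]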

Maximizing the marginal likelihood will make certain η_i small, reducing the relevance of the corresponding input dimensions to the prediction.

Example

t = sin(2π x1)

x2 = x1 + n, where n is Gaussian noise (a noisy copy of x1)

x3 = ε, independent Gaussian noise (carrying no information about t)

ARD should therefore keep the precision η1 large, give η2 an intermediate value, and drive η3 toward zero.

Gaussian Processes for Classification

Likelihood: p(t_n | a_n) = σ(a_n)^{t_n} (1 − σ(a_n))^{1−t_n}, with binary targets t_n ∈ {0, 1}, latent function values a_n = a(x_n), and logistic sigmoid σ

GP Prior: p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1})

Covariance function: C(x_n, x_m) = k(x_n, x_m) + ν δ_{nm}, where the small constant ν > 0 keeps C well conditioned

Sample from GP Prior

Predictive Distribution

No analytical solution. We must approximate the integral

p(t_{N+1} = 1 | t_N) = ∫ p(t_{N+1} = 1 | a_{N+1}) p(a_{N+1} | t_N) da_{N+1}

using one of:

• Laplace's method
• Variational Bayes
• Expectation propagation

Laplace’s method for GP Classification (1)

Laplace’s method for GP Classification (2)

Taylor expansion of Ψ(a_N) = ln p(a_N) + ln p(t_N | a_N) around its mode:
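The required derivatives are standard for the logistic-sigmoid likelihood combined with the GP prior; σ_N is the vector with elements σ(a_n), and W_N is diagonal with entries σ(a_n)(1 − σ(a_n)):

\[ \nabla \Psi(a_N) = t_N - \sigma_N - C_N^{-1} a_N, \qquad \nabla\nabla \Psi(a_N) = -W_N - C_N^{-1} \]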

Laplace’s method for GP Classification (3)

Newton-Raphson update:
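Because the sigmoid likelihood leaves the mode with no closed form, it is found iteratively; the standard update obtained from the gradient and Hessian above is

\[ a_N^{\mathrm{new}} = C_N (I + W_N C_N)^{-1} \{ t_N - \sigma_N + W_N a_N \} \]

At convergence the mode satisfies a_N* = C_N (t_N − σ_N).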

Laplace’s method for GP Classification (4)

Gaussian approximation: q(a_N) = N(a_N | a_N*, H^{-1}), where H = W_N + C_N^{-1} is the negative Hessian of Ψ evaluated at the mode a_N*.

Laplace’s method for GP Classification (5)

Question: How to get the mean and the variance above?

Predictive Distribution
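Under the Laplace approximation, the mean and variance of the latent value at a new point are the standard results

\[ E[a_{N+1} \mid t_N] = k^\top (t_N - \sigma_N), \qquad \mathrm{var}[a_{N+1} \mid t_N] = c - k^\top (W_N^{-1} + C_N)^{-1} k \]

and the class probability follows by pushing this Gaussian through the sigmoid, e.g. via the approximation ∫ σ(a) N(a | μ, σ²) da ≈ σ(κ(σ²) μ) with κ(σ²) = (1 + πσ²/8)^{-1/2}.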

Example
