Page 1: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models – Neil D. Lawrence

Presenters: Sean Golliher and Derek Reimanis


Page 2: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Claims and Contributions

“Unifying Approach”: Views previous methods as less general than this approach and proves they are special cases

Presents the unified algorithm: Maximum Entropy Unfolding (MEU)

Improves Locally Linear Embedding (LLE) and introduces ALLE (an acyclic version)

Introduces a third algorithm, DRILL, which estimates the dimensional structure from the data (rather than from K nearest neighbors)

Mostly theoretical paper


Page 3: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Features of Spectral Methods

Spectral Methods: Low-dimensional representations are derived from eigenvectors of specially constructed matrices.

Methods Share a Common Approach:

Compute the nearest neighbors of each input pattern

Construct a weighted graph based on the neighborhood relations

Derive a matrix from the weighted graph

Produce a representation from the eigenvectors of that matrix

Keep this in mind when trying to understand the unification (a generic sketch of this pipeline is given below)
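A minimal sketch of this shared pipeline (not from the paper), using a Laplacian-eigenmaps-style choice of graph matrix and assuming numpy and scikit-learn are available; individual methods differ in how the edges are weighted and which matrix is eigendecomposed:

import numpy as np
from sklearn.neighbors import kneighbors_graph

def spectral_embedding_sketch(X, n_neighbors=10, n_components=2):
    # Steps 1-2: k-nearest-neighbour graph, symmetrised into a binary affinity matrix.
    A = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    A = np.maximum(A, A.T)
    # Step 3: derive a matrix from the graph (here the unnormalised graph Laplacian).
    L = np.diag(A.sum(axis=1)) - A
    # Step 4: coordinates from the eigenvectors with the smallest non-zero eigenvalues
    # (the zero eigenvalue corresponds to the constant eigenvector, which is dropped).
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:n_components + 1]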


Page 4: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Most Common Examples

PCA: Technique that preserves the covariance structure of the data

MDS (Multidimensional Scaling): preserves inner products of the input patterns, derived from pairwise dissimilarities such as $\|x_i - x_j\|^2$ (a minimal sketch follows after this list)

ISOMAP: Preserves pairwise distances measured along a sub-manifold of the sample space; a variant of MDS which uses geodesics along the sub-manifold

geodesic: a curve whose tangent vectors remain parallel if they are transported along it
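As a concrete instance of the MDS family, a minimal classical-MDS sketch (assuming numpy and a precomputed matrix of squared pairwise distances; ISOMAP would first replace these with graph-geodesic distances):

import numpy as np

def classical_mds_sketch(D_sq, n_components=2):
    # D_sq: n x n matrix of squared pairwise distances ||x_i - x_j||^2.
    n = D_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ D_sq @ J                        # doubly-centred inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))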


Page 5: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Locally Linear Embedding (LLE)

Tries to capture non-linear relationships by preserving the local linear structure of the data.

Four Step Algorithm:

Step 1: Compute the k nearest neighbors of each high-dimensional input pattern x_i and create a directed graph whose edges indicate nearest-neighbor relations.

Step 2: Assign weights W_{xy} to the edges of the graph. Each input pattern and its neighbors can be viewed as samples from a locally linear patch of a lower-dimensional sub-manifold. Find the “reconstruction weights” that provide a local linear fit of the k + 1 points at each of the n points in the data set by minimizing the error function (a sketch of this step follows below):

$$E(\mathbf{W} \mid D) = \sum_{x \in D} \Big\| x - \sum_{y \in N(x)} W_{xy}\, y \Big\|^{2}$$
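A minimal sketch of this weight-fitting step (assuming numpy; the small regularizer added to each local Gram matrix is a common practical choice, not something stated on the slide):

import numpy as np

def lle_weights_sketch(X, neighbors, reg=1e-3):
    # X: (n, d) data matrix; neighbors[i] is an index array of the k neighbours of point i.
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = neighbors[i]
        Z = X[idx] - X[i]                              # centre the neighbourhood on x_i
        C = Z @ Z.T                                    # local k x k Gram matrix
        C += reg * np.trace(C) * np.eye(len(idx))      # regularise for numerical stability
        w = np.linalg.solve(C, np.ones(len(idx)))      # solve C w = 1
        W[i, idx] = w / w.sum()                        # reconstruction weights sum to one
    return W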


Page 6: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Locally Linear Embedding (LLE) Cont’d

Step 3: After the n local models are constructed, keep the weights fixed and find a new projection. This becomes a minimization problem similar to MDS:

$$E(\mathbf{Z} \mid D) = \sum_{x \in D} \Big\| z_{x} - \sum_{y \in N(x)} W_{xy}\, z_{y} \Big\|^{2}$$

Collecting the terms into a matrix M_{xy} gives the objective (error function) in matrix form:

$$E(\mathbf{Z} \mid \mathbf{W}) = \sum_{x, y} M_{xy}\, z_{x}^{\top} z_{y}$$

Find the m + 1 eigenvectors of M with the smallest eigenvalues and drop the bottom (constant) one; the remaining eigenvectors give the set of coordinates for the new m-dimensional space (a sketch of this step follows below).
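A minimal sketch of this eigenvector step (assuming numpy and the weight matrix W from the sketch above; M = (I − W)ᵀ(I − W) is the standard matrix form of the LLE objective):

import numpy as np

def lle_embedding_sketch(W, n_components=2):
    # W: (n, n) reconstruction-weight matrix from the previous step.
    n = W.shape[0]
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W                 # M = (I - W)^T (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)        # eigenvalues returned in ascending order
    # Discard the bottom (constant) eigenvector and keep the next n_components.
    return eigvecs[:, 1:n_components + 1]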


Page 7: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU)

Kernel PCA idea: Apply the “kernel trick” to map into a higher-dimensional feature space, then perform PCA on the covariance in that space.

This increases the feature space rather than reducing the dimensions, which motivated the development of Maximum Variance Unfolding (MVU).

MEU: Since entropy is related to variance, they use entropy and obtain a probabilistic model. They derive a density p(Y) directly (not over squared distances) by constraining the expected squared inter-point distances d_{ij} of any two samples y_i and y_j.


Page 8: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

The observations are squared inter-point distances; they derive a density p(Y) that gives rise to those distances.

The entropy is computed relative to a base density using the KL divergence, so two distributions are needed:

$$H = -\int p(\mathbf{Y}) \log \frac{p(\mathbf{Y})}{m(\mathbf{Y})}\, d\mathbf{Y}$$

The base density m(Y) is assumed to be a very broad, spherical Gaussian, i.e. one whose precision γ can be assumed to be small.

The density that minimizes the KL divergence under the constraints on the expectations is:

$$p(\mathbf{Y}) = \prod_{j=1}^{p} \frac{|\mathbf{L} + \gamma\mathbf{I}|^{1/2}}{\tau^{n/2}} \exp\!\left(-\tfrac{1}{2}\, \mathbf{y}_{j}^{\top} (\mathbf{L} + \gamma\mathbf{I})\, \mathbf{y}_{j}\right)$$

(here τ denotes 2π and y_j is the j-th column of Y, i.e. one feature observed across all n points)

Where L is a special matrix whose off-diagonal elements contain the information from D, similar to the previous examples.

The key point is that this assumes independence of the density across data features (the densities over features are multiplied).
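Taking the log of this product over features gives the following expression (a small worked expansion, not shown on the slide, using the identity Σ_j y_jᵀA y_j = tr(A Y Yᵀ)):

$$\log p(\mathbf{Y}) = \frac{p}{2}\log|\mathbf{L}+\gamma\mathbf{I}| \;-\; \frac{np}{2}\log\tau \;-\; \frac{1}{2}\,\mathrm{tr}\!\left((\mathbf{L}+\gamma\mathbf{I})\,\mathbf{Y}\mathbf{Y}^{\top}\right)$$

This is the log-likelihood whose maximization (and approximations to it) the later slides relate to LLE and DRILL.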


Page 9: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

GMRF (Gaussian Markov random field): a (finite-dimensional) random vector following a multivariate normal (Gaussian) distribution, with conditional independences encoded by a graph.

Since GRFs provide an alternative approach to reducing the number of parameters in the covariance matrix, they use the GRF approach, which is common in spatial statistics.

Independence in their model is expressed over data features instead of data points; the features are assumed i.i.d.

For their model the number of parameters does not increase with the number of data points. As the number of features increases there is a clear “blessing of dimensionality”.


Page 10: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

With the GRF representation they show that, with some manipulation, the expected squared distances ⟨d_{ij}⟩ can be computed from the covariance matrix:

$$\langle d_{i,j} \rangle = \frac{p}{2}\left(k_{i,i} - 2k_{i,j} + k_{j,j}\right)$$

This is of the same form as the standard relation between distances and similarities used in kernel PCA methods.
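A minimal sketch of evaluating this relation for a whole covariance matrix at once (assuming numpy; K plays the role of the covariance matrix (L + γI)⁻¹ and p is the number of features, with the scaling kept exactly as written on the slide):

import numpy as np

def expected_squared_distances_sketch(K, p):
    # Kernel-to-distance identity k_ii - 2 k_ij + k_jj, applied to every pair (i, j).
    diag = np.diag(K)
    return (p / 2.0) * (diag[:, None] - 2.0 * K + diag[None, :])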


Page 11: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

MEU and LLE

They show that LLE is approximating maximum likelihood in their MEU model

Finding estimates of the parameters that maximize the probability of observing the data that we have

Thus the claim of a “unifying model”: LLE is a special case of MEU.

Also showed that using pseudo-likelihood in the MEU model reduces to the generalization they presented in equation (9) for LLE.

This uses the fact that the joint probability density of a GRF can be represented as a factorization over the cliques of the graph.

The pseudo-likelihood reduces the product of distributions presented earlier to the LLE objective in its matrix form.


Page 12: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Acyclic Locally Linear Embedding (ALLE)

The pseudo-likelihood is an approximation to the joint probability distribution of a collection of random variables.


If they force the matrix M to be lower triangular (an acyclic dependency structure), the true log-likelihood log p(Y) factors into n independent regression problems.

This approach requires an ordering of the data because of the restriction for j > i in the matrix M (the implied factorization is sketched below).
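A sketch of the factorization implied by the lower-triangular restriction (this equation is not on the slide; N(i) denotes the neighbours of point i, restricted to points earlier in the ordering):

$$p(\mathbf{Y}) = \prod_{i=1}^{n} p\big(\mathbf{y}_{i,:} \,\big|\, \{\mathbf{y}_{j,:} : j \in N(i),\; j < i\}\big)$$

Each factor is a Gaussian regression of point i onto its earlier neighbours, which is why the log-likelihood splits into n independent regression problems.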


Page 13: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Dimensionality reduction through Regularization of the Inverse covariance in the Log Likelihood (DRILL)

Optimization technique that applies L1 regularization of the dependencies to estimate the graph structure

How does this compare to the nearest-neighbor approaches?

$$E(\boldsymbol{\Lambda}) = -\log p(\mathbf{Y}) + \sum_{i<j} \|\lambda_{i,j}\|_{1}$$

Minimize E(Λ) through Least Absolute Shrinkage and Selection Operator (LASSO) regression
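This is not the paper's algorithm, but a closely related off-the-shelf example of L1-regularized inverse-covariance (graph-structure) estimation, assuming scikit-learn is available; note that DRILL places the penalty over dependencies between data points rather than between features, so this is only an analogy:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # toy data: 200 samples, 10 variables

model = GraphicalLasso(alpha=0.1)     # alpha controls the strength of the L1 penalty
model.fit(X)

precision = model.precision_          # estimated sparse inverse covariance
edges = np.abs(precision) > 1e-6      # non-zero entries define the learned graph structure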


Page 14: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Gaussian process latent variable model (GP-LVM) likelihood

A process that maps data from a latent space to the (non-latent) data space, where the locations of the latent points are determined by maximizing a Gaussian likelihood.

This allows us to map from a low-dimensional space $\mathbb{R}^d$ (the latent space) to a high-dimensional space $\mathbb{R}^D$ (the data space).

GP-LVM scoring

Define the hyperparameters $\theta = (\sigma, \theta_{\mathrm{rbf}}, \theta_{\mathrm{noise}})$

$$\mathrm{lml} = \max_{\sigma,\,\theta_{\mathrm{rbf}},\,\theta_{\mathrm{noise}}} \log p(\mathbf{Y} \mid \mathbf{X}, \theta)$$

Find hyperparameter values through gradient descent
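A minimal sketch of evaluating this score for fixed X and θ (assuming numpy; the RBF-plus-noise kernel parameterization below is an assumption for illustration, with independent GPs over the D data dimensions):

import numpy as np

def gplvm_log_marginal_likelihood_sketch(Y, X, sigma=1.0, theta_rbf=1.0, theta_noise=0.1):
    # Y: (n, D) observed data; X: (n, d) latent positions.
    n, D = Y.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * np.exp(-0.5 * sq_dists / theta_rbf**2) + theta_noise**2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)                 # stable log-determinant
    alpha = np.linalg.solve(K, Y)                       # K^{-1} Y
    # Sum of independent multivariate-Gaussian log densities, one per data dimension.
    return -0.5 * (D * logdet + np.trace(Y.T @ alpha) + n * D * np.log(2 * np.pi))

Maximizing this quantity over X and the hyperparameters (e.g. by gradient descent on its negative) gives the score used to compare the different embeddings in the experiments.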


Page 15: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Motion Capture Experiments


Page 16: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Motion Capture Experiments


Page 17: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Robot Navigation Experiments


Page 18: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Robot Navigation Experiments


Page 19: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

GP-LVM scores for experiments


Page 20: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Structural learning for DRILL


Page 21: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Discussion Questions

Is there an equivalent “no free lunch theorem” for dimensionality reduction? Why or why not?

Were you convinced that this is a unifying approach?

If this is a unifying approach why is MEU the worst performer?

Does a theoretical paper such as this need more experiments?

Is the GP-LVM score a good performance measure for comparing these methods?

Are the experiments themselves introducing any bias in the results?

Why do you think ALLE outperforms MEU?

Is the “blessing of dimensionality” a valid claim?
