Page 1: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models – Neil D. Lawrence

Presenters: Sean Golliher and Derek Reimanis


Page 2: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Claims and Contributions

“Unifying Approach”: Views previous methods as less general than this approach and proves they are special cases

Presents the unified algorithm: Maximum Entropy Unfolding (MEU)

Improves Locally Linear Embedding (LLE) and introduces ALLE (an acyclic version)

Introduces a third algorithm, DRILL, which estimates the dimensional structure from the data (rather than from K nearest neighbors)

Mostly theoretical paper


Page 3: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Features of Spectral Methods

Spectral Methods: Low-dimensional representations are derived from eigenvectors of specially constructed matrices.

Methods Share a Common Approach:

Compute the nearest neighbors of each input pattern

Construct a weighted graph based on the neighborhood relations

Derive a matrix from the weighted graph

Produce a representation from the eigenvectors of that matrix

Keep this in mind when trying to understand the unification (a generic sketch of this pipeline is given below)
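A minimal sketch of this shared pipeline (not from the paper), using a Laplacian-eigenmaps-style choice of graph matrix and assuming numpy and scikit-learn are available; individual methods differ in how the edges are weighted and which matrix is eigendecomposed:

import numpy as np
from sklearn.neighbors import kneighbors_graph

def spectral_embedding_sketch(X, n_neighbors=10, n_components=2):
    # Steps 1-2: k-nearest-neighbour graph, symmetrised into a binary affinity matrix.
    A = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    A = np.maximum(A, A.T)
    # Step 3: derive a matrix from the graph (here the unnormalised graph Laplacian).
    L = np.diag(A.sum(axis=1)) - A
    # Step 4: coordinates from the eigenvectors with the smallest non-zero eigenvalues
    # (the zero eigenvalue corresponds to the constant eigenvector, which is dropped).
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:n_components + 1]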


Page 4: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Most Common Examples

PCA: Technique that preserves the covariance structure of the data

MDS (Multidimensional Scaling): preserves inner products of the input patterns, derived from pairwise dissimilarities such as $\|x_i - x_j\|^2$ (a minimal sketch follows after this list)

ISOMAP: Preserves pairwise distances measured along a sub-manifold of the sample space; a variant of MDS which uses geodesics along the sub-manifold

geodesic: a curve whose tangent vectors remain parallel if they are transported along it
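As a concrete instance of the MDS family, a minimal classical-MDS sketch (assuming numpy and a precomputed matrix of squared pairwise distances; ISOMAP would first replace these with graph-geodesic distances):

import numpy as np

def classical_mds_sketch(D_sq, n_components=2):
    # D_sq: n x n matrix of squared pairwise distances ||x_i - x_j||^2.
    n = D_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ D_sq @ J                        # doubly-centred inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))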


Page 5: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Locally Linear Embedding (LLE)

Tries to capture non-linear relationships by preserving the local linear structure of the data.

Four Step Algorithm:

Step 1: Compute the k nearest neighbors of each high-dimensional input pattern x_i and create a directed graph whose edges indicate nearest-neighbor relations.

Step 2: Assign weights W_{xy} to the edges of the graph. Each input pattern and its neighbors can be viewed as samples from a locally linear patch of a lower-dimensional sub-manifold. Find the “reconstruction weights” that provide a local linear fit of the k + 1 points at each of the n points in the data set by minimizing the error function (a sketch of this step follows below):

$$E(\mathbf{W} \mid D) = \sum_{x \in D} \Big\| x - \sum_{y \in N(x)} W_{xy}\, y \Big\|^{2}$$
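A minimal sketch of this weight-fitting step (assuming numpy; the small regularizer added to each local Gram matrix is a common practical choice, not something stated on the slide):

import numpy as np

def lle_weights_sketch(X, neighbors, reg=1e-3):
    # X: (n, d) data matrix; neighbors[i] is an index array of the k neighbours of point i.
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = neighbors[i]
        Z = X[idx] - X[i]                              # centre the neighbourhood on x_i
        C = Z @ Z.T                                    # local k x k Gram matrix
        C += reg * np.trace(C) * np.eye(len(idx))      # regularise for numerical stability
        w = np.linalg.solve(C, np.ones(len(idx)))      # solve C w = 1
        W[i, idx] = w / w.sum()                        # reconstruction weights sum to one
    return W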


Page 6: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Locally Linear Embedding (LLE) Cont’d

Step 3: After the n local models are constructed, keep the weights fixed and find a new projection. This becomes a minimization problem similar to MDS:

$$E(\mathbf{Z} \mid D) = \sum_{x \in D} \Big\| z_{x} - \sum_{y \in N(x)} W_{xy}\, z_{y} \Big\|^{2}$$

Collecting the terms into a matrix M_{xy} gives the objective (error function) in matrix form:

$$E(\mathbf{Z} \mid \mathbf{W}) = \sum_{x, y} M_{xy}\, z_{x}^{\top} z_{y}$$

Find the m + 1 eigenvectors of M with the smallest eigenvalues and drop the bottom (constant) one; the remaining eigenvectors give the set of coordinates for the new m-dimensional space (a sketch of this step follows below).
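A minimal sketch of this eigenvector step (assuming numpy and the weight matrix W from the sketch above; M = (I − W)ᵀ(I − W) is the standard matrix form of the LLE objective):

import numpy as np

def lle_embedding_sketch(W, n_components=2):
    # W: (n, n) reconstruction-weight matrix from the previous step.
    n = W.shape[0]
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W                 # M = (I - W)^T (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)        # eigenvalues returned in ascending order
    # Discard the bottom (constant) eigenvector and keep the next n_components.
    return eigvecs[:, 1:n_components + 1]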


Page 7: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU)

Kernel PCA idea: Apply the “kernel trick” to map into a higher-dimensional feature space, then perform PCA on the covariance in that space.

This increases the feature space rather than reducing the dimensions, which motivated the development of Maximum Variance Unfolding (MVU).

MEU: Since entropy is related to variance, they use entropy and obtain a probabilistic model. They derive a density p(Y) directly (not over squared distances) by constraining the expected squared inter-point distances d_{ij} of any two samples y_i and y_j.


Page 8: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

The observations are squared inter-point distances; they derive a density p(Y) that gives rise to those distances.

The entropy is computed relative to a base density using the KL divergence, so two distributions are needed:

$$H = -\int p(\mathbf{Y}) \log \frac{p(\mathbf{Y})}{m(\mathbf{Y})}\, d\mathbf{Y}$$

The base density m(Y) is assumed to be a very broad, spherical Gaussian, i.e. one whose precision γ can be assumed to be small.

The density that minimizes the KL divergence under the constraints on the expectations is:

$$p(\mathbf{Y}) = \prod_{j=1}^{p} \frac{|\mathbf{L} + \gamma\mathbf{I}|^{1/2}}{\tau^{n/2}} \exp\!\left(-\tfrac{1}{2}\, \mathbf{y}_{j}^{\top} (\mathbf{L} + \gamma\mathbf{I})\, \mathbf{y}_{j}\right)$$

(here τ denotes 2π and y_j is the j-th column of Y, i.e. one feature observed across all n points)

Where L is a special matrix whose off-diagonal elements contain the information from D, similar to the previous examples.

The key point is that this assumes independence of the density across data features (the densities over features are multiplied).
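Taking the log of this product over features gives the following expression (a small worked expansion, not shown on the slide, using the identity Σ_j y_jᵀA y_j = tr(A Y Yᵀ)):

$$\log p(\mathbf{Y}) = \frac{p}{2}\log|\mathbf{L}+\gamma\mathbf{I}| \;-\; \frac{np}{2}\log\tau \;-\; \frac{1}{2}\,\mathrm{tr}\!\left((\mathbf{L}+\gamma\mathbf{I})\,\mathbf{Y}\mathbf{Y}^{\top}\right)$$

This is the log-likelihood whose maximization (and approximations to it) the later slides relate to LLE and DRILL.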


Page 9: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

GMRF (Gaussian Markov random field): a (finite-dimensional) random vector following a multivariate normal (Gaussian) distribution, with conditional independences encoded by a graph.

Since GRFs provide an alternative approach to reducing the number of parameters in the covariance matrix, they use the GRF approach, which is common in spatial statistics.

Independence in their model is expressed over data features instead of data points; the features are assumed i.i.d.

For their model the number of parameters does not increase with the number of data points. As the number of features increases there is a clear “blessing of dimensionality”.


Page 10: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Maximum Entropy Unfolding (MEU) Cont’d

With the GRF representation they show that, with some manipulation, the expected squared distances ⟨d_{ij}⟩ can be computed from the covariance matrix:

$$\langle d_{i,j} \rangle = \frac{p}{2}\left(k_{i,i} - 2k_{i,j} + k_{j,j}\right)$$

This is of the same form as the standard relation between distances and similarities used in kernel PCA methods.
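A minimal sketch of evaluating this relation for a whole covariance matrix at once (assuming numpy; K plays the role of the covariance matrix (L + γI)⁻¹ and p is the number of features, with the scaling kept exactly as written on the slide):

import numpy as np

def expected_squared_distances_sketch(K, p):
    # Kernel-to-distance identity k_ii - 2 k_ij + k_jj, applied to every pair (i, j).
    diag = np.diag(K)
    return (p / 2.0) * (diag[:, None] - 2.0 * K + diag[None, :])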


Page 11: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

MEU and LLE

They show that LLE is approximating maximum likelihood in their MEU model

Finding estimates of the parameters that maximize the probability of observing the data that we have

Thus the claim of a “unifying model”: LLE is a special case of MEU.

Also showed that using pseudo-likelihood in the MEU model reduces to the generalization they presented in equation (9) for LLE.

This uses the fact that the joint probability density of a GRF can be represented as a factorization over the cliques of the graph.

The pseudo-likelihood reduces the product of distributions presented earlier to the LLE objective in its matrix form.


Page 12: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Acyclic Locally Linear Embedding (ALLE)

The pseudo-likelihood is an approximation to the joint probability distribution of a collection of random variables.


If they force the matrix M to be lower triangular (an acyclic dependency structure), the true log-likelihood log p(Y) factors into n independent regression problems.

This approach requires an ordering of the data because of the restriction for j > i in the matrix M (the implied factorization is sketched below).
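A sketch of the factorization implied by the lower-triangular restriction (this equation is not on the slide; N(i) denotes the neighbours of point i, restricted to points earlier in the ordering):

$$p(\mathbf{Y}) = \prod_{i=1}^{n} p\big(\mathbf{y}_{i,:} \,\big|\, \{\mathbf{y}_{j,:} : j \in N(i),\; j < i\}\big)$$

Each factor is a Gaussian regression of point i onto its earlier neighbours, which is why the log-likelihood splits into n independent regression problems.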


Page 13: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Dimensionality reduction through Regularization of the Inverse covariance in the Log Likelihood (DRILL)

Optimization technique that applies L1 regularization of the dependencies to estimate the graph structure

How does this compare to the nearest-neighbor approaches?

$$E(\boldsymbol{\Lambda}) = -\log p(\mathbf{Y}) + \sum_{i<j} \|\lambda_{i,j}\|_{1}$$

Minimize E(Λ) through Least Absolute Shrinkage and Selection Operator (LASSO) regression
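This is not the paper's algorithm, but a closely related off-the-shelf example of L1-regularized inverse-covariance (graph-structure) estimation, assuming scikit-learn is available; note that DRILL places the penalty over dependencies between data points rather than between features, so this is only an analogy:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # toy data: 200 samples, 10 variables

model = GraphicalLasso(alpha=0.1)     # alpha controls the strength of the L1 penalty
model.fit(X)

precision = model.precision_          # estimated sparse inverse covariance
edges = np.abs(precision) > 1e-6      # non-zero entries define the learned graph structure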


Page 14: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Gaussian process latent variable model (GP-LVM) likelihood

A process that maps data from a latent space to the (non-latent) data space, where the locations of the latent points are determined by maximizing a Gaussian likelihood.

This allows us to map from a low-dimensional space $\mathbb{R}^d$ (the latent space) to a high-dimensional space $\mathbb{R}^D$ (the data space).

GP-LVM scoring

Define the hyperparameters $\theta = (\sigma, \theta_{\mathrm{rbf}}, \theta_{\mathrm{noise}})$

$$\mathrm{lml} = \max_{\sigma,\,\theta_{\mathrm{rbf}},\,\theta_{\mathrm{noise}}} \log p(\mathbf{Y} \mid \mathbf{X}, \theta)$$

Find hyperparameter values through gradient descent
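A minimal sketch of evaluating this score for fixed X and θ (assuming numpy; the RBF-plus-noise kernel parameterization below is an assumption for illustration, with independent GPs over the D data dimensions):

import numpy as np

def gplvm_log_marginal_likelihood_sketch(Y, X, sigma=1.0, theta_rbf=1.0, theta_noise=0.1):
    # Y: (n, D) observed data; X: (n, d) latent positions.
    n, D = Y.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * np.exp(-0.5 * sq_dists / theta_rbf**2) + theta_noise**2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)                 # stable log-determinant
    alpha = np.linalg.solve(K, Y)                       # K^{-1} Y
    # Sum of independent multivariate-Gaussian log densities, one per data dimension.
    return -0.5 * (D * logdet + np.trace(Y.T @ alpha) + n * D * np.log(2 * np.pi))

Maximizing this quantity over X and the hyperparameters (e.g. by gradient descent on its negative) gives the score used to compare the different embeddings in the experiments.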


Page 15: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Motion Capture Experiments


Page 16: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Motion Capture Experiments


Page 17: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Robot Navigation Experiments


Page 18: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Robot Navigation Experiments


Page 19: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

GP-LVM scores for experiments


Page 20: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Structural learning for DRILL


Page 21: A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction:

Discussion Questions

Is there an equivalent “no free lunch theorem” for dimensionality reduction? Why or why not?

Were you convinced that this is a unifying approach?

If this is a unifying approach why is MEU the worst performer?

Does a theoretical paper such as this need more experiments?

Is the GP-LVM score a good performance measure for comparing these methods?

Are the experiments themselves introducing any bias in the results?

Why do you think ALLE outperforms MEU?

Is the “blessing of dimensionality” a valid claim?
