Dimensionality Reduction
Posted on 29-Jan-2016
E.G.M. Petrakis Dimensionality Reduction 1
Dimensionality Reduction
Given N vectors in n dimensions, find the k most important axes to project them on; k is user defined (k < n).
Applications: information retrieval and indexing: identify the k most important features, or reduce indexing dimensions for faster retrieval (low-dimensional indices are faster).
Techniques
Eigenvalue analysis techniques [NR'92]: the Karhunen-Loeve (K-L) transform and Singular Value Decomposition (SVD); both need O(N^2) time.
FastMap [Faloutsos & Lin 95]: dimensionality reduction and mapping of objects to vectors in O(N) time.
Mathematical Preliminaries
For an n x n square matrix S, a unit vector x, and a scalar value λ with Sx = λx: x is an eigenvector of S and λ is an eigenvalue of S.
The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real.
Rank r of a matrix: the maximum number of independent columns or rows.
Example 1
Intuition: S defines an affine transform y = Sx that involves scaling and rotation.
Eigenvectors: unit vectors along the new directions; eigenvalues denote the scaling.

S = | 2  1 |
    | 1  3 |

λ1 = 3.62, u1 = [0.52, 0.85]^T (eigenvector of the major axis)
λ2 = 1.38, u2 = [0.85, -0.52]^T
Example 2
If S is real and symmetric (S = S^T), then it can be written as S = U Λ U^T:
the columns of U are the eigenvectors of S; U is column orthogonal (U U^T = I); Λ is diagonal with the eigenvalues of S.

S = | 2  1 | = | 0.52   0.85 | | 3.62  0    | | 0.52   0.85 |
    | 1  3 |   | 0.85  -0.52 | | 0     1.38 | | 0.85  -0.52 |
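The decomposition of the example matrix S = [[2, 1], [1, 3]] can be checked numerically; a minimal sketch with numpy:

```python
import numpy as np

# Symmetric matrix from the example
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, U = np.linalg.eigh(S)

# S = U @ diag(eigvals) @ U.T reconstructs the original matrix
S_rec = U @ np.diag(eigvals) @ U.T

print(np.round(eigvals, 2))    # eigenvalues approx 1.38 and 3.62
print(np.allclose(S, S_rec))   # True
```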
Karhunen-Loeve (K-L)
Project onto a k-dimensional space (k < n), minimizing the error of the projections (sum of squared differences).
K-L gives a linear combination of axes sorted by importance; keep the first k dimensions.
[Figure: 2-dim points and the 2 K-L directions; for k = 1 keep x']
E.G.M. Petrakis Dimensionality Reduction 7
Computation of K-L
Put the N vectors as rows in A = [a_ij].
Compute B = [a_ij - ā_j], where ā_j = (1/N) Σ_{i=1..N} a_ij is the average of column j.
Covariance matrix: C = B^T B.
Compute the eigenvectors of C.
Sort them in decreasing eigenvalue order.
Approximate each object by its projections on the directions of the first k eigenvectors.
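The steps above can be sketched in Python with numpy (a minimal illustration, not an optimized implementation; the function name is mine):

```python
import numpy as np

def kl_transform(A, k):
    """Project the rows of A onto the first k K-L (principal) directions."""
    # B: subtract the column means (shift the origin to the center of gravity)
    B = A - A.mean(axis=0)
    # Covariance matrix C = B^T B (attribute-to-attribute similarity)
    C = B.T @ B
    # Eigenvectors of C, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    U_k = eigvecs[:, order[:k]]
    # Approximate each object by its projections on the first k eigenvectors
    return B @ U_k

# The example from the next slide: vectors [1 2], [1 1], [0 0]
A = np.array([[1.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
print(kl_transform(A, 1))   # 1-D coordinates along the main K-L direction
```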
Intuition
B shifts the origin to the center of gravity of the vectors and has zero column mean.
C represents attribute-to-attribute similarity; C is square, real, and symmetric.
Eigenvectors and eigenvalues are computed on C, not on A.
C denotes the affine transform that minimizes the error.
Approximate each vector by its projections along the first k eigenvectors.
Example
Input vectors: [1 2], [1 1], [0 0]. The column averages are 2/3 and 1.

A = | 1  2 |    B = |  1/3   1 |    C = B^T B = | 2/3  1 |
    | 1  1 |        |  1/3   0 |                | 1    2 |
    | 0  0 |        | -2/3  -1 |

λ1 = 2.53, u1 = [0.47, 0.88]^T
λ2 = 0.13, u2 = [-0.88, 0.47]^T
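The numbers in this example can be reproduced with numpy; a quick check:

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
B = A - A.mean(axis=0)     # column averages are 2/3 and 1
C = B.T @ B                # covariance matrix, C = [[2/3, 1], [1, 2]]

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
print(C)
print(eigvals)             # approx 0.13 and 2.53
```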
SVD
For general rectangular matrices: an N x n matrix (N vectors, n dimensions).
Groups similar entities (documents) together; groups similar terms together, and each group of terms corresponds to a concept.
Given an N x n matrix A, write it as A = U Λ V^T:
U: N x r column-orthogonal matrix (r: rank of A)
Λ: r x r diagonal matrix (non-negative values, in descending order)
V: n x r column-orthogonal matrix
SVD (cont'd)
A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T
u_i, v_i are the column vectors of U, V. SVD identifies rectangular blobs of related values in A. The rank r of A is the number of blobs.
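The rank-1 sum above can be verified with numpy; a small sketch on a synthetic rank-2 matrix (the sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a rank-2 matrix: N=6 "documents", n=4 "terms"
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))    # numerical rank: the number of "blobs"

# A equals the sum of the r rank-1 terms  s_i * u_i v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print(r)                      # 2
print(np.allclose(A, A_rec))  # True
```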
Example
Two types of documents (CS and Medical) and two concepts (groups of terms):
CS: data, information, retrieval
Medical: brain, lung

Term/Document  data  information  retrieval  brain  lung
CS-TR1          1         1           1        0      0
CS-TR2          2         2           2        0      0
CS-TR3          1         1           1        0      0
CS-TR4          5         5           5        0      0
MED-TR1         0         0           0        2      2
MED-TR2         0         0           0        3      3
MED-TR3         0         0           0        1      1
Example (cont'd)

A = U Λ V^T with rank r = 2:

U = | 0.18  0    |    Λ = | 9.64  0    |    V^T = | 0.58  0.58  0.58  0     0    |
    | 0.36  0    |        | 0     5.29 |          | 0     0     0     0.71  0.71 |
    | 0.18  0    |
    | 0.90  0    |
    | 0     0.53 |
    | 0     0.80 |
    | 0     0.27 |

U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
v12 = 0: "data" has 0 similarity with the 2nd concept
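A quick numerical check of this example with numpy (signs of singular vectors may flip, so the check uses absolute values):

```python
import numpy as np

# The term-document matrix from the example
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))            # [9.64 5.29]
print(np.abs(np.round(Vt[0], 2)))    # 0.58 on the three CS terms, 0 elsewhere
```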
SVD and LSI
SVD leads to “Latent Semantic Indexing” (http://lsi.research.telcordia.com/lsi/LSIpapers.html)
Terms that occur together are grouped into concepts
When a user searches for a term, the system determines the relevant concepts to search
LSI maps documents and queries to vectors in the concept space instead of the n-dimensional term space
The concept space has lower dimensionality
Examples of Queries
Find documents with the term "data": translate the query vector q to concept space.
The query is related to the CS concept and unrelated to the medical concept.
LSI returns documents that also contain the terms "retrieval" and "information", which are not specified by the query.
q = [1 0 0 0 0]^T

q_c = V^T q = | 0.58  0.58  0.58  0     0    | q = | 0.58 |
              | 0     0     0     0.71  0.71 |     | 0    |
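Mapping the query to concept space is a single matrix-vector product; sketched with the example's V^T:

```python
import numpy as np

Vt = np.array([[0.58, 0.58, 0.58, 0.00, 0.00],
               [0.00, 0.00, 0.00, 0.71, 0.71]])

# Query: the single term "data" (first coordinate of the term space)
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

q_c = Vt @ q     # query in concept space
print(q_c)       # 0.58 for the CS concept, 0 for the medical one
```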
FastMap
Works with distances and has two roles:
1. Maps objects to vectors so that their distances are preserved (then apply SAMs for indexing).
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible.
Main idea
Pretend that the objects are points in some unknown n-dimensional space; project these points on k mutually orthogonal axes; compute the projections using distances only.
The heart of FastMap is the method that projects two objects on a line: take 2 objects which are far apart (the pivots) and project all objects on the line that connects the pivots.
Project Objects on a Line
Oa, Ob: pivots; Oi: any object.
d_ij: shorthand for D(Oi, Oj).
x_i: the first coordinate in the k-dimensional space.
If Oi is close to Oa, x_i is small.

Apply the cosine law:
d_bi^2 = d_ai^2 + d_ab^2 - 2 x_i d_ab
=> x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
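The cosine-law projection is a one-liner; a sketch, with a sanity check on collinear points where the projection must return the true coordinate:

```python
def project_on_pivot_line(d_ai, d_bi, d_ab):
    """First coordinate of O_i on the line through pivots O_a, O_b."""
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# Points on a line: O_a at 0, O_b at 10, O_i at 3
print(project_on_pivot_line(3.0, 7.0, 10.0))   # 3.0
```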
Choose Pivots
Heuristic: 1. pick an arbitrary object Ob; 2. let Oa be the object farthest from Ob; 3. let Ob be the object farthest from Oa.
Complexity: O(N); the optimal (exact farthest-pair) algorithm would require O(N^2) time.
Steps 2, 3 can be repeated 4-5 times to improve the accuracy of the selection.
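The pivot-selection heuristic can be sketched as follows (a linear-time approximation of the farthest pair; `dist` is any caller-supplied distance function, and the names are mine):

```python
def choose_pivots(objects, dist, iterations=5):
    """Heuristically pick two far-apart objects, O(N) distances per pass."""
    b = objects[0]                                   # arbitrary starting object
    a = b
    for _ in range(iterations):                      # repeat to improve accuracy
        a = max(objects, key=lambda o: dist(o, b))   # farthest from b
        b = max(objects, key=lambda o: dist(o, a))   # farthest from a
    return a, b

# 1-D example: the heuristic finds the two extremes
pts = [0.0, 2.0, 9.0, 4.0, 7.0]
print(choose_pivots(pts, lambda x, y: abs(x - y)))   # (9.0, 0.0)
```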
Extension for Many Dimensions
Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line OaOb.
Project the objects on H and apply the previous step: choose two new pivots; the new x_i is the next object coordinate; repeat until k-dimensional vectors are obtained.
The distance on H is not D; D' denotes the distance between the projected objects.
Distance on the Hyper-Plane H
D’ on H can be computed from the Pythagorean theorem
The ability to compute D’ allows for computing a second line on H etc.
22 )(2)()(' jijiji xxOODOOD
Pythagorean theorem:
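Computing D' needs only the original distance and the coordinates already found; a sketch (the squared distance is clipped at zero, as the Observations slide suggests for non-Euclidean inputs):

```python
import math

def dist_on_hyperplane(D_ij, x_i, x_j):
    """Distance between the projections of O_i, O_j on the hyperplane H."""
    sq = D_ij**2 - (x_i - x_j)**2
    return math.sqrt(max(sq, 0.0))   # clip: non-Euclidean data can make sq negative

# 2-D check: O_i=(3,4), O_j=(0,0); x-coordinates 3 and 0; D=5 -> D'=4
print(dist_on_hyperplane(5.0, 3.0, 0.0))   # 4.0
```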
Algorithm
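The algorithm itself appears as a figure in the original slides; below is a minimal recursive sketch in Python, assuming only a caller-supplied distance function (function and variable names are mine):

```python
import math

def fastmap(objects, dist, k):
    """Map each object to a k-dim vector using pairwise distances only."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d(i, j, col):
        # Distance on the current hyperplane: the original distance with the
        # Pythagorean correction for the coordinates found so far.
        sq = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            sq -= (coords[i][c] - coords[j][c]) ** 2
        return math.sqrt(max(sq, 0.0))   # clip: non-Euclidean data can go negative

    for col in range(k):
        # Pivot heuristic: start anywhere, repeatedly jump to the farthest object.
        a = 0
        for _ in range(5):
            b = max(range(n), key=lambda i: d(a, i, col))
            a = max(range(n), key=lambda i: d(b, i, col))
        d_ab = d(a, b, col)
        if d_ab == 0.0:
            break                        # all remaining projected distances are zero
        # Cosine-law projection on the pivot line gives the next coordinate.
        for i in range(n):
            coords[i][col] = (d(a, i, col)**2 + d_ab**2 - d(b, i, col)**2) / (2.0 * d_ab)
    return coords

# 2-D Euclidean points: with k = 2 the pairwise distances are preserved
pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
out = fastmap(pts, math.dist, 2)
```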
Observations
Complexity: O(kN) distance calculations; k is the desired dimensionality; there are k recursive calls, each taking O(N).
The algorithm records the pivots of each call (dimension) to facilitate queries: a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension.
O(1) computation per step: no need to recompute the pivots.
Observations (cont'd)
The projected vectors can be indexed; mapping on 2-3 dimensions allows for visualization of the data space.
Assumes a Euclidean space (triangle inequality), which is not always true (at least after the second step) because the pivots are approximated: some squared distances become negative; turn negative distances to 0.
Application: Document Vectors
For document vectors d1, d2 at angle θ with cosine similarity(d1, d2):
distance(d1, d2) = 2 sin(θ/2) = sqrt(2 (1 - cos θ)) = sqrt(2 (1 - similarity(d1, d2)))
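The identity above follows from the law of cosines on two unit vectors; a quick numeric check:

```python
import math

def doc_distance(similarity):
    """Distance between two unit document vectors with the given cosine similarity."""
    return math.sqrt(2.0 * (1.0 - similarity))

# Equivalently 2*sin(theta/2): check for theta = 60 degrees
theta = math.pi / 3
print(math.isclose(doc_distance(math.cos(theta)), 2 * math.sin(theta / 2)))  # True
```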
[Figure: FastMap on 10 documents for 2 and 3 dims: (a) k = 2 and (b) k = 3]
References
C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996.
W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988.
LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
C. Faloutsos, K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. of SIGMOD, 1995.