Dimensionality Reduction
E.G.M. Petrakis
Transcript
Page 1: Dimensionality Reduction


Dimensionality Reduction

Given N vectors in n dims, find the k most important axes to project them; k is user defined (k < n)

Applications: information retrieval & indexing
- identify the k most important features, or
- reduce indexing dimensions for faster retrieval (low-dim indices are faster)

Page 2: Dimensionality Reduction


Techniques

Eigenvalue analysis techniques [NR'92]: the Karhunen-Loeve (K-L) transform and Singular Value Decomposition (SVD); both need O(N²) time

FastMap [Faloutsos & Lin 95]: dimensionality reduction and mapping of objects to vectors in O(N) time

Page 3: Dimensionality Reduction


Mathematical Preliminaries

For an n×n square matrix S, a unit vector x, and a scalar value λ with Sx = λx:
- x: eigenvector of S
- λ: eigenvalue of S

The eigenvectors of a symmetric matrix (S = Sᵀ) are mutually orthogonal and its eigenvalues are real

Rank r of a matrix: the maximum number of independent columns or rows

Page 4: Dimensionality Reduction


Example 1

Intuition: S defines an affine transform y = Sx that involves scaling and rotation
- eigenvectors: unit vectors along the new directions
- eigenvalues denote the scaling along each direction

S = [ 2  1 ]
    [ 1  3 ]

λ₁ = 3.62, u₁ = [0.52  0.85]ᵀ (eigenvector of major axis)
λ₂ = 1.38, u₂ = [−0.85  0.52]ᵀ

Page 5: Dimensionality Reduction


Example 2

If S is real and symmetric (S = Sᵀ) then it can be written as S = UΛUᵀ
- the columns of U are eigenvectors of S
- U: column orthogonal (UUᵀ = I)
- Λ: diagonal with the eigenvalues of S

[ 2  1 ]   [ 0.52  −0.85 ] [ 3.62  0    ] [  0.52  0.85 ]
[ 1  3 ] = [ 0.85   0.52 ] [ 0     1.38 ] [ −0.85  0.52 ]
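These numbers can be checked in a few lines of NumPy (a verification sketch, not part of the original slides):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh is for symmetric matrices: real eigenvalues and
# orthonormal eigenvectors, in ascending eigenvalue order
eigvals, U = np.linalg.eigh(S)
print(eigvals)                         # [1.38 3.62]
print(U)                               # columns: unit eigenvectors

# verify the decomposition S = U Λ Uᵀ
L = np.diag(eigvals)
print(np.allclose(S, U @ L @ U.T))     # True
```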

Page 6: Dimensionality Reduction


Karhunen-Loeve (K-L)

Project in a k-dimensional space (k < n) minimizing the error of the projections (sum of squared differences)

K-L gives a linear combination of axes sorted by importance; keep the first k dims

[Figure: 2-dim points and the 2 K-L directions; for k = 1 keep x′]

Page 7: Dimensionality Reduction


Computation of K-L

Put the N vectors in the rows of A = [a_ij]

Compute B = [a_ij − a_p], where a_p = (1/N) Σ_{i=1..N} a_ip is the average of column p

Covariance matrix: C = BᵀB

Compute the eigenvectors of C

Sort them in decreasing eigenvalue order

Approximate each object by its projections on the directions of the first k eigenvectors
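A minimal NumPy sketch of these steps (illustrative, not from the slides; it reproduces the numbers of the Example slide below):

```python
import numpy as np

def kl_transform(A, k):
    """Project the rows of A on the first k K-L directions."""
    B = A - A.mean(axis=0)                # subtract column averages a_p
    C = B.T @ B                           # covariance matrix C = BᵀB
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # sort in decreasing order
    U = eigvecs[:, order[:k]]             # first k eigenvectors
    return B @ U                          # projections of each object

# data of the Example slide: N = 3 vectors in n = 2 dims
A = np.array([[1.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
print(kl_transform(A, k=1))               # 1-dim approximation of each vector
```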

Page 8: Dimensionality Reduction


Intuition

B shifts the origin to the center of gravity of the vectors (it subtracts a_p) and has 0 column mean

C represents attribute-to-attribute similarity
- C is square, real, and symmetric

Eigenvectors and eigenvalues are computed on C, not on A

C denotes the affine transform that minimizes the error

Approximate each vector with its projections along the first k eigenvectors

Page 9: Dimensionality Reduction


Example

Input vectors: [1 2], [1 1], [0 0]; the column averages are 2/3 and 1

    [ 1  2 ]        [  1/3   1 ]               [ 2/3  1 ]
A = [ 1  1 ]    B = [  1/3   0 ]    C = BᵀB = [ 1    2 ]
    [ 0  0 ]        [ −2/3  −1 ]

λ₁ = 2.53, u₁ = [0.47  0.88]ᵀ and λ₂ = 0.13, u₂ = [−0.88  0.47]ᵀ

Page 10: Dimensionality Reduction


SVD

For general rectangular matrices: an N×n matrix (N vectors, n dimensions)
- groups similar entities (documents) together
- groups similar terms together; each group of terms corresponds to a concept

Given an N×n matrix A, write it as A = UΛVᵀ
- U: N×r column orthogonal matrix (r: rank of A)
- Λ: r×r diagonal matrix (non-negative values in descending order)
- V: n×r column orthogonal matrix (so Vᵀ is r×n)

Page 11: Dimensionality Reduction


SVD (cont'd)

A = λ₁u₁v₁ᵀ + λ₂u₂v₂ᵀ + … + λᵣuᵣvᵣᵀ

u_i, v_i are the column vectors of U, V

SVD identifies rectangular blobs of related values in A; the rank r of A is the number of blobs

Page 12: Dimensionality Reduction


Example

Two types of documents: CS and Medical

Two concepts (groups of terms):
- CS: data, information, retrieval
- Medical: brain, lung

Term/Document   data   information   retrieval   brain   lung
CS-TR1             1         1            1          0      0
CS-TR2             2         2            2          0      0
CS-TR3             1         1            1          0      0
CS-TR4             5         5            5          0      0
MED-TR1            0         0            0          2      2
MED-TR2            0         0            0          3      3
MED-TR3            0         0            0          1      1

Page 13: Dimensionality Reduction


Example (cont'd)

A = UΛVᵀ with rank r = 2:

    [ 0.18  0    ]
    [ 0.36  0    ]
    [ 0.18  0    ]        [ 9.64  0    ]         [ 0.58  0.58  0.58  0     0    ]
U = [ 0.90  0    ]    Λ = [ 0     5.29 ]    Vᵀ = [ 0     0     0     0.71  0.71 ]
    [ 0     0.53 ]
    [ 0     0.80 ]
    [ 0     0.27 ]

U: document-to-concept similarity matrix
V: term-to-concept similarity matrix

v₁₂ = 0: "data" has 0 similarity with the 2nd concept
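The decomposition can be reproduced with NumPy (a verification sketch, not part of the original slides; signs of the singular vectors may differ):

```python
import numpy as np

# the term/document matrix of the previous slide (rows: documents)
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))          # numerical rank: 2
print(s[:r])                        # [9.64 5.29]
print(U[:, :r])                     # document-to-concept similarities
print(Vt[:r])                       # term-to-concept similarities
print(np.allclose(A, U[:, :r] * s[:r] @ Vt[:r]))   # True
```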

Page 14: Dimensionality Reduction


SVD and LSI

SVD leads to “Latent Semantic Indexing” (http://lsi.research.telcordia.com/lsi/LSIpapers.html)

Terms that occur together are grouped into concepts

When a user searches for a term, the system determines the relevant concepts to search

LSI maps concepts to vectors in the concept space instead of the n-dim. document space

The concept space is a lower-dimensionality space

Page 15: Dimensionality Reduction


Examples of Queries

Find documents with the term "data"

Translate the query vector q to concept space:

q = [1  0  0  0  0]ᵀ (only the term "data")

q_c = Vᵀq = [0.58  0]ᵀ

The query is related to the CS concept and unrelated to the medical concept

LSI returns docs that also contain the terms "retrieval" and "information", which are not specified by the query
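In NumPy the translation is a single matrix-vector product (an illustrative sketch; the variable names are mine):

```python
import numpy as np

# Vᵀ from the SVD of the term/document matrix (2 concepts, 5 terms)
Vt = np.array([[0.58, 0.58, 0.58, 0.00, 0.00],   # CS concept
               [0.00, 0.00, 0.00, 0.71, 0.71]])  # Medical concept

q = np.array([1.0, 0, 0, 0, 0])   # query: only the term "data"
qc = Vt @ q                       # query in concept space
print(qc)                         # [0.58 0.] -> CS concept only
```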

Page 16: Dimensionality Reduction


FastMap

Works with distances and has two roles:

1. Map objects to vectors so that their distances are preserved (then apply SAMs for indexing)

2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible

Page 17: Dimensionality Reduction


Main idea

Pretend that objects are points in some unknown n-dimensional space
- project these points on k mutually orthogonal axes
- compute the projections using distances only

The heart of FastMap is the method that projects two objects on a line:
- take 2 objects which are far apart (the pivots)
- project all objects on the line that connects the pivots

Page 18: Dimensionality Reduction


Project Objects on a Line

Oa, Ob: pivots; Oi: any object

d_ij: shorthand for D(Oi, Oj)

x_i: first coordinate of Oi in the k-dimensional space; if Oi is close to Oa, x_i is small

Apply the cosine law:

d_bi² = d_ai² + d_ab² − 2·x_i·d_ab

Solving for the first coordinate:

x_i = (d_ai² + d_ab² − d_bi²) / (2·d_ab)
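As code, the projection is a one-liner (a sketch; the distances can come from any distance function):

```python
def project_on_line(d_ai, d_bi, d_ab):
    """First FastMap coordinate of O_i, from the cosine law:
    x_i = (d_ai² + d_ab² − d_bi²) / (2·d_ab)."""
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# sanity check with collinear points: O_a = 0, O_i = 3, O_b = 10
print(project_on_line(d_ai=3.0, d_bi=7.0, d_ab=10.0))   # 3.0
```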

Page 19: Dimensionality Reduction


Choose Pivots

Complexity: O(N); the optimal pivot choice would require O(N²) time

Heuristic: (1) start from an arbitrary object; (2) let Ob be the object farthest from it; (3) let Oa be the object farthest from Ob
- steps 2, 3 can be repeated 4-5 times to improve the accuracy of selection

Page 20: Dimensionality Reduction


Extension for Many Dimensions

Consider the (n−1)-dimensional hyperplane H that is perpendicular to the line (Oa, Ob)

Project the objects on H and apply the previous step:
- choose two new pivots
- the new x_i is the next coordinate of each object
- repeat this step until k-dim vectors are obtained

The distance on H is not D; D′: distance between the projected objects

Page 21: Dimensionality Reduction


Distance on the Hyper-Plane H

D′ on H can be computed from the Pythagorean theorem:

D′(Oi′, Oj′)² = D(Oi, Oj)² − (x_i − x_j)²

The ability to compute D′ allows for computing a second line on H, etc.

Page 22: Dimensionality Reduction


Algorithm
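The slide shows the algorithm as a figure; the following Python sketch reconstructs it from the preceding slides (function and variable names are mine, not from the original):

```python
import numpy as np

def choose_pivots(D2, rounds=5):
    """Pivot heuristic: start anywhere, repeatedly jump to the farthest
    object (steps 2, 3 of the previous slide, repeated a few times)."""
    a, b = 0, 0
    for _ in range(rounds):
        a, b = b, int(np.argmax(D2[b]))   # farthest object from b
    return a, b

def fastmap(D, k):
    """Map N objects to k-dim vectors, given only the NxN distance
    matrix D, preserving distances as much as possible."""
    N = D.shape[0]
    X = np.zeros((N, k))
    D2 = D.astype(float) ** 2             # work with squared distances
    for col in range(k):
        a, b = choose_pivots(D2)
        d_ab = np.sqrt(D2[a, b])
        if d_ab == 0:                     # all remaining distances are 0
            break
        # project every object on the pivot line (cosine law)
        x = (D2[a] + D2[a, b] - D2[b]) / (2.0 * d_ab)
        X[:, col] = x
        # squared distances on the hyperplane H (Pythagorean theorem)
        D2 = D2 - (x[:, None] - x[None, :]) ** 2
        D2 = np.maximum(D2, 0.0)          # turn negative distances to 0
    return X

# usage: recover a 2-dim embedding from pairwise Euclidean distances
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(fastmap(D, k=2))
```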

Page 23: Dimensionality Reduction


Observations

Complexity: O(kN) distance calculations
- k: desired dimensionality
- k recursive calls, each takes O(N)

The algorithm records the pivots of each call (dimension) to facilitate queries:
- the query is mapped to a k-dimensional vector by projecting it on the pivot line of each dimension
- O(1) computation per step: no need to recompute pivots

Page 24: Dimensionality Reduction


Observations (cont'd)

The projected vectors can be indexed
- mapping on 2-3 dimensions allows for visualization of the data space

Assumes a Euclidean space (triangle rules); not always true (at least after the second step)

The pivots are approximations
- some computed distances may become negative
- turn negative distances to 0

Page 25: Dimensionality Reduction


Application: Document Vectors

Document similarity: cos(θ), the cosine of the angle θ between the two document vectors; then

distance(d1, d2) = 2·sin(θ/2) = √(2·(1 − cos θ)) = √(2·(1 − similarity(d1, d2)))
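A direct transcription of the formula (a sketch; similarity is the cosine of the angle between the document vectors):

```python
import math

def doc_distance(similarity):
    """distance(d1, d2) = 2·sin(θ/2) = sqrt(2·(1 − cos θ)),
    with cos θ = similarity(d1, d2)."""
    return math.sqrt(2.0 * (1.0 - similarity))

print(doc_distance(1.0))   # identical documents  -> 0.0
print(doc_distance(0.0))   # orthogonal documents -> 1.414...
```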

Page 26: Dimensionality Reduction


[Figure: FastMap on 10 documents for 2 & 3 dims; (a) k = 2 and (b) k = 3]

Page 27: Dimensionality Reduction


References

C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996

W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988

LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html

C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. ACM SIGMOD, 1995