Dimensionality Reduction
Posted on 29-Jan-2016
E.G.M. Petrakis Dimensionality Reduction 1
Dimensionality Reduction
Given N vectors in n dimensions, find the k most important axes to project them on; k is user defined (k < n).
Applications: information retrieval and indexing: identify the k most important features, or reduce indexing dimensions for faster retrieval (low-dimensional indices are faster).
Techniques
Eigenvalue analysis techniques [NR'92]: the Karhunen-Loeve (K-L) transform and Singular Value Decomposition (SVD); both need O(N^2) time.
FastMap [Faloutsos & Lin 95]: dimensionality reduction and mapping of objects to vectors in O(N) time.
Mathematical Preliminaries
For an n x n square matrix S, a unit vector x, and a scalar value λ with Sx = λx: x is an eigenvector of S and λ is an eigenvalue of S.
The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real.
Rank r of a matrix: the maximum number of independent columns or rows.
Example 1
Intuition: S defines an affine transform y = Sx that involves scaling and rotation.
Eigenvectors: unit vectors along the new directions; eigenvalues denote the scaling.

S = | 2  1 |
    | 1  3 |

λ1 = 3.62, u1 = [0.52, 0.85]^T (eigenvector of the major axis)
λ2 = 1.38, u2 = [0.85, -0.52]^T
Example 2
If S is real and symmetric (S = S^T), then it can be written as S = U Λ U^T:
the columns of U are the eigenvectors of S; U is column orthogonal (U U^T = I); Λ is diagonal with the eigenvalues of S.

S = | 2  1 | = | 0.52   0.85 | | 3.62  0    | | 0.52   0.85 |
    | 1  3 |   | 0.85  -0.52 | | 0     1.38 | | 0.85  -0.52 |
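The decomposition of the example matrix S = [[2, 1], [1, 3]] can be checked numerically; a minimal sketch with numpy:

```python
import numpy as np

# Symmetric matrix from the example
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, U = np.linalg.eigh(S)

# S = U @ diag(eigvals) @ U.T reconstructs the original matrix
S_rec = U @ np.diag(eigvals) @ U.T

print(np.round(eigvals, 2))    # eigenvalues approx 1.38 and 3.62
print(np.allclose(S, S_rec))   # True
```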
Karhunen-Loeve (K-L)
Project onto a k-dimensional space (k < n), minimizing the error of the projections (sum of squared differences).
K-L gives a linear combination of axes sorted by importance; keep the first k dimensions.
[Figure: 2-dim points and the 2 K-L directions; for k = 1 keep x']
E.G.M. Petrakis Dimensionality Reduction 7
Computation of K-L
Put the N vectors as rows in A = [a_ij].
Compute B = [a_ij - ā_j], where ā_j = (1/N) Σ_{i=1..N} a_ij is the average of column j.
Covariance matrix: C = B^T B.
Compute the eigenvectors of C.
Sort them in decreasing eigenvalue order.
Approximate each object by its projections on the directions of the first k eigenvectors.
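The steps above can be sketched in Python with numpy (a minimal illustration, not an optimized implementation; the function name is mine):

```python
import numpy as np

def kl_transform(A, k):
    """Project the rows of A onto the first k K-L (principal) directions."""
    # B: subtract the column means (shift the origin to the center of gravity)
    B = A - A.mean(axis=0)
    # Covariance matrix C = B^T B (attribute-to-attribute similarity)
    C = B.T @ B
    # Eigenvectors of C, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    U_k = eigvecs[:, order[:k]]
    # Approximate each object by its projections on the first k eigenvectors
    return B @ U_k

# The example from the next slide: vectors [1 2], [1 1], [0 0]
A = np.array([[1.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
print(kl_transform(A, 1))   # 1-D coordinates along the main K-L direction
```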
Intuition
B shifts the origin to the center of gravity of the vectors and has zero column mean.
C represents attribute-to-attribute similarity; C is square, real, and symmetric.
Eigenvectors and eigenvalues are computed on C, not on A.
C denotes the affine transform that minimizes the error.
Approximate each vector by its projections along the first k eigenvectors.
Example
Input vectors: [1 2], [1 1], [0 0]. The column averages are 2/3 and 1.

A = | 1  2 |    B = |  1/3   1 |    C = B^T B = | 2/3  1 |
    | 1  1 |        |  1/3   0 |                | 1    2 |
    | 0  0 |        | -2/3  -1 |

λ1 = 2.53, u1 = [0.47, 0.88]^T
λ2 = 0.13, u2 = [-0.88, 0.47]^T
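The numbers in this example can be reproduced with numpy; a quick check:

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
B = A - A.mean(axis=0)     # column averages are 2/3 and 1
C = B.T @ B                # covariance matrix, C = [[2/3, 1], [1, 2]]

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
print(C)
print(eigvals)             # approx 0.13 and 2.53
```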
SVD
For general rectangular matrices: an N x n matrix (N vectors, n dimensions).
Groups similar entities (documents) together; groups similar terms together, and each group of terms corresponds to a concept.
Given an N x n matrix A, write it as A = U Λ V^T:
U: N x r column-orthogonal matrix (r: rank of A)
Λ: r x r diagonal matrix (non-negative values, in descending order)
V: n x r column-orthogonal matrix
SVD (cont'd)
A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T
u_i, v_i are the column vectors of U, V. SVD identifies rectangular blobs of related values in A. The rank r of A is the number of blobs.
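The rank-1 sum above can be verified with numpy; a small sketch on a synthetic rank-2 matrix (the sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a rank-2 matrix: N=6 "documents", n=4 "terms"
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))    # numerical rank: the number of "blobs"

# A equals the sum of the r rank-1 terms  s_i * u_i v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
print(r)                      # 2
print(np.allclose(A, A_rec))  # True
```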
Example
Two types of documents (CS and Medical) and two concepts (groups of terms):
CS: data, information, retrieval
Medical: brain, lung

Term/Document  data  information  retrieval  brain  lung
CS-TR1          1         1           1        0      0
CS-TR2          2         2           2        0      0
CS-TR3          1         1           1        0      0
CS-TR4          5         5           5        0      0
MED-TR1         0         0           0        2      2
MED-TR2         0         0           0        3      3
MED-TR3         0         0           0        1      1
Example (cont'd)

A = U Λ V^T with rank r = 2:

U = | 0.18  0    |    Λ = | 9.64  0    |    V^T = | 0.58  0.58  0.58  0     0    |
    | 0.36  0    |        | 0     5.29 |          | 0     0     0     0.71  0.71 |
    | 0.18  0    |
    | 0.90  0    |
    | 0     0.53 |
    | 0     0.80 |
    | 0     0.27 |

U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
v12 = 0: "data" has 0 similarity with the 2nd concept
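A quick numerical check of this example with numpy (signs of singular vectors may flip, so the check uses absolute values):

```python
import numpy as np

# The term-document matrix from the example
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))            # [9.64 5.29]
print(np.abs(np.round(Vt[0], 2)))    # 0.58 on the three CS terms, 0 elsewhere
```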
SVD and LSI
SVD leads to “Latent Semantic Indexing” (http://lsi.research.telcordia.com/lsi/LSIpapers.html)
Terms that occur together are grouped into concepts
When a user searches for a term, the system determines the relevant concepts to search
LSI maps documents and queries to vectors in the concept space instead of the n-dimensional term space
The concept space has lower dimensionality
Examples of Queries
Find documents with the term "data": translate the query vector q to concept space.
The query is related to the CS concept and unrelated to the medical concept.
LSI returns documents that also contain the terms "retrieval" and "information", which are not specified by the query.
q = [1 0 0 0 0]^T

q_c = V^T q = | 0.58  0.58  0.58  0     0    | q = | 0.58 |
              | 0     0     0     0.71  0.71 |     | 0    |
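Mapping the query to concept space is a single matrix-vector product; sketched with the example's V^T:

```python
import numpy as np

Vt = np.array([[0.58, 0.58, 0.58, 0.00, 0.00],
               [0.00, 0.00, 0.00, 0.71, 0.71]])

# Query: the single term "data" (first coordinate of the term space)
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

q_c = Vt @ q     # query in concept space
print(q_c)       # 0.58 for the CS concept, 0 for the medical one
```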
FastMap
Works with distances and has two roles:
1. Maps objects to vectors so that their distances are preserved (then apply SAMs for indexing).
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible.
Main idea
Pretend that the objects are points in some unknown n-dimensional space; project these points on k mutually orthogonal axes; compute the projections using distances only.
The heart of FastMap is the method that projects two objects on a line: take 2 objects which are far apart (the pivots) and project all objects on the line that connects the pivots.
Project Objects on a Line
Oa, Ob: pivots; Oi: any object.
d_ij: shorthand for D(Oi, Oj).
x_i: the first coordinate in the k-dimensional space.
If Oi is close to Oa, x_i is small.

Apply the cosine law:
d_bi^2 = d_ai^2 + d_ab^2 - 2 x_i d_ab
=> x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
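The cosine-law projection is a one-liner; a sketch, with a sanity check on collinear points where the projection must return the true coordinate:

```python
def project_on_pivot_line(d_ai, d_bi, d_ab):
    """First coordinate of O_i on the line through pivots O_a, O_b."""
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# Points on a line: O_a at 0, O_b at 10, O_i at 3
print(project_on_pivot_line(3.0, 7.0, 10.0))   # 3.0
```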
Choose Pivots
Heuristic: 1. pick an arbitrary object Ob; 2. let Oa be the object farthest from Ob; 3. let Ob be the object farthest from Oa.
Complexity: O(N); the optimal (exact farthest-pair) algorithm would require O(N^2) time.
Steps 2, 3 can be repeated 4-5 times to improve the accuracy of the selection.
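The pivot-selection heuristic can be sketched as follows (a linear-time approximation of the farthest pair; `dist` is any caller-supplied distance function, and the names are mine):

```python
def choose_pivots(objects, dist, iterations=5):
    """Heuristically pick two far-apart objects, O(N) distances per pass."""
    b = objects[0]                                   # arbitrary starting object
    a = b
    for _ in range(iterations):                      # repeat to improve accuracy
        a = max(objects, key=lambda o: dist(o, b))   # farthest from b
        b = max(objects, key=lambda o: dist(o, a))   # farthest from a
    return a, b

# 1-D example: the heuristic finds the two extremes
pts = [0.0, 2.0, 9.0, 4.0, 7.0]
print(choose_pivots(pts, lambda x, y: abs(x - y)))   # (9.0, 0.0)
```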
Extension for Many Dimensions
Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line OaOb.
Project the objects on H and apply the previous step: choose two new pivots; the new x_i is the next object coordinate; repeat until k-dimensional vectors are obtained.
The distance on H is not D; D' denotes the distance between the projected objects.
Distance on the Hyper-Plane H
D’ on H can be computed from the Pythagorean theorem
The ability to compute D’ allows for computing a second line on H etc.
22 )(2)()(' jijiji xxOODOOD
Pythagorean theorem:
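Computing D' needs only the original distance and the coordinates already found; a sketch (the squared distance is clipped at zero, as the Observations slide suggests for non-Euclidean inputs):

```python
import math

def dist_on_hyperplane(D_ij, x_i, x_j):
    """Distance between the projections of O_i, O_j on the hyperplane H."""
    sq = D_ij**2 - (x_i - x_j)**2
    return math.sqrt(max(sq, 0.0))   # clip: non-Euclidean data can make sq negative

# 2-D check: O_i=(3,4), O_j=(0,0); x-coordinates 3 and 0; D=5 -> D'=4
print(dist_on_hyperplane(5.0, 3.0, 0.0))   # 4.0
```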
Algorithm
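The algorithm itself appears as a figure in the original slides; below is a minimal recursive sketch in Python, assuming only a caller-supplied distance function (function and variable names are mine):

```python
import math

def fastmap(objects, dist, k):
    """Map each object to a k-dim vector using pairwise distances only."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d(i, j, col):
        # Distance on the current hyperplane: the original distance with the
        # Pythagorean correction for the coordinates found so far.
        sq = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            sq -= (coords[i][c] - coords[j][c]) ** 2
        return math.sqrt(max(sq, 0.0))   # clip: non-Euclidean data can go negative

    for col in range(k):
        # Pivot heuristic: start anywhere, repeatedly jump to the farthest object.
        a = 0
        for _ in range(5):
            b = max(range(n), key=lambda i: d(a, i, col))
            a = max(range(n), key=lambda i: d(b, i, col))
        d_ab = d(a, b, col)
        if d_ab == 0.0:
            break                        # all remaining projected distances are zero
        # Cosine-law projection on the pivot line gives the next coordinate.
        for i in range(n):
            coords[i][col] = (d(a, i, col)**2 + d_ab**2 - d(b, i, col)**2) / (2.0 * d_ab)
    return coords

# 2-D Euclidean points: with k = 2 the pairwise distances are preserved
pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
out = fastmap(pts, math.dist, 2)
```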
Observations
Complexity: O(kN) distance calculations; k is the desired dimensionality; there are k recursive calls, each taking O(N).
The algorithm records the pivots of each call (dimension) to facilitate queries: a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension.
O(1) computation per step: no need to recompute the pivots.
Observations (cont'd)
The projected vectors can be indexed; mapping on 2-3 dimensions allows for visualization of the data space.
Assumes a Euclidean space (triangle inequality), which is not always true (at least after the second step) because the pivots are approximated: some squared distances become negative; turn negative distances to 0.
Application: Document Vectors
For document vectors d1, d2 at angle θ with cosine similarity(d1, d2):
distance(d1, d2) = 2 sin(θ/2) = sqrt(2 (1 - cos θ)) = sqrt(2 (1 - similarity(d1, d2)))
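The identity above follows from the law of cosines on two unit vectors; a quick numeric check:

```python
import math

def doc_distance(similarity):
    """Distance between two unit document vectors with the given cosine similarity."""
    return math.sqrt(2.0 * (1.0 - similarity))

# Equivalently 2*sin(theta/2): check for theta = 60 degrees
theta = math.pi / 3
print(math.isclose(doc_distance(math.cos(theta)), 2 * math.sin(theta / 2)))  # True
```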
[Figure: FastMap on 10 documents for 2 and 3 dims: (a) k = 2 and (b) k = 3]
References
C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996.
W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988.
LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
C. Faloutsos, K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. of SIGMOD, 1995.