Page 1: SVD, SVD applications to LSA, non-negative matrix ...


SVD, SVD applications to LSA, non-negative matrix factorizations

Presented By: Sumedha Singla

Singular Value Decomposition (SVD)

Page 2: SVD, SVD applications to LSA, non-negative matrix ...


Singular Value Decomposition (SVD)

SVD of a matrix X:

X_{n×d} = U_{n×n} Σ_{n×d} V^T_{d×d}   or   X_{n×d} = U_{n×k} Σ_{k×k} V^T_{k×d}

• X: a set of n points in ℝ^d with rank k
• U: left singular vectors of X
• V: right singular vectors of X
• Σ: rectangular diagonal matrix with positive real entries

X = [u_1 … u_k … u_n] Σ [v_1 … v_k … v_d]^T, where Σ carries σ_1, …, σ_k on its diagonal.

X = U Σ V^T = u_1 σ_1 v_1^T + … + u_k σ_k v_k^T = ∑_{i=1}^{k} u_i σ_i v_i^T

Singular Value Decomposition (SVD)

SVD of a matrix X: X v_i = σ_i u_i

• SVD finds an orthogonal basis for the row space that gets transformed into an orthogonal basis for the column space.
• The columns of V and U are bases for the row and column spaces, respectively.
• U and V are orthonormal square matrices, i.e. V V^T = V^T V = I and U U^T = U^T U = I.
• Usually, U ≠ V.
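A quick numerical check of these identities (a minimal NumPy sketch; the 5 × 3 matrix below is an arbitrary example, not one from the slides):

```python
import numpy as np

# An arbitrary 5 x 3 example matrix (not taken from the slides).
X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 4.],
              [2., 2., 2.],
              [1., 0., 3.]])

# Thin SVD: U is 5 x 3, s holds the singular values, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# X v_i = sigma_i u_i for every singular triplet.
for i in range(len(s)):
    assert np.allclose(X @ V[:, i], s[i] * U[:, i])

# Columns of U and V are orthonormal: U^T U = V^T V = I.
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(V.T @ V, np.eye(3))

# X is the sum of rank-1 terms u_i sigma_i v_i^T.
X_rebuilt = sum(s[i] * np.outer(U[:, i], V[:, i]) for i in range(len(s)))
assert np.allclose(X, X_rebuilt)
print("all identities verified")
```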

Page 3: SVD, SVD applications to LSA, non-negative matrix ...


Motivation

• Goal: find the best k-dimensional subspace w.r.t. X (project X to ℝ^k, where k < d).
  • Minimize the sum of the squares of the perpendicular distances of the points to the subspace.

• Consider a set of 2-d points X_{n×2}, x_i ∈ ℝ², 1 ≤ i ≤ n.
  • Goal: find the best-fitting line through the origin w.r.t. X. Here, k = 1.
  • Best least-squares fit:
    • minimize ∑_i α_i² (perpendicular distances), or
    • maximize ∑_i β_i², i.e. the projections of x_i on the subspace.

• v: a unit vector in the direction of the best-fitting line through the origin w.r.t. X.
• β_i = |x_i · v|
• Best least-squares fit: maximize ∑_i β_i² = ‖X v‖².
• First singular vector: v_1 = argmax_{‖v‖=1} ‖X v‖
• First singular value: σ_1 = ‖X v_1‖
• Greedy approach for subsequent singular vectors: take the best-fit line perpendicular to v_1:
  v_2 = argmax_{v ⊥ v_1, ‖v‖=1} ‖X v‖

Singular Vectors: X v_i = σ_i u_i
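The argmax above can be computed numerically; a minimal sketch using power iteration on X^T X (one standard approach, not prescribed by the slide; the data matrix is an arbitrary random example):

```python
import numpy as np

def first_singular_direction(X, iters=500, seed=0):
    """Approximate v1 = argmax_{|v|=1} |X v| by power iteration on X^T X."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = X.T @ (X @ v)              # one power-iteration step on X^T X
        v = w / np.linalg.norm(w)
    return v, np.linalg.norm(X @ v)    # (v1, sigma1 = |X v1|)

X = np.random.default_rng(1).normal(size=(200, 5))
v1, sigma1 = first_singular_direction(X)
print(sigma1, np.linalg.svd(X, compute_uv=False)[0])   # the two values should agree
```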

Page 4: SVD, SVD applications to LSA, non-negative matrix ...


Intuitive Interpretation

A composition of three geometrical transformations: a rotation or reflection, a scaling, and another rotation or reflection.

X = U Σ V^T

• Consider a unit circle: x′ · x′ = 1.
• It can be turned into an ellipse of any size and orientation by stretching and rotating it.
• Consider 2-d points and fit an ellipse with major axis a and minor axis b to them.
• Consider

  S = [a 0; 0 b],   R = [cos θ  sin θ; −sin θ  cos θ]

• Any point can be transformed as x′ = x R S⁻¹.
• The equation of the unit circle becomes (S⁻¹ R^T x^T) · (x R S⁻¹) = 1.

Intuitive Interpretation   X = U Σ V^T

Page 5: SVD, SVD applications to LSA, non-negative matrix ...


• Resulting matrix equation:

  S⁻¹ R^T X^T X R S⁻¹ = 1

• If we regard X as a collection of points, then:
  • the singular values give the lengths of the axes of a least-squares fitted ellipsoid,
  • V is the orientation of the ellipsoid,
  • the matrix U gives the projection of each of the points in X onto the axes.

Intuitive Interpretation   X = U Σ V^T
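A small numerical illustration of this reading, on synthetic mean-centered 2-d points (an assumed example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 6
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Synthetic 2-d cloud: stretched along one axis, then rotated and centered.
pts = rng.normal(size=(1000, 2)) @ np.diag([3.0, 1.0]) @ R
pts -= pts.mean(axis=0)

U, s, Vt = np.linalg.svd(pts, full_matrices=False)
print(Vt)                          # rows of V^T: orientation of the fitted ellipse axes
print(s / np.sqrt(len(pts) - 1))   # ~ [3, 1]: relative lengths of the two axes
```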

• Natural Language Processing
• Documents with 2 concepts:
  • Computer Science (CS)
  • Medical Documents (MD)

SVD Example   X_{n×d} = U_{n×k} Σ_{k×k} V^T_{k×d}

• Term-Document Matrix X: one row per document, one column per term.
• Document-Concept Similarity Matrix U: one row per document, one column per concept.
• Concept Strength Matrix Σ: one row per concept.
• Term-Concept Matrix V^T: one row per concept, one column per term.

Page 6: SVD, SVD applications to LSA, non-negative matrix ...


Eigenvectors

• An eigenvector of a square matrix X is a nonzero vector v such that multiplication by X alters only the scale of v:

  X v = λ v

• λ: eigenvalue
• v: unit eigenvector

Eigenvalue Decomposition

X = V diag(λ) V⁻¹, where
• eigenvector matrix V = [v_1, …, v_n]
• diagonal matrix diag(λ) with eigenvalues λ_1, …, λ_n

More general form: X = Q Λ Q^T

Eigenvalue decomposition   X = U Σ V^T

• Eigenvalue decomposition: X = Q Λ Q^T
• For the SVD to coincide with it, X needs:
  • orthonormal eigenvectors, to allow U = V = Q,
  • eigenvalues λ ≥ 0, so that Λ = Σ.
• Hence, X must be a positive semi-definite (or definite) symmetric matrix.
• For such matrices, the eigenvalue decomposition is a special case of the SVD.

When are the singular values the same as the eigenvalues?

X = U Σ V^T

Page 7: SVD, SVD applications to LSA, non-negative matrix ...


Rather than solving for U, V, and Σ simultaneously, we multiply both sides by X^T = V Σ^T U^T:

X^T X = (U Σ V^T)^T (U Σ V^T)
      = V Σ^T U^T U Σ V^T
      = V Σ^T Σ V^T
      = V Σ² V^T

This has the form of an eigenvalue decomposition X = Q Λ Q^T:

• V: the eigenvectors of X^T X.
• Σ^T Σ: the eigenvalue matrix of X^T X, so σ_i = √λ_i.
• U: the eigenvectors of X X^T.

Calculating the SVD using the eigenvalue decomposition   X = U Σ V^T

We know that

u_i^T u_j = (X v_i / σ_i)^T (X v_j / σ_j)

u_i^T u_j = v_i^T X^T X v_j / (σ_i σ_j) = (σ_j² / (σ_i σ_j)) v_i^T v_j = 0   for i ≠ j,

since X^T X v_j = σ_j² v_j and v_i^T v_j = 0 for i ≠ j.

• U: the orthonormal eigenvectors of X X^T.

We can thus write

X X^T U = U Σ²

SVD and eigenvalue decomposition   X v_i = σ_i u_i
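A minimal sketch of this recipe, assuming X has full column rank so that every σ_i > 0 (the test matrix is an arbitrary random example):

```python
import numpy as np

def svd_via_eig(X):
    """Recover a thin SVD of X from the eigendecomposition of X^T X
    (assumes X has full column rank so every sigma_i > 0)."""
    evals, V = np.linalg.eigh(X.T @ X)     # X^T X = V diag(lambda) V^T
    order = np.argsort(evals)[::-1]        # sort eigenvalues in descending order
    evals, V = evals[order], V[:, order]
    sigma = np.sqrt(evals)                 # sigma_i = sqrt(lambda_i)
    U = (X @ V) / sigma                    # u_i = X v_i / sigma_i
    return U, sigma, V.T

X = np.random.default_rng(0).normal(size=(6, 3))
U, s, Vt = svd_via_eig(X)
print(np.allclose(U @ np.diag(s) @ Vt, X))                  # reconstructs X
print(np.allclose(s, np.linalg.svd(X, compute_uv=False)))   # matches np.linalg.svd
```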

Page 8: SVD, SVD applications to LSA, non-negative matrix ...


• Consider X = [4 4; −3 3].

• Compute X^T X = [4 −3; 4 3] [4 4; −3 3] = [25 7; 7 25].

• Orthonormal eigenvectors of X^T X:
  v_1 = [1/√2, 1/√2]^T and v_2 = [1/√2, −1/√2]^T

• Eigenvalues of X^T X: σ_1² = 32 and σ_2² = 18.

• We have X = U Σ V^T with

  Σ = [4√2 0; 0 3√2],   V^T = [1/√2 1/√2; 1/√2 −1/√2],

  and U still to be determined (next slide).

Example SVD

• Consider X = [4 4; −3 3].

• Compute X X^T = [4 4; −3 3] [4 −3; 4 3] = [32 0; 0 18].

• Orthonormal eigenvectors of X X^T:
  u_1 = [1, 0]^T and u_2 = [0, −1]^T

• We have

  [4 4; −3 3] = [1 0; 0 −1] [4√2 0; 0 3√2] [1/√2 1/√2; 1/√2 −1/√2]

Example SVD
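The same example checked numerically (note that NumPy may flip the signs of individual singular vectors):

```python
import numpy as np

# Numerically check the worked example.
X = np.array([[4., 4.], [-3., 3.]])
U, s, Vt = np.linalg.svd(X)

print(np.round(s, 4))                        # [5.6569, 4.2426] = [4*sqrt(2), 3*sqrt(2)]
print(np.allclose(s**2, [32., 18.]))         # eigenvalues of X^T X (and of X X^T)
print(np.allclose(U @ np.diag(s) @ Vt, X))   # X = U Sigma V^T
```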

Page 9: SVD, SVD applications to LSA, non-negative matrix ...


• An eigendecomposition is valid only for square matrices; any matrix (even a rectangular one) has an SVD.

• In the eigendecomposition X = Q Λ Q⁻¹, the eigenbasis Q is not always orthogonal; the basis of singular vectors is always orthogonal.

• In the SVD we have two singular spaces (right and left).

• Computing the SVD of a matrix is more numerically stable.

SVD vs Eigendecomposition

• The covariance matrix of X (with mean-centered columns) is given by

  Cov = X^T X / (n − 1)

• The eigenvalue decomposition of the Cov matrix:

  Cov = Q Λ Q^T

  where
  • Q is the matrix of eigenvectors of Cov, i.e. the principal axes of X,
  • Λ is a diagonal matrix with the eigenvalues λ_i in decreasing order on the diagonal.

SVD and PCA   X = U Σ V^T

Page 10: SVD, SVD applications to LSA, non-negative matrix ...


• We can rewrite the covariance matrix of X as

  Cov = X^T X / (n − 1) = V Σ U^T U Σ V^T / (n − 1) = V (Σ² / (n − 1)) V^T

• The right singular vectors V are the principal axes.
• λ_i = σ_i² / (n − 1)
• X V = U Σ V^T V = U Σ
• The columns of U Σ are the principal components.

SVD and PCA   X = U Σ V^T
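A small numerical check of these relations, assuming the columns of X have been mean-centered as PCA requires (the data are an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
X -= X.mean(axis=0)                    # PCA assumes mean-centered columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
n = X.shape[0]

cov = X.T @ X / (n - 1)
evals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues of Cov in decreasing order

print(np.allclose(evals, s**2 / (n - 1)))   # lambda_i = sigma_i^2 / (n - 1)
print(np.allclose(X @ Vt.T, U * s))         # principal components: X V = U Sigma
```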

• Input data: X_{n×d}.
• Goal: reduce the dimensionality to k, where k < d.
• Select the first k columns of U, and the k × k upper-left part of Σ.
• Construct P = U_k Σ_{k×k}.
• P is the required n × k matrix containing the first k PCs.

SVD for dimensionality reduction   X = U Σ V^T
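A minimal sketch of this recipe as a reusable helper (it assumes X is already mean-centered; the name svd_reduce is illustrative only):

```python
import numpy as np

def svd_reduce(X, k):
    """Return P = U_k Sigma_k, the n x k matrix of the first k principal components
    (assumes X has already been mean-centered)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

X = np.random.default_rng(0).normal(size=(100, 10))
X -= X.mean(axis=0)
P = svd_reduce(X, k=3)
print(P.shape)   # (100, 3)
```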

Page 11: SVD, SVD applications to LSA, non-negative matrix ...


The best approximation of X by a rank-deficient matrix is obtained from the top singular values and vectors of X:

X_k = ∑_{i=1}^{k} u_i σ_i v_i^T

Then

min_{B ∈ ℝ^{n×d}, rank(B) ≤ k} ‖X − B‖_2 = ‖X − X_k‖_2 = σ_{k+1}

• σ_{k+1} is the largest singular value of X − X_k.
• X_k is the best rank-k 2-norm approximation of X.

Rank-k approximation in the spectral norm

X = ∑_{i=1}^{d} u_i σ_i v_i^T
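A quick numerical check of this bound (an arbitrary random matrix, k = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # X_k = sum_{i=1..k} u_i sigma_i v_i^T

# The spectral-norm error of the best rank-k approximation is sigma_{k+1}.
print(np.isclose(np.linalg.norm(X - Xk, ord=2), s[k]))   # True
```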

• Determining range, null space and rank (also numerical rank).
• Matrix approximation.
• Inverse and pseudo-inverse:
  • If X = U Σ V^T and Σ is full rank, then X⁻¹ = V Σ⁻¹ U^T.
  • If Σ is singular, then the pseudo-inverse is given by X† = V Σ† U^T, where Σ† is formed by replacing every nonzero entry by its reciprocal.
• Least squares:
  • If we need to solve A x = b in the least-squares sense, then x_LS = V Σ† U^T b.
• Denoising: small singular values typically correspond to noise.

Applications of SVD
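A minimal sketch of the least-squares recipe via the pseudo-inverse, checked against NumPy's lstsq (A and b are arbitrary examples):

```python
import numpy as np

# Least squares via the SVD pseudo-inverse: x_LS = V Sigma^+ U^T b.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
b = rng.normal(size=6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_pinv = np.where(s > 1e-12, 1.0 / s, 0.0)    # Sigma^+: reciprocal of nonzero entries
x_ls = Vt.T @ (s_pinv * (U.T @ b))

print(np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches lstsq
```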

Page 12: SVD, SVD applications to LSA, non-negative matrix ...


• Input matrix: term-document matrix.
  • Rows: represent words.
  • Columns: represent documents.
  • Value: the count of the word in the document.
• Example:

Latent Semantic Analysis using SVD

X =
[ 1 1 1 0 0 0
  0 0 0 0 1 0
  0 1 0 0 0 0
  1 1 1 1 1 1
  0 1 0 0 1 0 ]

• Consider X, the term-document matrix.
• Then:
  • U is the SVD term matrix,
  • V is the SVD document matrix.
• SVD provides a low-rank approximation of X.
• Constrained optimization problem:
  • Goal: represent X as X_k with a low Frobenius norm for the error X − X_k.

Latent Semantic Indexing (LSI)   X_{n×d} = U_{n×k} Σ_{k×k} V^T_{k×d}

Page 13: SVD, SVD applications to LSA, non-negative matrix ...


Latent Semantic Indexing (LSI)   X_{n×d} = U_{n×k} Σ_{k×k} V^T_{k×d}

k = 2

• We can get rid of the zero-valued columns and rows and keep a 2 × 2 concept-strength matrix.
• We can get rid of the zero-valued columns and keep a 5 × 2 term-to-concept similarity matrix.
• We can get rid of the zero-valued columns and keep a 2 × 6 concept-to-document similarity matrix.
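A minimal sketch of this k = 2 truncation on the example term-document matrix from the previous page (rows as terms, matching the Page 12 layout):

```python
import numpy as np

# The example term-document matrix from the previous page (5 terms x 6 documents).
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 1, 0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k  = U[:, :k]          # 5 x 2 term-to-concept similarity matrix
S_k  = np.diag(s[:k])    # 2 x 2 concept-strength matrix
Vt_k = Vt[:k, :]         # 2 x 6 concept-to-document similarity matrix

X_k = U_k @ S_k @ Vt_k   # rank-2 (k = 2) approximation of X
print(np.round(X_k, 2))
```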

Page 14: SVD, SVD applications to LSA, non-negative matrix ...


Latent Semantic Indexing (LSI)

[Figure: a query Q over the terms cosmonaut, astronaut, moon, car, truck is compared with document d2. In the original term space cos(Q, d2) = 0; in the reduced latent semantic space cos(Q, d2) ≈ 0.88.]

We see that the query is not related to document 2 in the original space, but in the latent semantic space they become highly related.

• SVD allows words and documents to be mapped into the same "latent semantic space".
• LSI projects queries and documents into a space with latent semantic dimensions.
  • Co-occurring words are projected onto the same dimensions.
  • Non-co-occurring words are projected onto different dimensions.
• LSI captures similarities between words.
  • For example, we want to project "car" and "automobile" onto the same dimension.
• Dimensions of the reduced semantic space correspond to the axes of greatest variation in the original space.

Latent Semantic Indexing (LSI)
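To compare a new query with documents in the reduced space, the standard LSI recipe folds the query in as q_hat = Σ_k⁻¹ U_k^T q; a minimal sketch (the folding step is not spelled out on the slides, and the one-term query below is hypothetical):

```python
import numpy as np

# Same 5 x 6 term-document matrix as above (rows = terms, columns = documents).
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 1, 0, 0, 1, 0]], dtype=float)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2

def fold_in(q):
    """q_hat = Sigma_k^{-1} U_k^T q: map a term-count query into the latent space."""
    return (U[:, :k].T @ q) / s[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([0., 0., 1., 0., 0.])      # a hypothetical single-term query
doc_latent = Vt[:k, :]                  # column j: document j in the latent space
sims = [cosine(fold_in(q), doc_latent[:, j]) for j in range(X.shape[1])]
print(np.round(sims, 2))                # query-document similarities after reduction
```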

Page 15: SVD, SVD applications to LSA, non-negative matrix ...


• Extract information from the link structure of a hyperlinked environment to rank pages relevant to a topic.
• Essentials:
  • Authorities
  • Hubs
• Goal: identify good authorities and hubs for a topic.
• Each page receives two scores:
  • Authority score A(p): estimates the value of the content on the page.
  • Hub score H(p): estimates the value of the links on the page.

Kleinberg's Algorithm: Hyperlink-Induced Topic Search (HITS), aka "hubs and authorities"

• For a topic, authorities are relevant nodes which are referred to by many hubs (high in-degree).
• For a topic, hubs are nodes which connect to many related authorities for that topic (high out-degree).

Authorities and Hubs

Page 16: SVD, SVD applications to LSA, non-negative matrix ...


• Three steps:
  1. Create a focused base set of the Web.
     • Start with a root set.
     • Add any page pointed to by a page in the root set.
     • Add any page that points to a page in the root set (at most d).
     • The extended root set becomes our base set.
  2. Iteratively compute hub and authority scores.
     • A(p): sum of H(q) for all q pointing to p.
     • H(q): sum of A(p) for all p that q points to.
     • Start with all scores equal to 1, and iterate until convergence.
  3. Filter out the top hubs and authorities.

HITS (cont.)

• G (the base set) is a directed graph with web pages as nodes and their links as edges.
• G can be represented as a connectivity matrix A:
  • A(i, j) = 1 only if the i-th page points to the j-th page.
• Authority weights can be represented as a unit vector a:
  • a_i: the authority weight of the i-th page.
• Hub weights can be represented as a unit vector h:
  • h_i: the hub weight of the i-th page.

Matrix Notation

Page 17: SVD, SVD applications to LSA, non-negative matrix ...


• Updating authority weights: a = A^T h
• Updating hub weights: h = A a
• After one iteration:
  a_1 = A^T h_0, h_1 = A a_1
  → h_1 = A A^T h_0; after k iterations, h_k = (A A^T)^k h_0
• Convergence:
  • a_k converges to the principal eigenvector of A^T A.
  • h_k converges to the principal eigenvector of A A^T.

Algorithm
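A minimal sketch of the iteration in this matrix form (the 4-page link graph below is a hypothetical example):

```python
import numpy as np

def hits(A, iters=100):
    """HITS power iteration on connectivity matrix A (A[i, j] = 1 if page i links to page j)."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                    # authority update: a = A^T h
        a /= np.linalg.norm(a)
        h = A @ a                      # hub update: h = A a
        h /= np.linalg.norm(h)
    return a, h

# A hypothetical 4-page graph: pages 0 and 1 both link to pages 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
a, h = hits(A)
print(np.round(a, 3))   # authority weight concentrates on pages 2 and 3
print(np.round(h, 3))   # hub weight concentrates on pages 0 and 1
```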

Given A ∈ ℝ₊^{n×d} and a desired rank k ≪ min(n, d), find W ∈ ℝ₊^{n×k} and H ∈ ℝ₊^{k×d} such that A ≈ W H.

• min_{W≥0, H≥0} ‖A − W H‖_F
• Nonconvex.
• W and H are not unique (e.g. W′ = W D ≥ 0, H′ = D⁻¹ H ≥ 0 for a suitable invertible D).

Notation: ℝ₊ denotes the nonnegative real numbers.

Nonnegative Matrix Factorization (NMF)
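A minimal sketch of one way to compute such a factorization, Lee-Seung multiplicative updates (the slides do not prescribe an algorithm; the data matrix is an arbitrary nonnegative example):

```python
import numpy as np

def nmf(A, k, iters=500, eps=1e-10, seed=0):
    """Approximate A (nonnegative) as W @ H with W, H >= 0 by minimizing ||A - WH||_F,
    using Lee-Seung multiplicative updates (one standard NMF algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    W = rng.random((n, k))
    H = rng.random((k, d))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H; stays nonnegative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W; stays nonnegative
    return W, H

A = np.abs(np.random.default_rng(1).normal(size=(20, 10)))   # nonnegative data matrix
W, H = nmf(A, k=3)
print(np.linalg.norm(A - W @ H))   # Frobenius-norm reconstruction error
```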

Page 18: SVD, SVD applications to LSA, non-negative matrix ...


• SVD gives A = U Σ V^T.
• Then, for the rank-k truncation, ‖A − U_k Σ_k V_k^T‖_F ≤ min_{W≥0, H≥0} ‖A − W H‖_F.
• Then WHY NMF?
• NMF can work better because of its non-negativity constraints. Example:
  • Text mining (A is represented as counts, so it is nonnegative).

Nonnegative Matrix Factorization (NMF)
