STA141C: Big Data & High Performance Statistical Computing
Lecture 5: Numerical Linear Algebra
Cho-Jui Hsieh, UC Davis
April 20, 2017
Linear Algebra Background
Vectors
A vector has a direction and a “magnitude” (norm)
Example (2-norm):
x = [1, 2]^T, ‖x‖_2 = √(1² + 2²) = √5
Properties satisfied by a vector norm ‖ · ‖:
‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0 (positivity)
‖αx‖ = |α| ‖x‖ (homogeneity)
‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality)
Examples of vector norms
x = [x_1, x_2, · · · , x_n]^T
‖x‖_2 = √(|x_1|² + |x_2|² + · · · + |x_n|²) (2-norm)
‖x‖_1 = |x_1| + |x_2| + · · · + |x_n| (1-norm)
‖x‖_p = (|x_1|^p + |x_2|^p + · · · + |x_n|^p)^(1/p) (p-norm)
‖x‖_∞ = max_{1≤i≤n} |x_i| (∞-norm)
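All of these norms are available through numpy.linalg.norm; a quick numerical check of the example above (a sketch, not part of the original slides):

import numpy as np

x = np.array([1.0, 2.0])
print(np.linalg.norm(x, 2))       # 2-norm: sqrt(1^2 + 2^2) = sqrt(5)
print(np.linalg.norm(x, 1))       # 1-norm: |1| + |2| = 3
print(np.linalg.norm(x, np.inf))  # inf-norm: max(|1|, |2|) = 2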
Distances
x = [1, 2], y = [2, 1]
x − y = [1 − 2, 2 − 1] = [−1, 1]
Distance:
‖x − y‖_2 = ‖[−1, 1]‖_2 = √2, ‖x − y‖_1 = 2, ‖x − y‖_∞ = 1
Metrics
A metric d(x, y) must satisfy the following properties:
d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Is d(x, y) = ‖x − y‖ a valid metric if ‖ · ‖ is a norm?
Yes:
d(x, z) = ‖x − z‖ = ‖(x − y) + (y − z)‖ ≤ ‖x − y‖ + ‖y − z‖ = d(x, y) + d(y, z)
Inner Product between Vectors
x = [x_1, · · · , x_n]^T, y = [y_1, · · · , y_n]^T
Inner product:
x^T y = x_1 y_1 + x_2 y_2 + · · · + x_n y_n = Σ_{i=1}^n x_i y_i
x^T x = ‖x‖_2², ‖x − y‖_2² = (x − y)^T (x − y)
Orthogonal:
x ⊥ y ⇔ x^T y = 0
(x and y are orthogonal to each other)
Projection onto a vector
x^T y = ‖x‖_2 ‖y‖_2 cos(θ)
cos θ = x^T y / (‖x‖_2 ‖y‖_2) ⇒ ‖x‖_2 cos θ = x^T (y/‖y‖_2) = x^T ŷ, where ŷ = y/‖y‖_2 is the unit vector along y
Projection of x onto y:
(‖x‖_2 cos θ) ŷ = (ŷ ŷ^T) x, where ŷ ŷ^T is the projection matrix
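As a small illustration (hypothetical vectors, using numpy), the inner product and projection formulas above can be checked directly:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(x @ y)                     # inner product x^T y = 3
y_hat = y / np.linalg.norm(y)    # unit vector along y
P = np.outer(y_hat, y_hat)       # projection matrix y_hat y_hat^T
print(P @ x)                     # [1. 0.], the projection of x onto y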
Linearly independent
Suppose we have 3 vectors x_1, x_2, x_3
x_1 = α_2 x_2 + α_3 x_3 ⇒ x_1 is linearly dependent on x_2 and x_3
When are x_1, · · · , x_n linearly independent?
α_1 x_1 + α_2 x_2 + · · · + α_n x_n = 0 if and only if α_1 = α_2 = · · · = α_n = 0
A vector space is a set of vectors that is closed under vector addition & scalar multiplication:
if x_1, x_2 ∈ V, then α_1 x_1 + α_2 x_2 ∈ V
A basis of a vector space is a maximal set of vectors in the space that are linearly independent of each other.
An orthogonal basis is a basis whose vectors are mutually orthogonal.
Dimension of the vector space: the number of vectors in a basis.
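One way to check linear independence numerically (a sketch using numpy.linalg.matrix_rank): stack the vectors as columns and compare the rank with the number of vectors.

import numpy as np

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2                    # deliberately dependent on x1, x2

X = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(X))     # 2 < 3 columns => linearly dependent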
Matrices
An m-by-n matrix A ∈ R^{m×n}:

    [ a_11 a_12 · · · a_1n ]
A = [  ...  ...  ...   ... ]
    [ a_m1 a_m2 · · · a_mn ]
A matrix is also a linear transform:
A : R^n → R^m
x ↦ Ax
αx + βy ↦ A(αx + βy) = αAx + βAy (linear transform)
Matrix Norms
Popular matrix norm: Frobenius norm
‖A‖_F = √(Σ_{i=1}^m Σ_{j=1}^n |a_ij|²)
Matrix norms satisfy the following properties:
‖A‖ ≥ 0, and ‖A‖ = 0 if and only if A = 0 (positivity)
‖αA‖ = |α| ‖A‖ (homogeneity)
‖A + B‖ ≤ ‖A‖ + ‖B‖ (triangle inequality)
Induced Norm
Given a vector norm ‖ · ‖, we can define the corresponding induced norm or operator norm by
‖A‖ = sup_x {‖Ax‖ : ‖x‖ = 1} = sup_{x≠0} ‖Ax‖/‖x‖
Induced p-norm:
‖A‖_p = sup_{x≠0} ‖Ax‖_p / ‖x‖_p
Examples:
‖A‖_2 = sup_{x≠0} ‖Ax‖_2/‖x‖_2 = σ_max(A) (the induced 2-norm is the maximum singular value)
‖A‖_1 = max_j Σ_{i=1}^m |a_ij| (maximum absolute column sum)
‖A‖_∞ = max_i Σ_{j=1}^n |a_ij| (maximum absolute row sum)
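All of these matrix norms are available via numpy.linalg.norm; a quick check on a small hypothetical matrix (not from the original slides):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))     # Frobenius: sqrt(1 + 4 + 9 + 16)
print(np.linalg.norm(A, 2))         # induced 2-norm: sigma_max(A)
print(np.linalg.norm(A, 1))         # max column sum: max(4, 6) = 6
print(np.linalg.norm(A, np.inf))    # max row sum: max(3, 7) = 7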
Rank of a matrix
Column rank of A: the dimension of the column space (the vector space spanned by the column vectors)
Row rank of A: the dimension of row space
Column rank = row rank := rank (always true)
Examples:
Rank-2 matrix:
[ 1  0  1]   [ 1  0]
[-2 -3  1] = [-2 -3] [1 0  1]
[ 3  3  0]   [ 3  3] [0 1 -1]
Rank-1 matrix:
[ 1  1  0  2]   [ 1]
[-1 -1  0 -2] = [-1] [1 1 0 2]
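The rank-2 example can be verified numerically (a quick check with numpy):

import numpy as np

A = np.array([[1, 0, 1], [-2, -3, 1], [3, 3, 0]])
B = np.array([[1, 0], [-2, -3], [3, 3]])
C = np.array([[1, 0, 1], [0, 1, -1]])
print(np.linalg.matrix_rank(A))     # 2
print(np.allclose(A, B @ C))        # True: A factors as B C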
QR decomposition
Matrix Decomposition: A = WH
QR Decomposition
Any m-by-n matrix A can be factorized into
A = QR
Q is an m-by-m unitary matrix (Q^T Q = I); its columns form an orthogonal basis for R^m.
R is an m-by-n upper triangular matrix.
If n < m, then the decomposition can be written as
A = [Q_1 Q_2] [R_1; 0] = Q_1 R_1,
where Q_1 ∈ R^{m×n} and R_1 ∈ R^{n×n}.
This is the “economic” form of QR decomposition.
(Figures: shapes of the factors in the full QR vs. the economic QR.)
Computing the QR decomposition
Given A ∈ R^{m×n}, compute one column of Q and R at a time (the Gram-Schmidt process):
Column 1: a_1 = q_1 R_11 with ‖q_1‖ = 1
⇒ q_1 = a_1/‖a_1‖, R_11 = ‖a_1‖
Column 2: a_2 = q_1 R_12 + q_2 R_22 with ‖q_2‖ = 1, q_2^T q_1 = 0
R_12 = q_1^T a_2
R_22 q_2 = a_2 − q_1 R_12 := u_2 ⇒ q_2 = u_2/‖u_2‖, R_22 = ‖u_2‖
Column 3: a_3 = q_1 R_13 + q_2 R_23 + q_3 R_33 with ‖q_3‖ = 1, q_3^T q_1 = 0, q_3^T q_2 = 0
R_13 = q_1^T a_3
R_23 = q_2^T a_3
R_33 q_3 = a_3 − q_1 R_13 − q_2 R_23 := u_3 ⇒ q_3 = u_3/‖u_3‖, R_33 = ‖u_3‖
(continue in the same way for the remaining columns)
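A minimal sketch of this column-by-column (classical Gram-Schmidt) procedure in Python; library routines use more numerically stable variants (e.g., Householder reflections), so treat this as illustration only:

import numpy as np

def gram_schmidt_qr(A):
    # Economic QR of a full-column-rank A via classical Gram-Schmidt.
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        u = A[:, j].astype(float)
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # R_ij = q_i^T a_j
            u -= R[i, j] * Q[:, i]        # remove components along q_1..q_{j-1}
        R[j, j] = np.linalg.norm(u)       # R_jj = ||u_j||
        Q[:, j] = u / R[j, j]             # q_j = u_j / ||u_j||
    return Q, R

A = np.random.rand(5, 3)
Q, R = gram_schmidt_qr(A)
print(np.allclose(A, Q @ R))              # True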
Time Complexity
O(mn²) time (assuming n ≤ m)
Very fast (the constant factor in the time complexity is very small)
Can be used to
Find an orthogonal basis for the column space of a matrix
Compute the rank of a matrix
Pre-process data for solving linear systems
QR Decomposition in python
Full QR
>>> from scipy import random, linalg
>>> A = random.rand(4, 2)
>>> q,r = linalg.qr(A)
>>> q.shape
(4, 4)
>>> r.shape
(4, 2)
Economic QR
>>> q,r = linalg.qr(A, mode='economic')
>>> q.shape
(4, 2)
>>> r.shape
(2, 2)
Singular Value Decomposition
Low-rank Approximation
Big matrix A: we want to approximate it by a "simple" matrix Ā
The goodness of the approximation can be measured by
‖A − Ā‖
(e.g., using ‖ · ‖_F)
Low-rank approximation: A ≈ BC
Important Question
Given A and k, what is the best rank-k approximation?
min_{B,C of rank k} ‖A − BC‖_F
The minimizing B and C are obtained from the SVD (Singular Value Decomposition) of A
Singular Value Decomposition
Any real matrix A ∈ R^{m×n} has the following Singular Value Decomposition (SVD):
A = U Σ V^T
U is an m×m unitary matrix (U^T U = I); the columns of U are orthogonal
V is an n×n unitary matrix (V^T V = I); the columns of V are orthogonal
Σ is an m×n rectangular diagonal matrix with non-negative real numbers on the diagonal
Usually, we assume the diagonal entries are arranged in descending order:
Σ = diag(σ_1, σ_2, · · · , σ_n), σ_1 ≥ σ_2 ≥ · · · ≥ σ_n
Singular Value Decomposition
u_1, u_2, · · · , u_m: left singular vectors, a basis for the column space
u_i^T u_j = 0 ∀ i ≠ j, u_i^T u_i = 1 ∀ i = 1, · · · , m
v_1, v_2, · · · , v_n: right singular vectors, a basis for the row space
v_i^T v_j = 0 ∀ i ≠ j, v_i^T v_i = 1 ∀ i = 1, · · · , n
The SVD is unique (up to permutations and rotations among equal singular values)
Linear Transformations
Matrix A is a linear transformation
For the right singular vectors v_1, · · · , v_n, what are the vectors after the transformation?
A = UΣV^T ⇒ AV = UΣV^T V = UΣ
Therefore, Av_i = σ_i u_i, i = 1, · · · , n
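This relation is easy to verify numerically (a quick check on a random matrix; numpy returns V^T, whose rows are the v_i):

import numpy as np

A = np.random.rand(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for i in range(3):
    print(np.allclose(A @ Vt[i], s[i] * U[:, i]))   # True: A v_i = sigma_i u_i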
Linear Transformations
For the left singular vectors, we can derive a similar relation using A^T:
A^T = VΣ^T U^T ⇒ A^T U = VΣ^T
Therefore, A^T u_i = σ_i v_i for i = 1, · · · , n (and A^T u_i = 0 for i > n)
Geometry of SVD
Given an input vector x, what does the linear transform Ax look like?
Assume A = UΣV^T is the SVD; then Ax is computed as follows:
Step 1: Represent x as
x = a_1 v_1 + · · · + a_n v_n, (a_i = v_i^T x)
Step 2: Transform v_i → σ_i u_i for all i:
Ax = a_1 σ_1 u_1 + · · · + a_n σ_n u_n
Reduced SVD (Thin SVD)
If m > n, then only the first n left singular vectors u_1, · · · , u_n are useful:
Therefore, when m > n, we often just use the following "thin SVD":
A = UΣV^T where U ∈ R^{m×n} and Σ, V ∈ R^{n×n}:
U, V have orthonormal columns, and Σ is diagonal
Best Rank-k approximation
Given a matrix A, how to form the best rank k approximation?
arg min_{X : rank(X)=k} ‖X − A‖ or arg min_{W ∈ R^{m×k}, H ∈ R^{n×k}} ‖A − WH^T‖
Solution: the truncated SVD:
A ≈ U_k Σ_k V_k^T,
where U_k, V_k contain the first k columns of the SVD matrices and Σ_k the top k singular values
Why? It keeps the k most important components of the SVD
Application: Data Approximation/Compression
Given a data matrix A ∈ Rm×n
Approximate the data matrix by the best rank-k approximation:
A ≈ WH^T
Compress the data: O(mn) memory ⇒ O(mk + nk) memory
De-noising (sometimes)
Example: if each column of A is a data point, then w_i (the i-th column of W) is a "dictionary" element, and H contains the coefficients:
a_i ≈ Σ_{j=1}^k H_ij w_j
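A sketch of this compression with a truncated SVD (random data; k = 2 chosen arbitrarily); the Frobenius error equals the discarded singular values, which can be checked directly:

import numpy as np

A = np.random.rand(100, 50)
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :k] * s[:k]                    # m x k factor (dictionary scaled by sigma)
H = Vt[:k].T                            # n x k coefficients
A_k = W @ H.T                           # best rank-k approximation
print(np.linalg.norm(A - A_k, 'fro'))   # equals sqrt of the sum of discarded sigma_i^2
print(np.sqrt(np.sum(s[k:] ** 2)))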
Eigenvalue Decomposition
Eigen Decomposition
For an n-by-n matrix H, if
Hy = λy,
then we say:
λ is an eigenvalue of H
y is the corresponding eigenvector
Eigen Decomposition
Consider A ∈ R^{n×n} to be a square, symmetric matrix. The eigenvalue decomposition of A is:
A = VΛV^T, with V^T V = I (V is unitary) and Λ diagonal
A = VΛV^T ⇒ AV = VΛ ⇒ Av_i = λ_i v_i, ∀ i = 1, · · · , n
Each v_i is an eigenvector, and each λ_i is an eigenvalue
Usually, we assume the diagonal entries are arranged in descending order:
Λ = diag(λ_1, λ_2, · · · , λ_n), λ_1 ≥ λ_2 ≥ · · · ≥ λ_n
The eigenvalue decomposition is unique when the n eigenvalues are distinct.
Eigen Decomposition
Each eigenvector v_i is mapped to Av_i = λ_i v_i by the linear transform:
scaling without changing the direction of the eigenvectors
Ax = Σ_{i=1}^n λ_i v_i (v_i^T x)
Project x onto the eigenvectors, then scale each component
Eigen Decomposition
How to compute SVD/eigen decomposition?
Eigen Decomposition & SVD
SVD can be transformed into an eigenvalue decomposition!
Given A ∈ R^{m×n} with SVD A = UΣV^T
We have
AA^T = UΣV^T VΣU^T = UΣ²U^T (the eigen decomposition of AA^T)
After getting U, Σ, we can compute V by
V^T = Σ^{-1} U^T A
Similarly, A^T A = VΣ²V^T is the eigen decomposition of A^T A
After getting V, Σ, we can compute U = AVΣ^{-1}
Eigen Decomposition & SVD
Which one should we use: the eigenvalue decomposition of AA^T or of A^T A?
It depends on the dimensions m vs. n (work with the smaller of the two matrices)
If A ∈ R^{m×n} and m > n:
Step 1: Compute the eigenvalue decomposition of A^T A = VΣ²V^T (an n×n problem)
Step 2: Compute U = AVΣ^{-1}
If A ∈ R^{m×n} and m < n:
Step 1: Compute the eigenvalue decomposition of AA^T = UΣ²U^T (an m×m problem)
Step 2: Compute V = A^T UΣ^{-1}
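A sketch of the m > n case in Python (assumes A has full column rank, so Σ is invertible; scipy.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

import numpy as np
from scipy import linalg

A = np.random.rand(100, 5)                 # m > n
lam, V = linalg.eigh(A.T @ A)              # Step 1: eig of the n x n matrix A^T A
idx = np.argsort(lam)[::-1]                # sort eigenpairs in descending order
lam, V = lam[idx], V[:, idx]
sigma = np.sqrt(lam)                       # singular values
U = (A @ V) / sigma                        # Step 2: U = A V Sigma^{-1}
print(np.allclose(A, (U * sigma) @ V.T))   # True: recovers A = U Sigma V^T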
Computing Eigenvalue Decomposition
If A is a dense matrix:
Call scipy.linalg.eig
If A is a (large) sparse matrix: usually we only compute the top-k eigenvectors with a small k
Power method
Lanczos algorithm (implemented in built-in packages)
Call scipy.sparse.linalg.eigs
Dense Eigenvalue Decomposition
Only for dense matrices
Will compute all the eigenvalues and eigenvectors
>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[1, 2], [3, 4]])
>>> S, V = linalg.eig(A)
>>> S
array([-0.37228132+0.j, 5.37228132+0.j])
>>> V
array([[-0.82456484, -0.41597356],
[ 0.56576746, -0.90937671]])
Dense SVD
Only for dense matrices
Specify "full_matrices=False" for the thin SVD
>>> A = np.array([[1,2],[3,4], [5,6]])
>>> U, S, V = linalg.svd(A)
>>> U
array([[-0.2298477 , 0.88346102, 0.40824829],
[-0.52474482, 0.24078249, -0.81649658],
[-0.81964194, -0.40189603, 0.40824829]])
>>> S
array([ 9.52551809, 0.51430058])
>>> U, S, V = linalg.svd(A, full_matrices=False)
>>> U
array([[-0.2298477 , 0.88346102],
[-0.52474482, 0.24078249],
[-0.81964194, -0.40189603]])
Sparse Eigenvalue Decomposition
Usually we only compute the top-k eigenvalues and eigenvectors (since the full U, V would be dense)
Use iterative methods to compute U_k (the top-k eigenvectors):
Initialize U_k by an n-by-k random matrix U_k^(0)
Iteratively update U_k:
U_k^(0) → U_k^(1) → U_k^(2) → · · ·
Converge to the SVD solution:
lim_{t→∞} U_k^(t) = U_k
Power Iteration for top eigenvector
Main idea: compute A^T v/‖A^T v‖ for a large power T (here A^T denotes T repeated applications of A, not the transpose).
A^T v/‖A^T v‖ → top eigenvector as T → ∞
Power Iteration for top eigenvector
Input: A ∈ R^{d×d}, number of iterations T
Initialize a random vector v ∈ R^d
For t = 1, 2, · · · , T:
  v ← Av
  v ← v/‖v‖
Output v (top eigenvector)
The top eigenvalue is v^T Av (= v^T λv = λ‖v‖² = λ)
Only one matrix-vector product is needed at each iteration
Time complexity: O(nnz(A)) per iteration
If A = XX^T, then Av = X(X^T v)
No need to explicitly form A
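A direct implementation of the iteration above (a sketch; assumes the top eigenvalue is strictly dominant in magnitude):

import numpy as np

def power_iteration(A, T=100):
    v = np.random.rand(A.shape[0])    # random init: u_1^T v != 0 with probability 1
    for _ in range(T):
        v = A @ v                     # one matrix-vector product per iteration
        v = v / np.linalg.norm(v)     # renormalize
    return v

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
v = power_iteration(A)
print(v @ A @ v)                          # Rayleigh quotient ~ 3.0 (top eigenvalue)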
Power Iteration
If A = UΛU^T is the eigenvalue decomposition of A,
what is the eigenvalue decomposition of A^t?
A^t = (UΛU^T)(UΛU^T) · · · (UΛU^T) = U diag(λ_1^t, λ_2^t, · · · , λ_d^t) U^T
A^t v = Σ_{i=1}^d λ_i^t (u_i^T v) u_i ∝ u_1 + (λ_2/λ_1)^t (u_2^T v / u_1^T v) u_2 + · · · + (λ_d/λ_1)^t (u_d^T v / u_1^T v) u_d
If v is a random vector, then u_1^T v ≠ 0 with probability 1, so
A^t v/‖A^t v‖ → u_1 as t → ∞
Power iteration for top k eigenvectors
Power Iteration for top k eigenvectors
Input: A ∈ R^{d×d}, number of iterations T, rank k
Initialize a random matrix V ∈ R^{d×k}
For t = 1, 2, · · · , T:
  V ← AV
  [Q, R] ← qr(V)
  V ← Q
S ← V^T AV
[U, S] ← eig(S)
Output: eigenvectors VU and eigenvalues S
This is sometimes faster than the default functions (scipy.sparse.linalg.eigs and svds), since those aim to compute very accurate solutions.
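A sketch of this top-k iteration for a symmetric A (using scipy.linalg.qr for the re-orthogonalization step, as in the QR examples earlier; eigh returns eigenvalues in ascending order):

import numpy as np
from scipy import linalg

def subspace_iteration(A, k, T=100):
    V = np.random.rand(A.shape[0], k)             # random d x k start
    for _ in range(T):
        V, _ = linalg.qr(A @ V, mode='economic')  # V <- AV, re-orthogonalized
    S = V.T @ A @ V                               # small k x k problem
    lam, U = linalg.eigh(S)
    return V @ U, lam                             # eigenvectors, eigenvalues

A = np.diag([5.0, 4.0, 1.0, 0.5])
vecs, vals = subspace_iteration(A, k=2)
print(vals)                                       # ~[4., 5.] (ascending order)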
Coming up
Numerical Linear Algebra for Machine Learning
Questions?