STA141C: Big Data & High Performance Statistical Computing
Lecture 5: Numerical Linear Algebra
Cho-Jui Hsieh, UC Davis
April 20, 2017
Linear Algebra Background
Vectors
A vector has a direction and a “magnitude” (norm)
Example (2-norm):
x = [1, 2]^T, ‖x‖_2 = √(1² + 2²) = √5
Properties satisfied by a vector norm ‖ · ‖:
‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0 (positivity)
‖αx‖ = |α| ‖x‖ (homogeneity)
‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality)
Examples of vector norms
x = [x_1, x_2, · · · , x_n]^T
‖x‖_2 = √(|x_1|² + |x_2|² + · · · + |x_n|²) (2-norm)
‖x‖_1 = |x_1| + |x_2| + · · · + |x_n| (1-norm)
‖x‖_p = (|x_1|^p + |x_2|^p + · · · + |x_n|^p)^(1/p) (p-norm)
‖x‖_∞ = max_{1≤i≤n} |x_i| (∞-norm)
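All of these norms are available through numpy.linalg.norm; a quick numerical check of the example above (a sketch, not part of the original slides):

import numpy as np

x = np.array([1.0, 2.0])
print(np.linalg.norm(x, 2))       # 2-norm: sqrt(1^2 + 2^2) = sqrt(5)
print(np.linalg.norm(x, 1))       # 1-norm: |1| + |2| = 3
print(np.linalg.norm(x, np.inf))  # inf-norm: max(|1|, |2|) = 2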
Distances
x = [1, 2], y = [2, 1]
x − y = [1 − 2, 2 − 1] = [−1, 1]
Distance:
‖x − y‖_2 = ‖[−1, 1]‖_2 = √2, ‖x − y‖_1 = 2, ‖x − y‖_∞ = 1
Metrics
A metric d(x, y) must satisfy the following properties:
d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Is d(x, y) = ‖x − y‖ a valid metric if ‖ · ‖ is a norm?
Yes:
d(x, z) = ‖x − z‖ = ‖(x − y) + (y − z)‖ ≤ ‖x − y‖ + ‖y − z‖ = d(x, y) + d(y, z)
Inner Product between Vectors
x = [x_1, · · · , x_n]^T, y = [y_1, · · · , y_n]^T
Inner product:
x^T y = x_1 y_1 + x_2 y_2 + · · · + x_n y_n = Σ_{i=1}^n x_i y_i
x^T x = ‖x‖_2², ‖x − y‖_2² = (x − y)^T (x − y)
Orthogonal:
x ⊥ y ⇔ x^T y = 0
(x and y are orthogonal to each other)
Projection onto a vector
x^T y = ‖x‖_2 ‖y‖_2 cos(θ)
cos θ = x^T y / (‖x‖_2 ‖y‖_2) ⇒ ‖x‖_2 cos θ = x^T (y/‖y‖_2) = x^T ŷ, where ŷ = y/‖y‖_2 is the unit vector along y
Projection of x onto y:
(‖x‖_2 cos θ) ŷ = (ŷ ŷ^T) x, where ŷ ŷ^T is the projection matrix
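As a small illustration (hypothetical vectors, using numpy), the inner product and projection formulas above can be checked directly:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(x @ y)                     # inner product x^T y = 3
y_hat = y / np.linalg.norm(y)    # unit vector along y
P = np.outer(y_hat, y_hat)       # projection matrix y_hat y_hat^T
print(P @ x)                     # [1. 0.], the projection of x onto y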
Linearly independent
Suppose we have 3 vectors x_1, x_2, x_3
x_1 = α_2 x_2 + α_3 x_3 ⇒ x_1 is linearly dependent on x_2 and x_3
When are x_1, · · · , x_n linearly independent?
α_1 x_1 + α_2 x_2 + · · · + α_n x_n = 0 if and only if α_1 = α_2 = · · · = α_n = 0
A vector space is a set of vectors that is closed under vector addition & scalar multiplication:
if x_1, x_2 ∈ V, then α_1 x_1 + α_2 x_2 ∈ V
A basis of a vector space is a maximal set of vectors in the space that are linearly independent of each other.
An orthogonal basis is a basis whose vectors are mutually orthogonal.
Dimension of the vector space: the number of vectors in a basis.
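One way to check linear independence numerically (a sketch using numpy.linalg.matrix_rank): stack the vectors as columns and compare the rank with the number of vectors.

import numpy as np

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2                    # deliberately dependent on x1, x2

X = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(X))     # 2 < 3 columns => linearly dependent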
Matrices
An m-by-n matrix A ∈ R^{m×n}:

    [ a_11 a_12 · · · a_1n ]
A = [  ...  ...  ...   ... ]
    [ a_m1 a_m2 · · · a_mn ]
A matrix is also a linear transform:
A : R^n → R^m
x ↦ Ax
αx + βy ↦ A(αx + βy) = αAx + βAy (linear transform)
Matrix Norms
Popular matrix norm: Frobenius norm
‖A‖_F = √(Σ_{i=1}^m Σ_{j=1}^n |a_ij|²)
Matrix norms satisfy the following properties:
‖A‖ ≥ 0, and ‖A‖ = 0 if and only if A = 0 (positivity)
‖αA‖ = |α| ‖A‖ (homogeneity)
‖A + B‖ ≤ ‖A‖ + ‖B‖ (triangle inequality)
Induced Norm
Given a vector norm ‖ · ‖, we can define the corresponding induced norm or operator norm by
‖A‖ = sup_x {‖Ax‖ : ‖x‖ = 1} = sup_{x≠0} ‖Ax‖/‖x‖
Induced p-norm:
‖A‖_p = sup_{x≠0} ‖Ax‖_p / ‖x‖_p
Examples:
‖A‖_2 = sup_{x≠0} ‖Ax‖_2/‖x‖_2 = σ_max(A) (the induced 2-norm is the maximum singular value)
‖A‖_1 = max_j Σ_{i=1}^m |a_ij| (maximum absolute column sum)
‖A‖_∞ = max_i Σ_{j=1}^n |a_ij| (maximum absolute row sum)
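All of these matrix norms are available via numpy.linalg.norm; a quick check on a small hypothetical matrix (not from the original slides):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))     # Frobenius: sqrt(1 + 4 + 9 + 16)
print(np.linalg.norm(A, 2))         # induced 2-norm: sigma_max(A)
print(np.linalg.norm(A, 1))         # max column sum: max(4, 6) = 6
print(np.linalg.norm(A, np.inf))    # max row sum: max(3, 7) = 7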
Rank of a matrix
Column rank of A: the dimension of the column space (the vector space spanned by the column vectors)
Row rank of A: the dimension of row space
Column rank = row rank := rank (always true)
Examples:
Rank-2 matrix:
[ 1  0  1]   [ 1  0]
[-2 -3  1] = [-2 -3] [1 0  1]
[ 3  3  0]   [ 3  3] [0 1 -1]
Rank-1 matrix:
[ 1  1  0  2]   [ 1]
[-1 -1  0 -2] = [-1] [1 1 0 2]
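The rank-2 example can be verified numerically (a quick check with numpy):

import numpy as np

A = np.array([[1, 0, 1], [-2, -3, 1], [3, 3, 0]])
B = np.array([[1, 0], [-2, -3], [3, 3]])
C = np.array([[1, 0, 1], [0, 1, -1]])
print(np.linalg.matrix_rank(A))     # 2
print(np.allclose(A, B @ C))        # True: A factors as B C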
QR decomposition
Matrix Decomposition: A = WH
QR Decomposition
Any m-by-n matrix A can be factorized into
A = QR
Q is an m-by-m unitary matrix (Q^T Q = I); its columns form an orthogonal basis for R^m.
R is an m-by-n upper triangular matrix.
If n < m, then the decomposition can be written as
A = [Q_1 Q_2] [R_1; 0] = Q_1 R_1,
where Q_1 ∈ R^{m×n} and R_1 ∈ R^{n×n}.
This is the “economic” form of QR decomposition.
(Figures: shapes of the factors in the full QR vs. the economic QR.)
Computing the QR decomposition
Given A ∈ R^{m×n}, compute one column of Q and R at a time (the Gram-Schmidt process):
Column 1: a_1 = q_1 R_11 with ‖q_1‖ = 1
⇒ q_1 = a_1/‖a_1‖, R_11 = ‖a_1‖
Column 2: a_2 = q_1 R_12 + q_2 R_22 with ‖q_2‖ = 1, q_2^T q_1 = 0
R_12 = q_1^T a_2
R_22 q_2 = a_2 − q_1 R_12 := u_2 ⇒ q_2 = u_2/‖u_2‖, R_22 = ‖u_2‖
Column 3: a_3 = q_1 R_13 + q_2 R_23 + q_3 R_33 with ‖q_3‖ = 1, q_3^T q_1 = 0, q_3^T q_2 = 0
R_13 = q_1^T a_3
R_23 = q_2^T a_3
R_33 q_3 = a_3 − q_1 R_13 − q_2 R_23 := u_3 ⇒ q_3 = u_3/‖u_3‖, R_33 = ‖u_3‖
(continue in the same way for the remaining columns)
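A minimal sketch of this column-by-column (classical Gram-Schmidt) procedure in Python; library routines use more numerically stable variants (e.g., Householder reflections), so treat this as illustration only:

import numpy as np

def gram_schmidt_qr(A):
    # Economic QR of a full-column-rank A via classical Gram-Schmidt.
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        u = A[:, j].astype(float)
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # R_ij = q_i^T a_j
            u -= R[i, j] * Q[:, i]        # remove components along q_1..q_{j-1}
        R[j, j] = np.linalg.norm(u)       # R_jj = ||u_j||
        Q[:, j] = u / R[j, j]             # q_j = u_j / ||u_j||
    return Q, R

A = np.random.rand(5, 3)
Q, R = gram_schmidt_qr(A)
print(np.allclose(A, Q @ R))              # True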
Time Complexity
O(mn²) time (assuming n ≤ m)
Very fast (the constant factor in the time complexity is very small)
Can be used to
Find an orthogonal basis for the column space of a matrix
Compute the rank of a matrix
Pre-process data for solving linear systems
QR Decomposition in python
Full QR
>>> from scipy import random, linalg
>>> A = random.rand(4, 2)
>>> q,r = linalg.qr(A)
>>> q.shape
(4, 4)
>>> r.shape
(4, 2)
Economic QR
>>> q,r = linalg.qr(A, mode='economic')
>>> q.shape
(4, 2)
>>> r.shape
(2, 2)
Singular Value Decomposition
Low-rank Approximation
Big matrix A: we want to approximate it by a "simple" matrix Ā
The goodness of the approximation can be measured by
‖A − Ā‖
(e.g., using ‖ · ‖_F)
Low-rank approximation: A ≈ BC
Important Question
Given A and k, what is the best rank-k approximation?
min_{B,C of rank k} ‖A − BC‖_F
The minimizing B and C are obtained from the SVD (Singular Value Decomposition) of A
Singular Value Decomposition
Any real matrix A ∈ R^{m×n} has the following Singular Value Decomposition (SVD):
A = U Σ V^T
U is an m×m unitary matrix (U^T U = I); the columns of U are orthogonal
V is an n×n unitary matrix (V^T V = I); the columns of V are orthogonal
Σ is an m×n rectangular diagonal matrix with non-negative real numbers on the diagonal
Usually, we assume the diagonal entries are arranged in descending order:
Σ = diag(σ_1, σ_2, · · · , σ_n), σ_1 ≥ σ_2 ≥ · · · ≥ σ_n
Singular Value Decomposition
u_1, u_2, · · · , u_m: left singular vectors, a basis for the column space
u_i^T u_j = 0 ∀ i ≠ j, u_i^T u_i = 1 ∀ i = 1, · · · , m
v_1, v_2, · · · , v_n: right singular vectors, a basis for the row space
v_i^T v_j = 0 ∀ i ≠ j, v_i^T v_i = 1 ∀ i = 1, · · · , n
The SVD is unique (up to permutations and rotations among equal singular values)
Linear Transformations
Matrix A is a linear transformation
For the right singular vectors v_1, · · · , v_n, what are the vectors after the transformation?
A = UΣV^T ⇒ AV = UΣV^T V = UΣ
Therefore, Av_i = σ_i u_i, i = 1, · · · , n
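This relation is easy to verify numerically (a quick check on a random matrix; numpy returns V^T, whose rows are the v_i):

import numpy as np

A = np.random.rand(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for i in range(3):
    print(np.allclose(A @ Vt[i], s[i] * U[:, i]))   # True: A v_i = sigma_i u_i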
Linear Transformations
For the left singular vectors, we can derive a similar relation using A^T:
A^T = VΣ^T U^T ⇒ A^T U = VΣ^T
Therefore, A^T u_i = σ_i v_i for i = 1, · · · , n (and A^T u_i = 0 for i > n)
Geometry of SVD
Given an input vector x, what does the linear transform Ax look like?
Assume A = UΣV^T is the SVD; then Ax is computed as follows:
Step 1: Represent x as
x = a_1 v_1 + · · · + a_n v_n, (a_i = v_i^T x)
Step 2: Transform v_i → σ_i u_i for all i:
Ax = a_1 σ_1 u_1 + · · · + a_n σ_n u_n
Reduced SVD (Thin SVD)
If m > n, then only the first n left singular vectors u_1, · · · , u_n are useful:
Therefore, when m > n, we often just use the following "thin SVD":
A = UΣV^T where U ∈ R^{m×n} and Σ, V ∈ R^{n×n}:
U, V have orthonormal columns, and Σ is diagonal
Best Rank-k approximation
Given a matrix A, how to form the best rank k approximation?
arg min_{X : rank(X)=k} ‖X − A‖ or arg min_{W ∈ R^{m×k}, H ∈ R^{n×k}} ‖A − WH^T‖
Solution: the truncated SVD:
A ≈ U_k Σ_k V_k^T,
where U_k, V_k contain the first k columns of the SVD matrices and Σ_k the top k singular values
Why? It keeps the k most important components of the SVD
Application: Data Approximation/Compression
Given a data matrix A ∈ Rm×n
Approximate the data matrix by the best rank-k approximation:
A ≈ WH^T
Compress the data: O(mn) memory ⇒ O(mk + nk) memory
De-noising (sometimes)
Example: if each column of A is a data point, then w_i (the i-th column of W) is a "dictionary" element, and H contains the coefficients:
a_i ≈ Σ_{j=1}^k H_ij w_j
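A sketch of this compression with a truncated SVD (random data; k = 2 chosen arbitrarily); the Frobenius error equals the discarded singular values, which can be checked directly:

import numpy as np

A = np.random.rand(100, 50)
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :k] * s[:k]                    # m x k factor (dictionary scaled by sigma)
H = Vt[:k].T                            # n x k coefficients
A_k = W @ H.T                           # best rank-k approximation
print(np.linalg.norm(A - A_k, 'fro'))   # equals sqrt of the sum of discarded sigma_i^2
print(np.sqrt(np.sum(s[k:] ** 2)))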
Eigenvalue Decomposition
Eigen Decomposition
For an n-by-n matrix H, if
Hy = λy,
then we say:
λ is an eigenvalue of H
y is the corresponding eigenvector
Eigen Decomposition
Consider A ∈ R^{n×n} to be a square, symmetric matrix. The eigenvalue decomposition of A is:
A = VΛV^T, with V^T V = I (V is unitary) and Λ diagonal
A = VΛV^T ⇒ AV = VΛ ⇒ Av_i = λ_i v_i, ∀ i = 1, · · · , n
Each v_i is an eigenvector, and each λ_i is an eigenvalue
Usually, we assume the diagonal entries are arranged in descending order:
Λ = diag(λ_1, λ_2, · · · , λ_n), λ_1 ≥ λ_2 ≥ · · · ≥ λ_n
The eigenvalue decomposition is unique when the n eigenvalues are distinct.
Eigen Decomposition
Each eigenvector v_i is mapped to Av_i = λ_i v_i by the linear transform:
scaling without changing the direction of the eigenvectors
Ax = Σ_{i=1}^n λ_i v_i (v_i^T x)
Project x onto the eigenvectors, then scale each component
Eigen Decomposition
How to compute SVD/eigen decomposition?
Eigen Decomposition & SVD
SVD can be transformed into an eigenvalue decomposition!
Given A ∈ R^{m×n} with SVD A = UΣV^T
We have
AA^T = UΣV^T VΣU^T = UΣ²U^T (the eigen decomposition of AA^T)
After getting U, Σ, we can compute V by
V^T = Σ^{-1} U^T A
Similarly, A^T A = VΣ²V^T is the eigen decomposition of A^T A
After getting V, Σ, we can compute U = AVΣ^{-1}
Eigen Decomposition & SVD
Which one should we use: the eigenvalue decomposition of AA^T or of A^T A?
It depends on the dimensions m vs. n (work with the smaller of the two matrices)
If A ∈ R^{m×n} and m > n:
Step 1: Compute the eigenvalue decomposition of A^T A = VΣ²V^T (an n×n problem)
Step 2: Compute U = AVΣ^{-1}
If A ∈ R^{m×n} and m < n:
Step 1: Compute the eigenvalue decomposition of AA^T = UΣ²U^T (an m×m problem)
Step 2: Compute V = A^T UΣ^{-1}
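A sketch of the m > n case in Python (assumes A has full column rank, so Σ is invertible; scipy.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

import numpy as np
from scipy import linalg

A = np.random.rand(100, 5)                 # m > n
lam, V = linalg.eigh(A.T @ A)              # Step 1: eig of the n x n matrix A^T A
idx = np.argsort(lam)[::-1]                # sort eigenpairs in descending order
lam, V = lam[idx], V[:, idx]
sigma = np.sqrt(lam)                       # singular values
U = (A @ V) / sigma                        # Step 2: U = A V Sigma^{-1}
print(np.allclose(A, (U * sigma) @ V.T))   # True: recovers A = U Sigma V^T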
Computing Eigenvalue Decomposition
If A is a dense matrix:
Call scipy.linalg.eig
If A is a (large) sparse matrix: usually we only compute the top-k eigenvectors with a small k
Power method
Lanczos algorithm (implemented in built-in packages)
Call scipy.sparse.linalg.eigs
Dense Eigenvalue Decomposition
Only for dense matrices
Will compute all the eigenvalues and eigenvectors
>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[1, 2], [3, 4]])
>>> S, V = linalg.eig(A)
>>> S
array([-0.37228132+0.j, 5.37228132+0.j])
>>> V
array([[-0.82456484, -0.41597356],
[ 0.56576746, -0.90937671]])
Dense SVD
Only for dense matrices
Specify "full_matrices=False" for the thin SVD
>>> A = np.array([[1,2],[3,4], [5,6]])
>>> U, S, V = linalg.svd(A)
>>> U
array([[-0.2298477 , 0.88346102, 0.40824829],
[-0.52474482, 0.24078249, -0.81649658],
[-0.81964194, -0.40189603, 0.40824829]])
>>> S
array([ 9.52551809, 0.51430058])
>>> U, S, V = linalg.svd(A, full_matrices=False)
>>> U
array([[-0.2298477 , 0.88346102],
[-0.52474482, 0.24078249],
[-0.81964194, -0.40189603]])
Sparse Eigenvalue Decomposition
Usually we only compute the top-k eigenvalues and eigenvectors (since the full U, V would be dense)
Use iterative methods to compute U_k (the top-k eigenvectors):
Initialize U_k by an n-by-k random matrix U_k^(0)
Iteratively update U_k:
U_k^(0) → U_k^(1) → U_k^(2) → · · ·
Converge to the SVD solution:
lim_{t→∞} U_k^(t) = U_k
Power Iteration for top eigenvector
Main idea: compute A^T v/‖A^T v‖ for a large power T (here A^T denotes T repeated applications of A, not the transpose).
A^T v/‖A^T v‖ → top eigenvector as T → ∞
Power Iteration for top eigenvector
Input: A ∈ R^{d×d}, number of iterations T
Initialize a random vector v ∈ R^d
For t = 1, 2, · · · , T:
  v ← Av
  v ← v/‖v‖
Output v (top eigenvector)
The top eigenvalue is v^T Av (= v^T λv = λ‖v‖² = λ)
Only one matrix-vector product is needed at each iteration
Time complexity: O(nnz(A)) per iteration
If A = XX^T, then Av = X(X^T v)
No need to explicitly form A
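A direct implementation of the iteration above (a sketch; assumes the top eigenvalue is strictly dominant in magnitude):

import numpy as np

def power_iteration(A, T=100):
    v = np.random.rand(A.shape[0])    # random init: u_1^T v != 0 with probability 1
    for _ in range(T):
        v = A @ v                     # one matrix-vector product per iteration
        v = v / np.linalg.norm(v)     # renormalize
    return v

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
v = power_iteration(A)
print(v @ A @ v)                          # Rayleigh quotient ~ 3.0 (top eigenvalue)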
Power Iteration
If A = UΛU^T is the eigenvalue decomposition of A,
what is the eigenvalue decomposition of A^t?
A^t = (UΛU^T)(UΛU^T) · · · (UΛU^T) = U diag(λ_1^t, λ_2^t, · · · , λ_d^t) U^T
A^t v = Σ_{i=1}^d λ_i^t (u_i^T v) u_i ∝ u_1 + (λ_2/λ_1)^t (u_2^T v / u_1^T v) u_2 + · · · + (λ_d/λ_1)^t (u_d^T v / u_1^T v) u_d
If v is a random vector, then u_1^T v ≠ 0 with probability 1, so
A^t v/‖A^t v‖ → u_1 as t → ∞
Power iteration for top k eigenvectors
Power Iteration for top k eigenvectors
Input: A ∈ R^{d×d}, number of iterations T, rank k
Initialize a random matrix V ∈ R^{d×k}
For t = 1, 2, · · · , T:
  V ← AV
  [Q, R] ← qr(V)
  V ← Q
S ← V^T AV
[U, S] ← eig(S)
Output: eigenvectors VU and eigenvalues S
This is sometimes faster than the default functions (scipy.sparse.linalg.eigs and svds), since those aim to compute very accurate solutions.
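A sketch of this top-k iteration for a symmetric A (using scipy.linalg.qr for the re-orthogonalization step, as in the QR examples earlier; eigh returns eigenvalues in ascending order):

import numpy as np
from scipy import linalg

def subspace_iteration(A, k, T=100):
    V = np.random.rand(A.shape[0], k)             # random d x k start
    for _ in range(T):
        V, _ = linalg.qr(A @ V, mode='economic')  # V <- AV, re-orthogonalized
    S = V.T @ A @ V                               # small k x k problem
    lam, U = linalg.eigh(S)
    return V @ U, lam                             # eigenvectors, eigenvalues

A = np.diag([5.0, 4.0, 1.0, 0.5])
vecs, vals = subspace_iteration(A, k=2)
print(vals)                                       # ~[4., 5.] (ascending order)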
Coming up
Numerical Linear Algebra for Machine Learning
Questions?