Page 1: EE270 Large Scale Matrix Computation, Optimization and Learning

Instructor: Mert Pilanci
Stanford University
Thursday, Jan 14, 2020

Page 2: Randomized Linear Algebra, Lecture 3: Applications of AMM, Error Analysis, Trace Estimation and Bootstrap

Page 3: Approximate Matrix Multiplication

Algorithm 1: Approximate Matrix Multiplication via Sampling

Input: an $n \times d$ matrix $A$, a $d \times p$ matrix $B$, an integer $m$, and probabilities $\{p_k\}_{k=1}^{d}$
Output: matrices $C$ and $R$ such that $CR \approx AB$

1: for $t = 1$ to $m$ do
2:   pick $i_t \in \{1, \ldots, d\}$ with probability $P[i_t = k] = p_k$, i.i.d. with replacement
3:   set $C^{(t)} = \frac{1}{\sqrt{m p_{i_t}}} A^{(i_t)}$ and $R_{(t)} = \frac{1}{\sqrt{m p_{i_t}}} B_{(i_t)}$, where $A^{(k)}$ denotes the $k$-th column of $A$ and $B_{(k)}$ the $k$-th row of $B$
4: end for

- We can multiply $CR$ using the classical algorithm
- Complexity: $O(nmp)$
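A minimal NumPy sketch of Algorithm 1 (the slides give only pseudocode; the function name `amm_sample` and its interface are my own):

```python
import numpy as np

def amm_sample(A, B, m, p, rng=None):
    """Approximate A @ B by sampling m column/row pairs (Algorithm 1).

    A: (n, d) array, B: (d, p_cols) array, p: length-d sampling probabilities.
    Returns C of shape (n, m) and R of shape (m, p_cols) with C @ R ~= A @ B.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[1]
    idx = rng.choice(d, size=m, replace=True, p=p)  # i.i.d. draws, with replacement
    scale = 1.0 / np.sqrt(m * p[idx])               # 1 / sqrt(m * p_{i_t})
    C = A[:, idx] * scale                           # rescaled sampled columns of A
    R = B[idx, :] * scale[:, None]                  # rescaled sampled rows of B
    return C, R
```

The approximation error can then be checked directly, e.g. `np.linalg.norm(A @ B - C @ R, 'fro')`.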

Page 4: AMM mean and variance

$$AB \approx CR = \frac{1}{m} \sum_{t=1}^{m} \frac{1}{p_{i_t}} A^{(i_t)} B_{(i_t)}$$

- Mean and variance of the matrix multiplication estimator

Lemma

- $E[(CR)_{ij}] = (AB)_{ij}$
- $\mathrm{Var}[(CR)_{ij}] = \frac{1}{m} \sum_{k=1}^{d} \frac{A_{ik}^2 B_{kj}^2}{p_k} - \frac{1}{m} (AB)_{ij}^2$
- $E\|AB - CR\|_F^2 = \sum_{ij} E(AB - CR)_{ij}^2 = \sum_{ij} \mathrm{Var}[(CR)_{ij}]$
  $= \frac{1}{m} \sum_{k=1}^{d} \frac{\left(\sum_i A_{ik}^2\right)\left(\sum_j B_{kj}^2\right)}{p_k} - \frac{1}{m} \|AB\|_F^2$
  $= \frac{1}{m} \sum_{k=1}^{d} \frac{1}{p_k} \|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2 - \frac{1}{m} \|AB\|_F^2$
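A quick Monte Carlo check of the lemma (a sketch under my own test setup: the average squared Frobenius error over many independent draws of $CR$ should match the closed-form expression):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_cols, m, trials = 20, 50, 15, 10, 20000
A = rng.standard_normal((n, d))
B = rng.standard_normal((d, p_cols))
AB = A @ B
p = np.full(d, 1.0 / d)                  # uniform sampling probabilities

sq_errs = []
for _ in range(trials):
    idx = rng.choice(d, size=m, p=p)
    s = 1.0 / np.sqrt(m * p[idx])
    C, R = A[:, idx] * s, B[idx, :] * s[:, None]
    sq_errs.append(np.linalg.norm(AB - C @ R, "fro") ** 2)

# Closed form: (1/m) sum_k ||A^(k)||_2^2 ||B_(k)||_2^2 / p_k - (1/m) ||AB||_F^2
col_norms_sq = (A ** 2).sum(axis=0)      # ||A^(k)||_2^2 per column k
row_norms_sq = (B ** 2).sum(axis=1)      # ||B_(k)||_2^2 per row k
theory = (col_norms_sq * row_norms_sq / p).sum() / m \
         - np.linalg.norm(AB, "fro") ** 2 / m
print(np.mean(sq_errs), theory)          # the two values should agree closely
```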

Page 5: Optimal sampling probabilities

- Nonuniform sampling:
  $$p_k = \frac{\|A^{(k)}\|_2 \|B_{(k)}\|_2}{\sum_{k'=1}^{d} \|A^{(k')}\|_2 \|B_{(k')}\|_2}$$
- minimizes $E\|AB - CR\|_F^2$
- $E\|AB - CR\|_F^2 = \frac{1}{m} \sum_{k=1}^{d} \frac{1}{p_k} \|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2 - \frac{1}{m} \|AB\|_F^2$
  $= \frac{1}{m} \left( \sum_{k=1}^{d} \|A^{(k)}\|_2 \|B_{(k)}\|_2 \right)^2 - \frac{1}{m} \|AB\|_F^2$
  is the optimal error
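A small helper computing these optimal $\ell_2$-norm probabilities (a sketch; the function name is mine):

```python
import numpy as np

def optimal_probs(A, B):
    """l2-norm sampling: p_k proportional to ||A^(k)||_2 * ||B_(k)||_2."""
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    return w / w.sum()
```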

Page 6: EE270 Large scale matrix computation, optimization and ...

Final Probability Bound for `2-norm sampling

I For any δ > 0, set m = 1δ ε2

to obtain

P [‖AB − CR‖F > ε‖A‖F‖B‖F ] ≤ δ (1)

I i.e., ‖AB − CR‖F < ε‖A‖F‖B‖F with probability 1− δI note that m is independent of any dimensions
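The bound is not derived on the slide; a short reconstruction using Markov's inequality and Cauchy-Schwarz (my own filling-in of the steps):

```latex
% Cauchy-Schwarz: sum_k ||A^(k)||_2 ||B_(k)||_2 <= ||A||_F ||B||_F, so with
% the optimal probabilities,
E\|AB - CR\|_F^2 \le \frac{1}{m}\Big(\sum_{k=1}^{d}\|A^{(k)}\|_2\|B_{(k)}\|_2\Big)^{2}
                 \le \frac{1}{m}\,\|A\|_F^2\,\|B\|_F^2 .
% Markov's inequality applied to \|AB - CR\|_F^2, with m = 1/(\delta\varepsilon^2):
P\big[\|AB - CR\|_F > \varepsilon\|A\|_F\|B\|_F\big]
  \le \frac{E\|AB - CR\|_F^2}{\varepsilon^2\|A\|_F^2\|B\|_F^2}
  \le \frac{1}{m\,\varepsilon^2} = \delta .
```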

Page 7: Numerical simulations for AMM

- Approximating $A^T A$ on a subset of the CIFAR dataset

[Figure: error in spectral norm, normalized by $\|A\|_F$, vs. $q$ (number of row samples), for uniform sampling and $\ell_2$-norm sampling]

Page 8: Numerical simulations for AMM

- Approximating $A^T A$ on a sparse matrix from a computational fluid dynamics model

[Figure: error in spectral norm, normalized by $\|A\|_F$, vs. $q$ (number of row samples), for uniform sampling and $\ell_2$-norm sampling]

SuiteSparse Matrix Collection: https://sparse.tamu.edu

Page 9: Sampling with replacement vs. without replacement

SuiteSparse Matrix Collection: https://sparse.tamu.edu

Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.

Page 10: Applications of Approximate Matrix Multiplication

- Simultaneous Localization and Mapping (SLAM)

Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.

Page 11: Applications of Approximate Matrix Multiplication

Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.

Page 12: Applications of Approximate Matrix Multiplication

Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.

Page 13: Neural Networks

- Given an image $x$
- Classify into $M$ classes
- Neural network: $f(x) = W_L\, s(\cdots s(W_2\, s(W_1 x)) \cdots)$
- $W_1, \ldots, W_L$ are trained weight matrices

LeCun et al. (1998)
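A minimal sketch of this feedforward map in NumPy (the layer sizes and the choice of ReLU for the nonlinearity $s$ are my assumptions, not from the slides):

```python
import numpy as np

def forward(x, weights):
    """f(x) = W_L s(... s(W_2 s(W_1 x)) ...), with s = ReLU (assumed)."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)   # s(W_l x)
    return weights[-1] @ x           # final layer, no nonlinearity

rng = np.random.default_rng(0)
dims = [784, 128, 64, 10]            # e.g., image vector -> M = 10 class scores
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
scores = forward(rng.standard_normal(784), weights)
```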

Page 14: Neural Networks

LeCun et al. (1998)

Page 15: AMM for neural networks

Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.

Page 16: Probing the actual error

- $AB \approx CR$
- $\Delta \triangleq AB - CR$
- How large is the error $\|\Delta\|_F$?
- $\|\Delta\|_F^2 = \mathrm{tr}\left(\Delta^T \Delta\right)$
- trace of a matrix $B$: $\mathrm{tr}(B) \triangleq \sum_i B_{ii}$
- trace estimation

Page 17: Trace estimation

- Let $B$ be an $n \times n$ symmetric matrix
- Let $u_1, \ldots, u_n$ be $n$ i.i.d. samples of a random variable $U$ with mean zero and variance $\sigma^2$, and let $u = [u_1, \ldots, u_n]^T$

Lemma
$$E[u^T B u] = \sigma^2\, \mathrm{tr}(B)$$
$$\mathrm{Var}[u^T B u] = 2\sigma^4 \sum_{i \ne j} B_{ij}^2 + \big(E[U^4] - \sigma^4\big) \sum_i B_{ii}^2$$
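A quick empirical check of the lemma (a sketch with my own test setup; standard Gaussian probes give $\sigma^2 = 1$ and $E[U^4] = 3$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 8, 200000
M = rng.standard_normal((n, n))
B = (M + M.T) / 2                               # symmetric test matrix

U = rng.standard_normal((trials, n))            # each row is a probe vector u
est = np.einsum("ti,ij,tj->t", U, B, U)         # u^T B u, one value per trial

off = (B ** 2).sum() - (np.diag(B) ** 2).sum()  # sum_{i != j} B_ij^2
# Gaussian U: sigma^2 = 1, E[U^4] = 3, so Var = 2*off + 2*sum_i B_ii^2
print(est.mean(), np.trace(B))                  # ~ equal (unbiased)
print(est.var(), 2 * off + 2 * (np.diag(B) ** 2).sum())
```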

Page 18: Trace estimation: optimal sampling distribution

- Let $B$ be an $n \times n$ symmetric matrix
- Let $u_1, \ldots, u_n$ be $n$ i.i.d. samples of a random variable $U$ with mean zero and variance $\sigma^2$

$$E[u^T B u] = \sigma^2\, \mathrm{tr}(B)$$
$$\mathrm{Var}[u^T B u] = 2\sigma^4 \sum_{i \ne j} B_{ij}^2 + \big(E[U^4] - \sigma^4\big) \sum_i B_{ii}^2$$

- minimum variance unbiased estimator:
  $$\min_{p(U)}\ \mathrm{Var}[u^T B u] \quad \text{subject to} \quad E[u^T B u] = \mathrm{tr}(B)$$
- $\mathrm{Var}(U^2) = E[U^4] - \sigma^4 \ge 0$
- the variance is minimized when $\mathrm{Var}(U^2) = 0$
- i.e., $U^2 = 1$ with probability one


Page 20: Optimal trace estimation

- Let $B$ be an $n \times n$ symmetric matrix with non-zero trace
- Let $U$ be the discrete random variable taking the values $1$ and $-1$ each with probability $\frac{1}{2}$ (Rademacher distribution)
- Let $u = [u_1, \ldots, u_n]^T$ have i.i.d. entries $\sim U$
- $u^T B u$ is an unbiased estimator of $\mathrm{tr}(B)$ and
  $$\mathrm{Var}[u^T B u] = 2 \sum_{i \ne j} B_{ij}^2.$$
- $U$ is the unique variable among zero-mean random variables for which $u^T B u$ is a minimum variance, unbiased estimator of $\mathrm{tr}(B)$.

Hutchinson (1990)
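A minimal NumPy implementation of Hutchinson's estimator (a sketch; the function name and interface are mine):

```python
import numpy as np

def hutchinson(B, num_probes, rng=None):
    """Estimate tr(B) by averaging u^T B u over Rademacher probe vectors u."""
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[0]
    u = rng.choice([-1.0, 1.0], size=(num_probes, n))  # i.i.d. +/-1 entries
    return np.einsum("ti,ij,tj->t", u, B, u).mean()
```

Comparing `hutchinson(B, 10000)` against `np.trace(B)` illustrates both the unbiasedness and the variance reduction from averaging probes.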

Page 21: Application to Approximate Matrix Multiplication

- $\|AB - CR\|_F^2 = \mathrm{tr}\big((AB - CR)^T (AB - CR)\big)$
- can be estimated via $u^T (AB - CR)^T (AB - CR)\, u = \|(AB - CR)\, u\|_2^2$, where $u = [u_1, \ldots, u_n]^T$ has i.i.d. $\pm 1$ entries, each with probability $\frac{1}{2}$
- only requires matrix-vector products
- variance can be reduced by averaging independent trials
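A sketch of this error estimator using matrix-vector products only, never forming $AB - CR$ explicitly (the function name is mine):

```python
import numpy as np

def frob_err_sq_estimate(A, B, C, R, num_probes, rng=None):
    """Estimate ||AB - CR||_F^2 by averaging ||(AB - CR) u||_2^2 over Rademacher u."""
    rng = np.random.default_rng() if rng is None else rng
    p_cols = B.shape[1]
    total = 0.0
    for _ in range(num_probes):
        u = rng.choice([-1.0, 1.0], size=p_cols)
        v = A @ (B @ u) - C @ (R @ u)   # matrix-vector products only
        total += v @ v                  # ||(AB - CR) u||_2^2
    return total / num_probes           # average over independent trials
```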

Page 22: Sampling/Sketching Matrix Formalism

- Define the sampling matrix
  $$S_{ij} = \begin{cases} 1 & \text{if the $i$-th column of $A$ is chosen in the $j$-th trial} \\ 0 & \text{otherwise} \end{cases}$$
- diagonal re-weighting matrix
  $$D_{tt} = \frac{1}{\sqrt{m\, p_{i_t}}}$$
- $AB \approx CR$ with $C = ASD$ and $R = DS^T B$
- redefining $S := DS^T$ as the combined sketch matrix,
  $$CR = ASD\, DS^T B = AS^T S B$$
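A small numerical check of this identity (my own test setup; it builds the explicit $S$ and $D$ and verifies $CR = AS^TSB$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_cols, m = 6, 10, 4, 5
A = rng.standard_normal((n, d))
B = rng.standard_normal((d, p_cols))
p = np.full(d, 1.0 / d)
idx = rng.choice(d, size=m, p=p)

S0 = np.zeros((d, m))                 # sampling matrix: S0[i, t] = 1 iff column i picked at trial t
S0[idx, np.arange(m)] = 1.0
D = np.diag(1.0 / np.sqrt(m * p[idx]))

C, R = A @ S0 @ D, D @ S0.T @ B       # C = ASD, R = DS^T B
S = D @ S0.T                          # combined sketch matrix
assert np.allclose(C @ R, A @ S.T @ S @ B)
```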



Page 26: Estimating the entry-wise error

- infinity norm error:
  $$\varepsilon(S) \triangleq \|AS^TSB - AB\|_\infty = \max_{ij} \big|(AS^TSB)_{ij} - (AB)_{ij}\big|$$
- the 0.99-quantile of $\varepsilon(S)$ is the tightest upper bound that holds with probability at least 0.99
- Bootstrap procedure:
  For $b = 1, \ldots, B$ do:
    sample $m$ numbers with replacement from $\{1, \ldots, m\}$
    form $S_b$ by selecting the respective rows of $S$
    compute $\varepsilon_b = \|AS_b^T S_b B - AS^TSB\|_\infty$
  return the 0.99-quantile of the values $\varepsilon_1, \ldots, \varepsilon_B$
  (e.g., sort in increasing order and return the $\lfloor 0.99B \rfloor$-th value)
- this imitates the random mechanism that originally generated $AS^TSB$

Lopes et al., A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication.
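A sketch of the bootstrap procedure in NumPy (my own rendering of the steps above, not code from Lopes et al.; `num_boot` plays the role of $B$ on the slide):

```python
import numpy as np

def bootstrap_error_quantile(A, B, S, num_boot=100, q=0.99, rng=None):
    """Bootstrap estimate of the q-quantile of ||A S^T S B - AB||_inf.

    S is the (m, d) combined sketch matrix; only sketched products are
    recomputed on resampled rows of S, never the exact product AB.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = S.shape[0]
    sketched = (A @ S.T) @ (S @ B)             # A S^T S B
    errs = np.empty(num_boot)
    for b in range(num_boot):
        Sb = S[rng.integers(0, m, size=m), :]  # resample m rows with replacement
        errs[b] = np.abs((A @ Sb.T) @ (Sb @ B) - sketched).max()
    return np.quantile(errs, q)
```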

Page 27: Extrapolating the error

- $\varepsilon(S) \triangleq \|AS^TSB - AB\|_\infty$
- for sufficiently large $m$, the 0.99-quantile of $\varepsilon(S)$ is $\approx \frac{\kappa}{\sqrt{m}}$, where $\kappa$ is an unknown constant
- given an initial sketch of size $m_0$, we can extrapolate the error for $m > m_0$ via the bootstrap estimate as
  $$\frac{\sqrt{m_0}}{\sqrt{m}}\, \varepsilon(S)$$
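The extrapolation itself is a one-line rescaling; a sketch (function name mine), meant to be applied to the bootstrap quantile computed at the initial sketch size $m_0$:

```python
import numpy as np

def extrapolate_error(err_m0, m0, m):
    """Scale an error estimate at sketch size m0 to a larger sketch size m,
    using the kappa / sqrt(m) decay of the 0.99-quantile."""
    return np.sqrt(m0 / m) * err_m0
```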

Page 28: Extrapolation: Numerical example

- Protein dataset ($n = 17766$, $d = 356$)

[Figure: the black line is the 0.99-quantile of the error as a function of m; the blue star is the average bootstrap estimate at the initial sketch size m0 = 500, and the blue line is the average extrapolated estimate derived from m0]

Lopes et al., A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication.

Page 29: Questions?