Tall-and-Skinny QR Factorizations in MapReduce

DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT
PAUL G. CONSTANTINE, AUSTIN BENSON, JOE NICHOLS, STANFORD UNIVERSITY
JAMES DEMMEL, UC BERKELEY
JOE RUTHRUFF, JEREMY TEMPLETON, SANDIA
Questions?
Most recent code at http://github.com/arbenson/mrtsqr
Quick review of QR

QR factorization: let A ∈ ℝ^(m×n) be real with m ≥ n. Then A = Q [R; 0], where Q is orthogonal (QᵀQ = I) and R is upper triangular.

Using QR for regression: the least-squares solution of min ‖Ax − b‖ is given by the solution of Rx = Qᵀb.

QR is block normalization: "normalize" a vector usually generalizes to computing Q in the QR factorization.
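To make the definitions concrete, here is a small numpy check of the thin QR factorization and the regression use; the matrix sizes are arbitrary illustrations, not from the talk.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))   # tall-and-skinny: m >> n
b = rng.standard_normal(1000)

Q, R = np.linalg.qr(A)               # thin QR: Q is 1000x5, R is 5x5
print(np.allclose(Q.T @ Q, np.eye(5)))   # Q has orthonormal columns
print(np.allclose(Q @ R, A))             # A = QR
print(np.allclose(R, np.triu(R)))        # R is upper triangular

# Least-squares regression via QR: solve R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))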
Tall-and-Skinny matrices (m ≫ n) arise in:
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA

All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.

[Figure: a tall-and-skinny matrix A, from the tinyimages collection]
The Database

Input parameters s map to a time history of the simulation f(s), ~100 GB per run. The database stores the pairs s1 → f1, s2 → f2, …, sk → fk.

The simulation as a vector: a single simulation stacks the quantity of interest q over all mesh points x1, …, xn at each time step t1, …, tk (each group q(·, tj, s) is a single simulation at one time step):

f(s) = [ q(x1, t1, s), …, q(xn, t1, s), q(x1, t2, s), …, q(xn, t2, s), …, q(xn, tk, s) ]ᵀ

The database as a very tall-and-skinny matrix:

X = [ f(s1) f(s2) … f(sp) ]
Dynamic Mode Decomposition: one simulation, ~10 TB of data; compute the SVD of a space-by-time matrix.

[DMD video]
MapReduce

It's a computational model and a framework.
The MapReduce Framework

Originated at Google for indexing web pages and computing PageRank.

Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

[Diagram: map tasks (M) feed a shuffle into reduce tasks (R). Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input and output are on disk.]

Data scalable. Fault-tolerance by design.
Computing variance in MapReduce

[Figure: three simulation runs (Run 1, Run 2, Run 3), each with time steps T=1, T=2, T=3 over the same mesh.]

Mesh point variance in MapReduce:
1. Each mapper outputs the mesh points, keyed by mesh point.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
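A minimal pure-Python sketch of this pattern; the record layout here is a guess for illustration, not the talk's actual data format.

from collections import defaultdict
import statistics

# Toy records: ((run, time, mesh_point), value) for 3 runs, 3 time steps.
records = [((run, t, point), float(run + t + point))
           for run in (1, 2, 3) for t in (1, 2, 3) for point in range(4)]

# Map: emit each value keyed by its mesh point.
mapped = [(point, value) for ((run, t, point), value) in records]

# Shuffle: group all values with the same key on the same reducer.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: compute a numerical variance per mesh point.
variances = {point: statistics.pvariance(vals) for point, vals in groups.items()}
print(variances)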
MapReduce vs. Hadoop

MapReduce: a computation model with
Map: a local data transform
Shuffle: a grouping function
Reduce: an aggregation

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twisted, Google MapReduce, Spark, …
Current state of the art for MapReduce QR

MapReduce is often used to compute the principal components of large datasets. These approaches all form the normal equations AᵀA and work with that matrix.

MapReduce is great for TSQR! You don't need AᵀA.

Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: one 1-by-50 row
HDFS size: 183.6 GB

Time to read A: 253 sec.
Time to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A) (numerically stable): 3090 sec.
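Why avoid AᵀA? Forming the normal equations squares the condition number, which is the root of the accuracy problems. A small numpy demonstration; the sizes and singular values are chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((500, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = U @ np.diag([1e0, 1e-2, 1e-4, 1e-6]) @ V.T   # cond(A) = 1e6

print(np.linalg.cond(A))         # ~1e6
print(np.linalg.cond(A.T @ A))   # ~1e12: squared by the normal equations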
Tall-and-Skinny QR
Communication-avoiding TSQR (Demmel et al. 2008)

First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
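A minimal numpy sketch of the two steps; the block sizes are arbitrary, and a real implementation distributes the local factorizations.

import numpy as np

rng = np.random.default_rng(2)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]  # A split by rows

# First, QR factorizations of each local matrix; keep only the R factors.
Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]

# Second, a QR factorization of the stacked R factors gives R of the full A.
R = np.linalg.qr(np.vstack(Rs), mode='r')

# R agrees with a direct QR of A, up to the signs of its rows.
R_direct = np.linalg.qr(np.vstack(blocks), mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))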
Fully serial TSQR (Demmel et al. 2008)

Compute the QR of the first block A1, read the next block A2, update the QR factorization, and so on.

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
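The serial variant as a numpy sketch: hold only the current R and fold in one block at a time (sizes again illustrative).

import numpy as np

rng = np.random.default_rng(3)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]

R = np.linalg.qr(blocks[0], mode='r')    # QR of the first block
for Ai in blocks[1:]:                    # read a block, update the QR
    R = np.linalg.qr(np.vstack([R, Ai]), mode='r')

R_direct = np.linalg.qr(np.vstack(blocks), mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))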
Tall-and-skinny matrix storage in MapReduce

Key: an arbitrary row id. Value: the array for a row. The matrix is stored as blocks of rows A1, A2, A3, A4; each submatrix is an input split.

You can also store multiple rows together. It goes a little faster.
[Diagram: Mapper 1 (serial TSQR) stacks A1 and A2, factors qr → Q2, R2; stacks R2 with A3, factors qr → Q3, R3; stacks R3 with A4, factors qr → Q4, R4; emits R4. Mapper 2 (serial TSQR) does the same on A5, A6, A7, A8 and emits R8. Reducer 1 (serial TSQR) collects R4 and R8, factors qr → Q, R, and emits the final R.]

Algorithm:
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
Key limitation: computes only R and not Q.

Can get Q via Q = AR⁻¹ with another MapReduce iteration. Numerical stability: dubious, since ‖QᵀQ − I‖ is large, although iterative refinement helps.

Achieving numerical stability
[Plot: ‖QᵀQ − I‖ versus condition number (10⁵ to 10²⁰) for three methods: AR⁻¹, AR⁻¹ + iterative refinement, and direct TSQR.]
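The effect is easy to reproduce in numpy; a sketch with a synthetically ill-conditioned test matrix, chosen to mimic the plot's x-axis.

import numpy as np

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((2000, 10)))
A = U * np.logspace(0, -9, 10)          # scale columns: cond(A) ~ 1e9

R = np.linalg.qr(A, mode='r')
Q_ar = A @ np.linalg.inv(R)             # Q via A R^{-1}
Q_hh, _ = np.linalg.qr(A)               # Householder QR (stable)

print(np.linalg.norm(Q_ar.T @ Q_ar - np.eye(10)))  # grows with cond(A)
print(np.linalg.norm(Q_hh.T @ Q_hh - np.eye(10)))  # stays near machine eps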
Why MapReduce?
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # Compute a QR factorization of the buffered rows; keep only R.
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # Reset the buffer and re-initialize it to the rows of R.
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    # The methods below were elided on the slide; this is a minimal
    # reconstruction of the pattern (see github.com/arbenson/mrtsqr
    # for the actual code).
    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.collect(key, value)

    def close(self):
        self.compress()
        for row in self.data:
            yield random.random(), row
[Plot (fault injection experiment): time to completion in seconds (roughly 100 to 200) versus 1/Prob(failure), the mean number of successes per failure, for four configurations: no faults (200M by 200), faults (200M by 200), no faults (800M by 10), faults (800M by 10).]
With 1/5 tasks failing, the job only takes twice as long.
How to get Q?
Idea 1 (unstable)

Run TSQR to get R, then distribute R and apply R⁻¹ in a second pass: Mapper 1 computes Q1 = A1 R⁻¹, Q2 = A2 R⁻¹, Q3 = A3 R⁻¹, Q4 = A4 R⁻¹.
Idea 2 (better)

First pass, as in Idea 1: TSQR gives R; distribute R and compute Qi = Ai R⁻¹ in Mapper 1.

Then apply one step of iterative refinement: a second TSQR pass on the computed Q gives T; distribute T and compute Qi ← Qi T⁻¹ in Mapper 2.

There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
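A numpy sketch of both passes on a single machine; the per-mapper blocking is elided and the conditioning is synthetic.

import numpy as np

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((2000, 10)))
A = U * np.logspace(0, -7, 10)          # ill-conditioned test matrix

# Pass 1 (Idea 1): TSQR gives R; each mapper would form Q_i = A_i R^{-1}.
R = np.linalg.qr(A, mode='r')
Q = A @ np.linalg.inv(R)
print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # noticeably non-orthogonal

# Pass 2 (Idea 2): one refinement step: TSQR of Q gives T; Q <- Q T^{-1}.
T = np.linalg.qr(Q, mode='r')
Q = Q @ np.linalg.inv(T)
print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # orthogonality restored
print(np.allclose(Q @ (T @ R), A))            # A = Q (T R) still holds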
Recall communication-avoiding TSQR (Demmel et al. 2008): first do QR factorizations of each local matrix, then compute a QR factorization of the new "R".
Idea 3 (best!)

Direct TSQR:
1. Output local Q and R in separate files: each mapper computes a local QR (A1 → Q1 R1, …, A4 → Q4 R4), writing the Q factors to a Q output and the R factors to an R output.
2. Collect R on one node and compute Qs for each piece: Task 2 stacks R1, …, R4, computes their QR factorization, and outputs the final R along with the inner factors Q11, Q21, Q31, Q41 (one piece per block).
3. Distribute the pieces of Q*1 and form the true Q: Mapper 3 multiplies each stored local Qi with its piece Qi1.
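A numpy sketch of the three steps; everything here runs in one process, whereas in the MapReduce version steps 1 and 3 are map passes and step 2 is a single task.

import numpy as np

rng = np.random.default_rng(6)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]
n = 8

# Step 1: each mapper computes a local QR and stores both factors.
local_qrs = [np.linalg.qr(Ai) for Ai in blocks]

# Step 2: one task stacks the R factors, computes their QR, and splits the
# inner Q into one n-by-n piece per block.
Q_inner, R = np.linalg.qr(np.vstack([Ri for _, Ri in local_qrs]))
pieces = np.split(Q_inner, len(blocks))

# Step 3: each mapper forms its piece of the true Q.
Q = np.vstack([Qi @ piece for (Qi, _), piece in zip(local_qrs, pieces)])

print(np.allclose(Q @ R, np.vstack(blocks)))   # A = QR
print(np.linalg.norm(Q.T @ Q - np.eye(n)))     # near machine eps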
The price is right!

[Plot: running time in seconds (roughly 500 to 2500) for full TSQR versus refinement, as the number of columns grows.]

Full TSQR is faster than refinement for few columns, and not any slower for many columns.
What can we do now?
PCA of 80,000,000 images

A: 80,000,000 images (rows) by 1000 pixels (columns).

MapReduce: zero-mean the rows, then TSQR produces R.
Post-processing: the SVD of R gives the singular values and V.

[Figure: the first 16 columns of V as images, and the top 100 singular values (principal components).]
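The post-processing step relies on the fact that R from TSQR has the same singular values (and right singular vectors, up to signs) as A. A small-scale numpy sketch; the sizes are scaled down from the 80M-by-1000 problem.

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100000, 50))
A -= A.mean(axis=0)                     # the zero-mean (centering) step

R = np.linalg.qr(A, mode='r')           # produced by TSQR at scale
_, S, Vt = np.linalg.svd(R)             # small 50x50 SVD

print(np.allclose(S, np.linalg.svd(A, compute_uv=False)))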
A Large Scale Example

Nonlinear heat transfer model: 80k nodes, 300 time steps, 104 basis runs. SVD of a 24M-by-104 data matrix. 500x reduction in wall-clock time (100x including the SVD).
What's next?

Investigate randomized algorithms for computing SVDs of fatter matrices.
Algorithm: Randomized PCA

Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣVᵀ. The columns of U estimate the first 2k principal components of A.

Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AAᵀ)^q AΩ by multiplying alternately with A and Aᵀ.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Stage B:
1. Form B = QᵀA.
2. Compute an SVD of the small matrix: B = ŨΣVᵀ.
3. Set U = QŨ.
The singular spectrum of the data matrix often decays quite slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1 or 2 is usually sufficient in practice. The complete scheme appears above as the Randomized PCA algorithm.

This procedure requires only 2(q + 1) passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count is

T_randPCA ≈ qk T_mult + k²(m + n),

where T_mult is the cost of a matrix–vector multiply with A or Aᵀ.

Theorem 1.2. Suppose that A is a real m × n matrix. Select an exponent q and a target number k of principal components, where 2 ≤ k ≤ 0.5 min{m, n}. Execute the randomized PCA algorithm to obtain a rank-2k factorization UΣVᵀ. Then

E ‖A − UΣVᵀ‖ ≤ [1 + 4 √(2 min{m, n} / (k − 1))]^(1/(2q+1)) σ_(k+1),

where E denotes expectation with respect to the random test matrix and σ_(k+1) is the (k + 1)th singular value of A.

The power iteration drives the leading constant in this bound to one exponentially fast as the power q increases. Since the rank-k approximation of A can never achieve an error smaller than σ_(k+1), the randomized procedure estimates 2k principal components that carry essentially as much variance as the first k actual principal components. Truncating the approximation to the first k terms typically produces very accurate results.
Halko, Martinsson, Tropp. SIREV 2011
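A direct numpy transcription of the algorithm above. For numerical robustness, production code also re-orthonormalizes between the power-iteration multiplies; this sketch follows the excerpt literally, and the test sizes are arbitrary.

import numpy as np

def randomized_pca(A, k, q=1):
    m, n = A.shape
    rng = np.random.default_rng(8)
    # Stage A: Gaussian test matrix, power iteration, orthonormal basis.
    Omega = rng.standard_normal((n, 2 * k))
    Y = A @ Omega
    for _ in range(q):                  # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Stage B: SVD of the small matrix B = Q^T A.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt             # rank-2k factorization U S V^T

A = np.random.default_rng(9).standard_normal((5000, 300))
U, S, Vt = randomized_pca(A, k=10)
print(np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))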
Questions?
Most recent code at http://github.com/arbenson/mrtsqr