Tall-and-Skinny QR Factorizations in MapReduce

DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT
PAUL G. CONSTANTINE, AUSTIN BENSON, JOE NICHOLS, STANFORD UNIVERSITY
JAMES DEMMEL, UC BERKELEY
JOE RUTHRUFF, JEREMY TEMPLETON, SANDIA
Questions?
Most recent code at http://github.com/arbenson/mrtsqr
Quick review of QR

QR factorization: let A ∈ ℝ^(m×n) be real with m ≥ n. Then A = Q [R; 0], where Q is orthogonal (QᵀQ = I) and R is upper triangular.

Using QR for regression: the least-squares solution of min ‖Ax − b‖ is given by the solution of Rx = Qᵀb.

QR is block normalization: "normalize" a vector usually generalizes to computing Q in the QR factorization.
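To make the definitions concrete, here is a small numpy check of the thin QR factorization and the regression use; the matrix sizes are arbitrary illustrations, not from the talk.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))   # tall-and-skinny: m >> n
b = rng.standard_normal(1000)

Q, R = np.linalg.qr(A)               # thin QR: Q is 1000x5, R is 5x5
print(np.allclose(Q.T @ Q, np.eye(5)))   # Q has orthonormal columns
print(np.allclose(Q @ R, A))             # A = QR
print(np.allclose(R, np.triu(R)))        # R is upper triangular

# Least-squares regression via QR: solve R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))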
Tall-and-Skinny matrices (m ≫ n) arise in:
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA

All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.

[Figure: a tall-and-skinny matrix A, from the tinyimages collection]
The Database

Input parameters s map to a time history of the simulation f(s), ~100 GB per run. The database stores the pairs s1 → f1, s2 → f2, …, sk → fk.

The simulation as a vector: a single simulation stacks the quantity of interest q over all mesh points x1, …, xn at each time step t1, …, tk (each group q(·, tj, s) is a single simulation at one time step):

f(s) = [ q(x1, t1, s), …, q(xn, t1, s), q(x1, t2, s), …, q(xn, t2, s), …, q(xn, tk, s) ]ᵀ

The database as a very tall-and-skinny matrix:

X = [ f(s1) f(s2) … f(sp) ]
Dynamic Mode Decomposition: one simulation, ~10 TB of data; compute the SVD of a space-by-time matrix.

[DMD video]
MapReduce

It's a computational model and a framework.
The MapReduce Framework

Originated at Google for indexing web pages and computing PageRank.

Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

[Diagram: map tasks (M) feed a shuffle into reduce tasks (R). Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input and output are on disk.]

Data scalable. Fault-tolerance by design.
Computing variance in MapReduce

[Figure: three simulation runs (Run 1, Run 2, Run 3), each with time steps T=1, T=2, T=3 over the same mesh.]

Mesh point variance in MapReduce:
1. Each mapper outputs the mesh points, keyed by mesh point.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
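A minimal pure-Python sketch of this pattern; the record layout here is a guess for illustration, not the talk's actual data format.

from collections import defaultdict
import statistics

# Toy records: ((run, time, mesh_point), value) for 3 runs, 3 time steps.
records = [((run, t, point), float(run + t + point))
           for run in (1, 2, 3) for t in (1, 2, 3) for point in range(4)]

# Map: emit each value keyed by its mesh point.
mapped = [(point, value) for ((run, t, point), value) in records]

# Shuffle: group all values with the same key on the same reducer.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: compute a numerical variance per mesh point.
variances = {point: statistics.pvariance(vals) for point, vals in groups.items()}
print(variances)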
MapReduce vs. Hadoop

MapReduce: a computation model with
Map: a local data transform
Shuffle: a grouping function
Reduce: an aggregation

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twisted, Google MapReduce, Spark, …
Current state of the art for MapReduce QR

MapReduce is often used to compute the principal components of large datasets. These approaches all form the normal equations AᵀA and work with that matrix.

MapReduce is great for TSQR! You don't need AᵀA.

Data: a tall-and-skinny (TS) matrix, stored by rows
Input: 500,000,000-by-50 matrix
Each record: one 1-by-50 row
HDFS size: 183.6 GB

Time to read A: 253 sec.
Time to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A) (numerically stable): 3090 sec.
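Why avoid AᵀA? Forming the normal equations squares the condition number, which is the root of the accuracy problems. A small numpy demonstration; the sizes and singular values are chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((500, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = U @ np.diag([1e0, 1e-2, 1e-4, 1e-6]) @ V.T   # cond(A) = 1e6

print(np.linalg.cond(A))         # ~1e6
print(np.linalg.cond(A.T @ A))   # ~1e12: squared by the normal equations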
Tall-and-Skinny QR
Communication-avoiding TSQR (Demmel et al. 2008)

First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
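A minimal numpy sketch of the two steps; the block sizes are arbitrary, and a real implementation distributes the local factorizations.

import numpy as np

rng = np.random.default_rng(2)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]  # A split by rows

# First, QR factorizations of each local matrix; keep only the R factors.
Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]

# Second, a QR factorization of the stacked R factors gives R of the full A.
R = np.linalg.qr(np.vstack(Rs), mode='r')

# R agrees with a direct QR of A, up to the signs of its rows.
R_direct = np.linalg.qr(np.vstack(blocks), mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))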
Fully serial TSQR (Demmel et al. 2008)

Compute the QR of the first block A1, read the next block A2, update the QR factorization, and so on.

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
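The serial variant as a numpy sketch: hold only the current R and fold in one block at a time (sizes again illustrative).

import numpy as np

rng = np.random.default_rng(3)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]

R = np.linalg.qr(blocks[0], mode='r')    # QR of the first block
for Ai in blocks[1:]:                    # read a block, update the QR
    R = np.linalg.qr(np.vstack([R, Ai]), mode='r')

R_direct = np.linalg.qr(np.vstack(blocks), mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))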
Tall-and-skinny matrix storage in MapReduce

Key: an arbitrary row id. Value: the array for a row. The matrix is stored as blocks of rows A1, A2, A3, A4; each submatrix is an input split.

You can also store multiple rows together. It goes a little faster.
[Diagram: Mapper 1 (serial TSQR) stacks A1 and A2, factors qr → Q2, R2; stacks R2 with A3, factors qr → Q3, R3; stacks R3 with A4, factors qr → Q4, R4; emits R4. Mapper 2 (serial TSQR) does the same on A5, A6, A7, A8 and emits R8. Reducer 1 (serial TSQR) collects R4 and R8, factors qr → Q, R, and emits the final R.]

Algorithm:
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
Key limitation: computes only R and not Q.

Can get Q via Q = AR⁻¹ with another MapReduce iteration. Numerical stability: dubious, since ‖QᵀQ − I‖ is large, although iterative refinement helps.

Achieving numerical stability
[Plot: ‖QᵀQ − I‖ versus condition number (10⁵ to 10²⁰) for three methods: AR⁻¹, AR⁻¹ + iterative refinement, and direct TSQR.]
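The effect is easy to reproduce in numpy; a sketch with a synthetically ill-conditioned test matrix, chosen to mimic the plot's x-axis.

import numpy as np

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((2000, 10)))
A = U * np.logspace(0, -9, 10)          # scale columns: cond(A) ~ 1e9

R = np.linalg.qr(A, mode='r')
Q_ar = A @ np.linalg.inv(R)             # Q via A R^{-1}
Q_hh, _ = np.linalg.qr(A)               # Householder QR (stable)

print(np.linalg.norm(Q_ar.T @ Q_ar - np.eye(10)))  # grows with cond(A)
print(np.linalg.norm(Q_hh.T @ Q_hh - np.eye(10)))  # stays near machine eps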
Why MapReduce?
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # Compute a QR factorization of the buffered rows; keep only R.
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # Reset the buffer and re-initialize it to the rows of R.
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    # The methods below were elided on the slide; this is a minimal
    # reconstruction of the pattern (see github.com/arbenson/mrtsqr
    # for the actual code).
    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.collect(key, value)

    def close(self):
        self.compress()
        for row in self.data:
            yield random.random(), row
[Plot (fault injection experiment): time to completion in seconds (roughly 100 to 200) versus 1/Prob(failure), the mean number of successes per failure, for four configurations: no faults (200M by 200), faults (200M by 200), no faults (800M by 10), faults (800M by 10).]
With 1/5 tasks failing, the job only takes twice as long.
How to get Q?
Idea 1 (unstable)

Run TSQR to get R, then distribute R and apply R⁻¹ in a second pass: Mapper 1 computes Q1 = A1 R⁻¹, Q2 = A2 R⁻¹, Q3 = A3 R⁻¹, Q4 = A4 R⁻¹.
Idea 2 (better)

First pass, as in Idea 1: TSQR gives R; distribute R and compute Qi = Ai R⁻¹ in Mapper 1.

Then apply one step of iterative refinement: a second TSQR pass on the computed Q gives T; distribute T and compute Qi ← Qi T⁻¹ in Mapper 2.

There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
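A numpy sketch of both passes on a single machine; the per-mapper blocking is elided and the conditioning is synthetic.

import numpy as np

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((2000, 10)))
A = U * np.logspace(0, -7, 10)          # ill-conditioned test matrix

# Pass 1 (Idea 1): TSQR gives R; each mapper would form Q_i = A_i R^{-1}.
R = np.linalg.qr(A, mode='r')
Q = A @ np.linalg.inv(R)
print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # noticeably non-orthogonal

# Pass 2 (Idea 2): one refinement step: TSQR of Q gives T; Q <- Q T^{-1}.
T = np.linalg.qr(Q, mode='r')
Q = Q @ np.linalg.inv(T)
print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # orthogonality restored
print(np.allclose(Q @ (T @ R), A))            # A = Q (T R) still holds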
Recall communication-avoiding TSQR (Demmel et al. 2008): first do QR factorizations of each local matrix, then compute a QR factorization of the new "R".
Idea 3 (best!)

Direct TSQR:
1. Output local Q and R in separate files: each mapper computes a local QR (A1 → Q1 R1, …, A4 → Q4 R4), writing the Q factors to a Q output and the R factors to an R output.
2. Collect R on one node and compute Qs for each piece: Task 2 stacks R1, …, R4, computes their QR factorization, and outputs the final R along with the inner factors Q11, Q21, Q31, Q41 (one piece per block).
3. Distribute the pieces of Q*1 and form the true Q: Mapper 3 multiplies each stored local Qi with its piece Qi1.
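A numpy sketch of the three steps; everything here runs in one process, whereas in the MapReduce version steps 1 and 3 are map passes and step 2 is a single task.

import numpy as np

rng = np.random.default_rng(6)
blocks = [rng.standard_normal((250, 8)) for _ in range(4)]
n = 8

# Step 1: each mapper computes a local QR and stores both factors.
local_qrs = [np.linalg.qr(Ai) for Ai in blocks]

# Step 2: one task stacks the R factors, computes their QR, and splits the
# inner Q into one n-by-n piece per block.
Q_inner, R = np.linalg.qr(np.vstack([Ri for _, Ri in local_qrs]))
pieces = np.split(Q_inner, len(blocks))

# Step 3: each mapper forms its piece of the true Q.
Q = np.vstack([Qi @ piece for (Qi, _), piece in zip(local_qrs, pieces)])

print(np.allclose(Q @ R, np.vstack(blocks)))   # A = QR
print(np.linalg.norm(Q.T @ Q - np.eye(n)))     # near machine eps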
The price is right!

[Plot: running time in seconds (roughly 500 to 2500) for full TSQR versus refinement, as the number of columns grows.]

Full TSQR is faster than refinement for few columns, and not any slower for many columns.
What can we do now?
PCA of 80,000,000 images

A: 80,000,000 images (rows) by 1000 pixels (columns).

MapReduce: zero-mean the rows, then TSQR produces R.
Post-processing: the SVD of R gives the singular values and V.

[Figure: the first 16 columns of V as images, and the top 100 singular values (principal components).]
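The post-processing step relies on the fact that R from TSQR has the same singular values (and right singular vectors, up to signs) as A. A small-scale numpy sketch; the sizes are scaled down from the 80M-by-1000 problem.

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100000, 50))
A -= A.mean(axis=0)                     # the zero-mean (centering) step

R = np.linalg.qr(A, mode='r')           # produced by TSQR at scale
_, S, Vt = np.linalg.svd(R)             # small 50x50 SVD

print(np.allclose(S, np.linalg.svd(A, compute_uv=False)))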
A Large Scale Example

Nonlinear heat transfer model: 80k nodes, 300 time steps, 104 basis runs. SVD of a 24M-by-104 data matrix. 500x reduction in wall-clock time (100x including the SVD).
What's next?

Investigate randomized algorithms for computing SVDs of fatter matrices.
Algorithm: Randomized PCA

Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣVᵀ. The columns of U estimate the first 2k principal components of A.

Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AAᵀ)^q AΩ by multiplying alternately with A and Aᵀ.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Stage B:
1. Form B = QᵀA.
2. Compute an SVD of the small matrix: B = ŨΣVᵀ.
3. Set U = QŨ.
The singular spectrum of the data matrix often decays quite slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1 or 2 is usually sufficient in practice. The complete scheme appears above as the Randomized PCA algorithm.

This procedure requires only 2(q + 1) passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count is

T_randPCA ≈ qk T_mult + k²(m + n),

where T_mult is the cost of a matrix–vector multiply with A or Aᵀ.

Theorem 1.2. Suppose that A is a real m × n matrix. Select an exponent q and a target number k of principal components, where 2 ≤ k ≤ 0.5 min{m, n}. Execute the randomized PCA algorithm to obtain a rank-2k factorization UΣVᵀ. Then

E ‖A − UΣVᵀ‖ ≤ [1 + 4 √(2 min{m, n} / (k − 1))]^(1/(2q+1)) σ_(k+1),

where E denotes expectation with respect to the random test matrix and σ_(k+1) is the (k + 1)th singular value of A.

The power iteration drives the leading constant in this bound to one exponentially fast as the power q increases. Since the rank-k approximation of A can never achieve an error smaller than σ_(k+1), the randomized procedure estimates 2k principal components that carry essentially as much variance as the first k actual principal components. Truncating the approximation to the first k terms typically produces very accurate results.
Halko, Martinsson, Tropp. SIREV 2011
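A direct numpy transcription of the algorithm above. For numerical robustness, production code also re-orthonormalizes between the power-iteration multiplies; this sketch follows the excerpt literally, and the test sizes are arbitrary.

import numpy as np

def randomized_pca(A, k, q=1):
    m, n = A.shape
    rng = np.random.default_rng(8)
    # Stage A: Gaussian test matrix, power iteration, orthonormal basis.
    Omega = rng.standard_normal((n, 2 * k))
    Y = A @ Omega
    for _ in range(q):                  # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Stage B: SVD of the small matrix B = Q^T A.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt             # rank-2k factorization U S V^T

A = np.random.default_rng(9).standard_normal((5000, 300))
U, S, Vt = randomized_pca(A, k=10)
print(np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))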
Questions?
Most recent code at http://github.com/arbenson/mrtsqr