RHadoop and MapR

R user group 2011 09




Talk given in September 2011 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm from an initial implementation in R to a proposed map-reduce implementation that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.
Transcript
Page 1: R user group 2011 09


RHadoop and MapR

Page 2: R user group 2011 09


The bad old days (i.e. now)

• Hadoop is a silo

• HDFS isn’t a normal file system

• Hadoop doesn’t really like C++

• R is limited

• One machine, one memory space

• Isn’t there any way we can just get along?

Page 3: R user group 2011 09


The white knight

• MapR changes things

• Lots of new stuff like snapshots, NFS

• All you need to know, you already know

• NFS provides cluster-wide file access

• Everything works the way you expect

• Performance high enough to use as a message bus

Page 4: R user group 2011 09


Example: out-of-core SVD

• SVD provides compressed matrix form

• Based on sum of rank-1 matrices

A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + e

[Figure: a matrix pictured as the sum of two scaled rank-1 outer products plus a small residual]
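A minimal R illustration of this rank-1 expansion (not from the slides; the matrix sizes are arbitrary): build a nearly rank-2 matrix and rebuild it from the first two terms using base svd().

set.seed(1)
u = matrix(rnorm(100 * 2), nrow = 100)
v = matrix(rnorm(50 * 2), nrow = 50)
# nearly rank-2 matrix plus a small residual e
A = 5 * u[, 1] %*% t(v[, 1]) + 2 * u[, 2] %*% t(v[, 2]) +
    0.01 * matrix(rnorm(100 * 50), nrow = 100)
s = svd(A)
# first two rank-1 terms: sigma_1 u_1 v_1' + sigma_2 u_2 v_2'
A2 = s$d[1] * s$u[, 1] %*% t(s$v[, 1]) + s$d[2] * s$u[, 2] %*% t(s$v[, 2])
norm(A - A2, "F") / norm(A, "F")   # small relative residual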

Page 5: R user group 2011 09


More on SVD

• SVD provides a very nice basis

Ax = A \left[ \sum_i a_i v_i \right] = \left[ \sum_j \sigma_j u_j v_j' \right] \left[ \sum_i a_i v_i \right] = \sum_i a_i \sigma_i u_i

Page 6: R user group 2011 09


• And a nifty approximation property

Ax = \sigma_1 a_1 u_1 + \sigma_2 a_2 u_2 + \sum_{i>2} \sigma_i a_i u_i

\|e\|^2 \le \sum_{i>2} \sigma_i^2
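A quick numerical check of this bound, sketched with base R (not part of the talk): for a unit vector x, the squared error of the two-term approximation of Ax stays below the sum of the squared tail singular values.

set.seed(2)
A = matrix(rnorm(200 * 50), nrow = 200)
s = svd(A)
x = rnorm(50); x = x / sqrt(sum(x^2))   # unit vector
a = as.vector(t(s$v) %*% x)             # coordinates a_i of x in the basis v_i
Ax2 = s$d[1] * a[1] * s$u[, 1] + s$d[2] * a[2] * s$u[, 2]
err2 = sum((A %*% x - Ax2)^2)
c(err2, sum(s$d[-(1:2)]^2))             # the first value never exceeds the second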

Page 7: R user group 2011 09


Also known as …

• Latent Semantic Indexing

• PCA

• Eigenvectors

Page 8: R user group 2011 09


An application, approximate translation

• Translation distributes over concatenation

• But counting turns concatenation into addition

• This means that translation is linear!

T(s_1 | s_2) = T(s_1) | T(s_2)

k(s_1 | s_2) = k(s_1) + k(s_2)

k(T(s_1 | s_2)) = k(T(s_1)) + k(T(s_2))
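A toy k() in R makes the counting point concrete (illustrative only, not the talk's actual feature extractor): counting character bigrams turns concatenation into addition, up to the single bigram that straddles the join.

bigrams = function(s) {
  ch = strsplit(s, "")[[1]]
  table(paste0(ch[-length(ch)], ch[-1]))   # counts of adjacent character pairs
}
s1 = "hello "; s2 = "world"
c(sum(bigrams(paste0(s1, s2))), sum(bigrams(s1)) + sum(bigrams(s2)))
# 10 vs 9: identical except for the one bigram crossing the boundary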

Page 9: R user group 2011 09


ish

Page 10: R user group 2011 09


Traditional computation

• Products of A are dominated by large singular values and corresponding vectors

• Subtracting these dominant singular values allows the next ones to appear

• Lanczos method, generally Krylov sub-space

A \left( A'A \right)^n = U \Sigma^{2n+1} V'
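A minimal power-iteration sketch of the same idea (this is not the Lanczos/Krylov code itself): repeated multiplication by A'A amplifies the leading singular direction, which is what the exponent 2n+1 above expresses.

set.seed(3)
A = matrix(rnorm(100 * 30), nrow = 100)
v = rnorm(30)
for (n in 1:50) {
  v = t(A) %*% (A %*% v)     # one application of A'A
  v = v / sqrt(sum(v^2))     # re-normalize to avoid overflow
}
abs(sum(v * svd(A)$v[, 1]))  # essentially 1: v has converged to the top right singular vector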

Page 11: R user group 2011 09


But …

Page 12: R user group 2011 09


The gotcha

• Iteration in Hadoop is death

• Huge process invocation costs

• Lose all memory residency of data

• Total lost cause

Page 13: R user group 2011 09


Randomness to the rescue

• To save the day, run all iterations at the same time

Y = AW

QR = Y

B = Q'A

U \Sigma V' = B

(QU) \Sigma V' \approx A

Page 14: R user group 2011 09


In R

lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  # project onto k + p random directions
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
  y.qr = qr(y)
  # B = Q'A, then a second QR and a small SVD
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  svd = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
       d = svd$d[1:k],
       v = qr.Q(b.qr) %*% svd$v[, 1:k])
}
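A hypothetical sanity check of lsa() (sizes and names chosen here, not from the slides): build a matrix with a decaying spectrum and compare the recovered singular values against base svd().

set.seed(4)
u0 = qr.Q(qr(matrix(rnorm(1000 * 20), nrow = 1000)))
v0 = qr.Q(qr(matrix(rnorm(100 * 20), nrow = 100)))
a = u0 %*% diag(20:1) %*% t(v0) + 0.01 * matrix(rnorm(1000 * 100), nrow = 1000)
fit = lsa(a, k = 5, p = 10)
round(cbind(fit$d, svd(a)$d[1:5]), 3)   # the two columns should agree closely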

Page 15: R user group 2011 09


Not good enough yet

• Limited to memory size

• After memory limits, feature extraction dominates

Page 16: R user group 2011 09


Hybrid architecture

[Diagram: input and side-data flow through feature extraction and down-sampling and a data join, all in map-reduce; the joined result is handed, via NFS, to a sequential SVD.]
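A sketch of the NFS hand-off implied by this architecture (the mount point and file name are invented): with the cluster file system exported over NFS, R reads the map-reduce output as an ordinary file and runs the sequential SVD from the In R slide.

features = as.matrix(read.table("/mapr/my.cluster.com/features/part-00000"))
fit = lsa(features, k = 20, p = 10)   # lsa() as defined on the In R slide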

Page 17: R user group 2011 09


Hybrid architecture

[Diagram: the same pipeline (input and side-data, feature extraction and down-sampling, data join in map-reduce), now handing results via NFS to both the sequential SVD and R for visualization.]

Page 18: R user group 2011 09


Randomness to the rescue

• To save the day again, use blocks

Y_i = A_i W

R'R = Y'Y = \sum_i Y_i' Y_i

B_j = \sum_i \left( A_i W R^{-1} \right)' A_{ij}

LL' = BB'

U \Sigma V' = L

\left( A W R^{-1} U \right) \Sigma \left( V' L^{-1} B \right) \approx A
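An in-memory R sketch of these block-wise formulas (a translation with simplifications, not the talk's production code: the row blocks A_i are just pieces of one matrix held in a list, B is kept whole rather than split into column blocks B_j, and each sum would become a map-reduce pass on the cluster).

block.svd = function(blocks, k, p) {
  m = ncol(blocks[[1]])
  W = matrix(rnorm(m * (k + p)), nrow = m)
  # R'R = Y'Y = sum_i Y_i' Y_i, accumulated block by block
  R = chol(Reduce(`+`, lapply(blocks, function(Ai) crossprod(Ai %*% W))))
  Ri = solve(R)
  # B = sum_i (A_i W R^-1)' A_i
  B = Reduce(`+`, lapply(blocks, function(Ai) t(Ai %*% W %*% Ri) %*% Ai))
  L = t(chol(tcrossprod(B)))            # L L' = B B'
  s = svd(L)                            # U S V' = L
  u = do.call(rbind, lapply(blocks, function(Ai) Ai %*% W %*% Ri %*% s$u[, 1:k]))
  v = t(t(s$v) %*% solve(L) %*% B)[, 1:k]
  list(u = u, d = s$d[1:k], v = v)
}

a = matrix(rnorm(600 * 40), nrow = 600)
fit = block.svd(list(a[1:200, ], a[201:400, ], a[401:600, ]), k = 5, p = 10)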

Page 19: R user group 2011 09


Hybrid architecture

[Diagram: map-reduce feature extraction and down-sampling feeds, via NFS, R for visualization and a block-wise parallel SVD, itself run as map-reduce.]

Page 20: R user group 2011 09


Conclusions

• Inter-operability allows massive scalability

• Prototyping in R not wasted

• Map-reduce iteration not needed for SVD

• Feasible scale ~10^9 non-zeros or more