Matrix Factorization via SGD


Transcript

Matrix Factorization via SGD

BACKGROUND

Recovering latent factors in a matrix

[slide figure: an n-users × m-movies ratings matrix V is approximated as V ~ W H, where W is n × r (one row of latent factors per user) and H is r × m (one column of latent factors per movie) for a small rank r]

V[i,j] = user i's rating of movie j
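
A minimal NumPy sketch of this setup (the sizes n, m, r below are illustrative, not from the slides):

import numpy as np

n, m, r = 5, 4, 2                 # n users, m movies, small rank r
rng = np.random.default_rng(0)
W = rng.normal(size=(n, r))       # one row of latent factors per user
H = rng.normal(size=(r, m))       # one column of latent factors per movie
V_hat = W @ H                     # n x m matrix of predicted ratings
print(V_hat.shape)                # (5, 4); V_hat[i, j] approximates V[i, j]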

MF VIA SGD

Matrix factorization as SGD: pick a random observed entry (i,j), compute the local gradient of the loss at that single entry, scale it up by N (the number of observed entries) so it approximates the full gradient, and update row i of W and column j of H by a step size times that scaled gradient.
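
A minimal sketch of one such pass, assuming squared loss on the observed entries (the slides leave the loss open; see "What loss functions are possible?" below). The names observed and step are illustrative:

import numpy as np

def sgd_epoch(V, W, H, observed, step):
    # One SGD pass over the observed (i, j) entries for V ~ W H, squared loss.
    N = len(observed)                            # number of observed ratings
    for t in np.random.permutation(N):
        i, j = observed[t]
        err = V[i, j] - W[i, :] @ H[:, j]        # residual at this single entry
        grad_W = -2.0 * N * err * H[:, j]        # local gradient, scaled up by N
        grad_H = -2.0 * N * err * W[i, :]
        W[i, :] -= step * grad_W                 # only row i of W changes
        H[:, j] -= step * grad_H                 # only column j of H changes
    return W, H

In practice the factor N is usually folded into the step size; it is kept explicit here to match the "scaled up by N to approximate gradient" note.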

Key claim: the update for entry (i,j) touches only row i of W and column j of H, so updates for entries that share no row or column do not interfere with each other.

What loss functions are possible?

ALS = alternating least squares
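
For contrast, a sketch of one ALS sweep, assuming V is fully observed so each half-step is an ordinary least-squares solve (with missing entries you would solve per row and per column instead):

import numpy as np

def als_sweep(V, W, H):
    # One alternating-least-squares sweep for V ~ W H (fully observed V).
    H = np.linalg.lstsq(W, V, rcond=None)[0]        # fix W, solve for H
    W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T  # fix H, solve for W
    return W, H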

DISTRIBUTED MF VIA SGD

talk pilfered from …..

KDD 2011

NAACL 2010

Parallel Perceptrons

• Simplest idea:
  – Split the data into S "shards"
  – Train a perceptron on each shard independently
    • the weight vectors are w(1), w(2), …
  – Produce some weighted average of the w(i)'s as the final result
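
A sketch of this simplest idea, assuming a binary perceptron with labels in {-1, +1} and a plain (unweighted) average of the per-shard weight vectors:

import numpy as np

def perceptron_pass(X, y, w):
    # One pass of the perceptron over (X, y), starting from weight vector w.
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:         # mistake: update
            w = w + y_i * x_i
    return w

def average_of_shards(shards, dim):
    # Train independently on each shard, then average the weight vectors.
    ws = [perceptron_pass(X, y, np.zeros(dim)) for X, y in shards]
    return np.mean(ws, axis=0)           # uniform weights; other weightings possible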

Parallelizing perceptrons

[slide figure: the instances/labels are split into example subsets 1, 2, 3; a weight vector is computed on each subset (vk-1, vk-2, vk-3); these are combined by some sort of weighted averaging into a single vk]

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.

• Split the data into shards
• Let w = 0
• For n = 1, …
  • Train a perceptron on each shard with one pass, starting from w
  • Average the weight vectors (somehow) and let w be that average

Extra communication cost:
• redistributing the weight vectors
• done less frequently than if fully synchronized, more frequently than if fully parallelized
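
A sketch of take 2 (iterative parameter mixing), reusing a one-pass trainer like perceptron_pass above; the number of rounds and the uniform average are assumptions:

import numpy as np

def iterative_mixing(shards, dim, rounds, one_pass):
    # Each round: run one perceptron pass per shard starting from the shared w,
    # then set w to the (uniform) average of the resulting weight vectors.
    w = np.zeros(dim)
    for _ in range(rounds):
        ws = [one_pass(X, y, w.copy()) for X, y in shards]   # parallelizable
        w = np.mean(ws, axis=0)                              # the mixing step
    return w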

All-Reduce
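
The averaging step is exactly what an all-reduce provides. A sketch using mpi4py (an assumption here; the slides do not name a transport), where each worker holds its locally trained weight vector:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
w_local = np.zeros(100)        # this worker's weight vector after its local pass
# ... train w_local on the local shard ...
comm.Allreduce(MPI.IN_PLACE, w_local, op=MPI.SUM)   # elementwise sum across workers
w_local /= comm.Get_size()     # every worker now holds the same averaged w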

Parallelizing perceptrons – take 2

[slide figure: as before, the instances/labels are split into example subsets and local weight vectors w-1, w-2, w-3 are computed on each, but each pass now starts from the previous averaged w; the local vectors are combined by some sort of weighted averaging into the new w]

Similar to McDonald et al. (NAACL 2010) with perceptron learning

Slow convergence…..

More detail…

• Initialize W and H randomly – not at zero
• In each "sub-epoch", process the points of a stratum in a random order (a random sort)
• Pick the strata sequence by permuting the rows and columns of M, and using M'[k,i] as the column index of row i in sub-epoch k
• Use the "bold driver" to set the step size (sketched after this list):
  – increase the step size when the loss decreases (over an epoch)
  – decrease the step size when the loss increases

• Implemented in Hadoop and R/Snowfall
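
A single-machine sketch of that epoch structure, assuming a d×d blocking of V, squared loss, and an SGD routine like sgd_epoch above. The stratum schedule here is a simple rotation rather than the slides' M'[k,i] construction, and the bold-driver constants (1.05, 0.5) are illustrative:

import numpy as np

def stratified_epochs(V, W, H, observed, d, epochs, sgd_on_block, loss):
    # sgd_on_block(V, W, H, points, step) runs SGD over one stratum's points;
    # loss(V, W, H, observed) evaluates the training loss after an epoch.
    rng = np.random.default_rng(0)
    step, prev_loss = 0.01, np.inf             # illustrative starting step size
    for _ in range(epochs):
        perm = rng.permutation(d)              # strata schedule for this epoch
        for k in range(d):                     # d sub-epochs per epoch
            for rb in range(d):                # the d blocks of a stratum share no
                cb = perm[(rb + k) % d]        # rows or columns -> parallelizable
                pts = [(i, j) for (i, j) in observed if i % d == rb and j % d == cb]
                order = rng.permutation(len(pts))   # random sort within the stratum
                sgd_on_block(V, W, H, [pts[t] for t in order], step)
        cur = loss(V, W, H, observed)
        step = step * 1.05 if cur < prev_loss else step * 0.5   # bold driver
        prev_loss = cur
    return W, H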


Wall Clock Time: 8 nodes, 64 cores, R/snow, in-memory implementation [plot]

Number of Epochs [plot]

Varying rank: 100 epochs for all [plot]

Wall Clock Time: Hadoop, one map-reduce job per epoch [plot]

Hadoop scalability: Hadoop process setup time starts to dominate [plot]

Hadoop scalability [plot]
