Matrix Factorization via SGD


Transcript

Matrix Factorization via SGD

BACKGROUND

Recovering latent factors in a matrix

[slide figure: an n-users × m-movies ratings matrix V is approximated as V ~ W H, where W is n × r (one row of latent factors per user) and H is r × m (one column of latent factors per movie) for a small rank r]

V[i,j] = user i's rating of movie j
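
A minimal NumPy sketch of this setup (the sizes n, m, r below are illustrative, not from the slides):

import numpy as np

n, m, r = 5, 4, 2                 # n users, m movies, small rank r
rng = np.random.default_rng(0)
W = rng.normal(size=(n, r))       # one row of latent factors per user
H = rng.normal(size=(r, m))       # one column of latent factors per movie
V_hat = W @ H                     # n x m matrix of predicted ratings
print(V_hat.shape)                # (5, 4); V_hat[i, j] approximates V[i, j]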

MF VIA SGD

Matrix factorization as SGD: pick a random observed entry (i,j), compute the local gradient of the loss at that single entry, scale it up by N (the number of observed entries) so it approximates the full gradient, and update row i of W and column j of H by a step size times that scaled gradient.
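
A minimal sketch of one such pass, assuming squared loss on the observed entries (the slides leave the loss open; see "What loss functions are possible?" below). The names observed and step are illustrative:

import numpy as np

def sgd_epoch(V, W, H, observed, step):
    # One SGD pass over the observed (i, j) entries for V ~ W H, squared loss.
    N = len(observed)                            # number of observed ratings
    for t in np.random.permutation(N):
        i, j = observed[t]
        err = V[i, j] - W[i, :] @ H[:, j]        # residual at this single entry
        grad_W = -2.0 * N * err * H[:, j]        # local gradient, scaled up by N
        grad_H = -2.0 * N * err * W[i, :]
        W[i, :] -= step * grad_W                 # only row i of W changes
        H[:, j] -= step * grad_H                 # only column j of H changes
    return W, H

In practice the factor N is usually folded into the step size; it is kept explicit here to match the "scaled up by N to approximate gradient" note.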

Key claim: the update for entry (i,j) touches only row i of W and column j of H, so updates for entries that share no row or column do not interfere with each other.

What loss functions are possible?

ALS = alternating least squares
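
For contrast, a sketch of one ALS sweep, assuming V is fully observed so each half-step is an ordinary least-squares solve (with missing entries you would solve per row and per column instead):

import numpy as np

def als_sweep(V, W, H):
    # One alternating-least-squares sweep for V ~ W H (fully observed V).
    H = np.linalg.lstsq(W, V, rcond=None)[0]        # fix W, solve for H
    W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T  # fix H, solve for W
    return W, H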

DISTRIBUTED MF VIA SGD

talk pilfered from …..

KDD 2011

NAACL 2010

Parallel Perceptrons

• Simplest idea:
  – Split the data into S "shards"
  – Train a perceptron on each shard independently
    • the weight vectors are w(1), w(2), …
  – Produce some weighted average of the w(i)'s as the final result
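
A sketch of this simplest idea, assuming a binary perceptron with labels in {-1, +1} and a plain (unweighted) average of the per-shard weight vectors:

import numpy as np

def perceptron_pass(X, y, w):
    # One pass of the perceptron over (X, y), starting from weight vector w.
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:         # mistake: update
            w = w + y_i * x_i
    return w

def average_of_shards(shards, dim):
    # Train independently on each shard, then average the weight vectors.
    ws = [perceptron_pass(X, y, np.zeros(dim)) for X, y in shards]
    return np.mean(ws, axis=0)           # uniform weights; other weightings possible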

Parallelizing perceptrons

[slide figure: the instances/labels are split into example subsets 1, 2, 3; a weight vector is computed on each subset (vk-1, vk-2, vk-3); these are combined by some sort of weighted averaging into a single vk]

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.

• Split the data into shards
• Let w = 0
• For n = 1, …
  • Train a perceptron on each shard with one pass, starting from w
  • Average the weight vectors (somehow) and let w be that average

Extra communication cost:
• redistributing the weight vectors
• done less frequently than if fully synchronized, more frequently than if fully parallelized
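
A sketch of take 2 (iterative parameter mixing), reusing a one-pass trainer like perceptron_pass above; the number of rounds and the uniform average are assumptions:

import numpy as np

def iterative_mixing(shards, dim, rounds, one_pass):
    # Each round: run one perceptron pass per shard starting from the shared w,
    # then set w to the (uniform) average of the resulting weight vectors.
    w = np.zeros(dim)
    for _ in range(rounds):
        ws = [one_pass(X, y, w.copy()) for X, y in shards]   # parallelizable
        w = np.mean(ws, axis=0)                              # the mixing step
    return w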

All-Reduce
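
The averaging step is exactly what an all-reduce provides. A sketch using mpi4py (an assumption here; the slides do not name a transport), where each worker holds its locally trained weight vector:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
w_local = np.zeros(100)        # this worker's weight vector after its local pass
# ... train w_local on the local shard ...
comm.Allreduce(MPI.IN_PLACE, w_local, op=MPI.SUM)   # elementwise sum across workers
w_local /= comm.Get_size()     # every worker now holds the same averaged w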

Parallelizing perceptrons – take 2

[slide figure: as before, the instances/labels are split into example subsets and local weight vectors w-1, w-2, w-3 are computed on each, but each pass now starts from the previous averaged w; the local vectors are combined by some sort of weighted averaging into the new w]

Similar to McDonald et al. (NAACL 2010) with perceptron learning

Slow convergence…..

More detail…

• Initialize W and H randomly – not at zero
• In each "sub-epoch", process the points of a stratum in a random order (a random sort)
• Pick the strata sequence by permuting the rows and columns of M, and using M'[k,i] as the column index of row i in sub-epoch k
• Use the "bold driver" to set the step size (sketched after this list):
  – increase the step size when the loss decreases (over an epoch)
  – decrease the step size when the loss increases

• Implemented in Hadoop and R/Snowfall
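
A single-machine sketch of that epoch structure, assuming a d×d blocking of V, squared loss, and an SGD routine like sgd_epoch above. The stratum schedule here is a simple rotation rather than the slides' M'[k,i] construction, and the bold-driver constants (1.05, 0.5) are illustrative:

import numpy as np

def stratified_epochs(V, W, H, observed, d, epochs, sgd_on_block, loss):
    # sgd_on_block(V, W, H, points, step) runs SGD over one stratum's points;
    # loss(V, W, H, observed) evaluates the training loss after an epoch.
    rng = np.random.default_rng(0)
    step, prev_loss = 0.01, np.inf             # illustrative starting step size
    for _ in range(epochs):
        perm = rng.permutation(d)              # strata schedule for this epoch
        for k in range(d):                     # d sub-epochs per epoch
            for rb in range(d):                # the d blocks of a stratum share no
                cb = perm[(rb + k) % d]        # rows or columns -> parallelizable
                pts = [(i, j) for (i, j) in observed if i % d == rb and j % d == cb]
                order = rng.permutation(len(pts))   # random sort within the stratum
                sgd_on_block(V, W, H, [pts[t] for t in order], step)
        cur = loss(V, W, H, observed)
        step = step * 1.05 if cur < prev_loss else step * 0.5   # bold driver
        prev_loss = cur
    return W, H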


Wall Clock Time: 8 nodes, 64 cores, R/snow, in-memory implementation [plot]

Number of Epochs [plot]

Varying rank: 100 epochs for all [plot]

Wall Clock Time: Hadoop, one map-reduce job per epoch [plot]

Hadoop scalability: Hadoop process setup time starts to dominate [plot]

Hadoop scalability [plot]
