# Large-scale Matrix Factorization

May 20, 2020


• Large-scale Matrix Factorization

Kijung Shin

Ph.D. Student, CSD

• Roadmap

• Matrix Factorization (review)

• Algorithms

◦ Distributed SGD: DSGD

◦ Alternating Least Squares: ALS

◦ Cyclic Coordinate Descent: CCD++

• Experiments

• Extension to Tensor Factorization

• Conclusions

Large-scale Matrix Factorization (by Kijung Shin) 2/99

• Roadmap: Matrix Factorization (review)

• Matrix Factorization: Problem

• Given:

◦ 𝑽: 𝑛 by 𝑚 matrix

▪ possibly with missing values

◦ 𝑟: target rank (a scalar)

▪ usually 𝑟


• Matrix Factorization: Problem

• Goal: 𝑾𝑯 ≈ 𝑽

[Figure: 𝑽 ≈ 𝑾 × 𝑯, where the missing entries of 𝑽 are marked "?"]

• Loss Function

$$L(V, W, H) = \sum_{(i,j) \in Z} \left(V_{ij} - [WH]_{ij}\right)^2 + \dots$$

Goal: to make 𝑾𝑯 similar to 𝑽

◦ $Z$: indices of non-missing entries, i.e., $(i, j) \in Z \leftrightarrow V_{ij}$ is not missing

◦ $[WH]_{ij}$: $(i, j)$-th entry of $WH$

◦ $V_{ij}$: $(i, j)$-th entry of $V$

• Loss Function (cont.)

$$L(V, W, H) = \sum_{(i,j) \in Z} \left(V_{ij} - [WH]_{ij}\right)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$

Goal of the second term: to prevent overfitting (by making the entries of 𝑊 and 𝐻 close to zero)

◦ $\lambda$: regularization parameter

◦ Frobenius norm:

$$\|W\|_F^2 = \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2$$
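As a rough illustration, the regularized loss on this slide can be evaluated as sketched below. This is a minimal plain-Python sketch, not code from the lecture: the function name `mf_loss`, the dictionary representation of the non-missing entries, and the toy values are all assumptions made for the example.

```python
def mf_loss(V, W, H, lam):
    """Regularized squared loss over the non-missing entries of V.

    V   : dict mapping (i, j) -> value; its keys play the role of Z
    W   : n x r matrix stored as a list of rows
    H   : r x m matrix stored as a list of rows
    lam : regularization parameter (lambda)
    """
    r = len(H)
    # Squared error, summed only over the non-missing entries (i, j) in Z.
    err = 0.0
    for (i, j), v in V.items():
        pred = sum(W[i][k] * H[k][j] for k in range(r))  # (i, j)-th entry of WH
        err += (v - pred) ** 2
    # Squared Frobenius norms: sums of all squared entries of W and H.
    w_norm = sum(x * x for row in W for x in row)
    h_norm = sum(x * x for row in H for x in row)
    return err + lam * (w_norm + h_norm)


# Toy 2 x 2 matrix (hypothetical values) with entry (1, 0) missing, rank r = 1.
V = {(0, 0): 2.0, (0, 1): 4.0, (1, 1): 2.0}
W = [[2.0], [1.0]]
H = [[1.0, 2.0]]
loss = mf_loss(V, W, H, lam=0.1)
```

In this toy case 𝑾𝑯 reproduces every non-missing entry exactly, so the whole loss comes from the regularization term. That shows why a nonzero 𝜆 pulls the entries of 𝑊 and 𝐻 toward zero even when the fit is perfect.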

• Algorithms

• How can we minimize this loss function 𝐿?

◦ Stochastic Gradient Descent: SGD (covered in the last lecture)

◦ Alternating Least Squares: ALS (covered today)

◦ Cyclic Coordinate Descent: CCD++ (covered today)

• Are these algorithms parallelizable?

◦ Yes, all of them!

• Roadmap: Algorithms

◦ Distributed SGD: DSGD

• Distributed SGD (DSGD)

Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent (KDD 2011)

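• Stochastic GD for MF (review)

• Let

$$W = \begin{bmatrix} - & W_{1:} & - \\ & \vdots & \\ - & W_{n:} & - \end{bmatrix}, \quad H = \begin{bmatrix} | & & | \\ H_{:1} & \cdots & H_{:m} \\ | & & | \end{bmatrix}$$

• $L(V, W, H)$: sum of loss for each non-missing entry

$$L(V, W, H) = \sum_{(i,j) \in Z} L'(V_{ij}, W_{i:}, H_{:j})$$

where loss at each non-missing entry $V_{ij}$ is

$$L'(V_{ij}, W_{i:}, H_{:j}) := \left(V_{ij} - W_{i:} H_{:j}\right)^2 + \lambda \left( \frac{\|W_{i:}\|^2}{N_{i:}} + \frac{\|H_{:j}\|^2}{N_{:j}} \right)$$

◦ $W_{i:} H_{:j} = [WH]_{ij}$

◦ $N_{i:}$: number of non-missing entries in the $i$-th row of $V$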
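One way to see why the regularization terms are divided by $N_{i:}$ and $N_{:j}$: row $W_{i:}$ appears in exactly $N_{i:}$ of the per-entry losses, so the division makes the pieces add back up to the original penalty. The short derivation below assumes that every row and every column of $V$ has at least one non-missing entry, and reads $N_{:j}$, by analogy with $N_{i:}$, as the number of non-missing entries in the $j$-th column of $V$; both are assumptions added here rather than statements from the slide.

$$
\sum_{(i,j) \in Z} \frac{\|W_{i:}\|^2}{N_{i:}}
= \sum_{i=1}^{n} N_{i:} \cdot \frac{\|W_{i:}\|^2}{N_{i:}}
= \sum_{i=1}^{n} \|W_{i:}\|^2
= \|W\|_F^2,
\qquad
\sum_{(i,j) \in Z} \frac{\|H_{:j}\|^2}{N_{:j}}
= \sum_{j=1}^{m} \|H_{:j}\|^2
= \|H\|_F^2
$$

Summing $L'$ over all $(i, j) \in Z$ therefore gives the squared-error term plus $\lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$, which is the loss $L(V, W, H)$ defined earlier.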

• Stochastic GD for MF (cont.)

• Stochastic Gradient Descent (SGD) for MF

• repeat until convergence

◦ randomly shuffle the non-missing entries of 𝑉

◦ for each non-missing entry:

▪ perform an SGD step on it

• Stochastic GD for MF (cont.)

• An SGD step on each non-missing entry 𝑉𝑖𝑗:

◦ Read 𝑊𝑖: and 𝐻:𝑗

◦ Compute gradient of 𝐿′(𝑉𝑖𝑗, 𝑊𝑖:, 𝐻:𝑗)

◦ Update 𝑊𝑖: and 𝐻:𝑗

▪ Detailed update rules were covered in the last lecture

[Figure: an SGD step on 𝑉𝑖𝑗 touches only row 𝑊𝑖: of 𝑾 and column 𝐻:𝑗 of 𝑯]
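The read, compute, update cycle above can be sketched as follows. The gradient used here is obtained by differentiating the per-entry loss 𝐿′ directly; it is not copied from the earlier lecture, so treat it as an assumption of this sketch. The names `sgd_step`, `sgd`, and `squared_error`, the learning rate `lr`, the fixed epoch count (used in place of a convergence test), and the toy data are hypothetical choices for illustration.

```python
import random


def sgd_step(V, W, H, i, j, n_row, n_col, lam, lr):
    """One SGD step on the non-missing entry V[(i, j)]; updates W and H in place."""
    r = len(H)
    # Read W_i: and H_:j.
    w = [W[i][k] for k in range(r)]
    h = [H[k][j] for k in range(r)]
    # Residual of the current prediction W_i: H_:j.
    err = V[(i, j)] - sum(w[k] * h[k] for k in range(r))
    for k in range(r):
        # Gradient of L'(V_ij, W_i:, H_:j) with respect to W_ik and H_kj.
        grad_w = -2.0 * err * h[k] + 2.0 * lam * w[k] / n_row[i]
        grad_h = -2.0 * err * w[k] + 2.0 * lam * h[k] / n_col[j]
        # Update W_i: and H_:j.
        W[i][k] = w[k] - lr * grad_w
        H[k][j] = h[k] - lr * grad_h


def sgd(V, W, H, lam=0.01, lr=0.05, epochs=200, seed=0):
    """Plain SGD for MF, run for a fixed number of epochs (an assumption)."""
    rng = random.Random(seed)
    # N_i: and N_:j, the counts of non-missing entries per row and column.
    n_row, n_col = {}, {}
    for (i, j) in V:
        n_row[i] = n_row.get(i, 0) + 1
        n_col[j] = n_col.get(j, 0) + 1
    entries = list(V)
    for _ in range(epochs):
        rng.shuffle(entries)  # randomly shuffle the non-missing entries
        for (i, j) in entries:
            sgd_step(V, W, H, i, j, n_row, n_col, lam, lr)


def squared_error(V, W, H):
    """Sum of squared errors over the non-missing entries."""
    r = len(H)
    return sum((v - sum(W[i][k] * H[k][j] for k in range(r))) ** 2
               for (i, j), v in V.items())


# Toy data (hypothetical): 2 x 2 matrix with entry (1, 0) missing, rank 1.
V = {(0, 0): 2.0, (0, 1): 4.0, (1, 1): 2.0}
W = [[0.5], [0.5]]
H = [[0.5, 0.5]]
before = squared_error(V, W, H)
sgd(V, W, H)
after = squared_error(V, W, H)
```

Each step reads and writes only one row of 𝑾 and one column of 𝑯, which is the property the parallel variants rely on.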

• Simple Parallel SGD for MF

• Parameter Mixing: MSGD

◦ entries of 𝑉 are distributed across multiple machines

[Figure: the entries of 𝑽 are split across Machine 1, Machine 2, and Machine 3]

• Simple Parallel SGD for MF (cont.)

• Parameter Mixing: MSGD

◦ entries of 𝑉 are distributed across multiple machines

◦ Map step: each machine runs SGD independently on the assigned entries until convergence

[Figure: Machine 1, Machine 2, and Machine 3 each hold a part of 𝑽 and their own copies of 𝑾 and 𝑯]

• Simple Parallel SGD for MF (cont.)

• Parameter Mixing: MSGD

◦ Map step: each machine runs an independent instance of SGD on subsets of the data until convergence

◦ Reduce step: average results (i.e., 𝑾 and 𝑯)

• Problem: does not converge to correct solutions

◦ no guarantee that “the average” reduces the loss function 𝐿

[Figure: the averaged 𝑯 is (𝑯 from Machine 1 + 𝑯 from Machine 2 + 𝑯 from Machine 3) / 3]
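A tiny constructed example can make the averaging problem concrete. The sketch below is not from the lecture: it assumes two hypothetical machines that both end at a zero-error factorization of the same single entry, which isolates the effect of the reduce step. The loss is not convex in 𝑾 and 𝑯 jointly, so the average of two good solutions can be a bad one. All names and values are illustrative.

```python
def squared_error(V, W, H):
    """Sum of squared errors over the non-missing entries of V."""
    r = len(H)
    return sum((v - sum(W[i][k] * H[k][j] for k in range(r))) ** 2
               for (i, j), v in V.items())


def average(A, B):
    """Entry-wise average of two matrices stored as lists of rows."""
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


# A single observed entry V_00 = 1, factorized with rank r = 1.
V = {(0, 0): 1.0}

# Two hypothetical local solutions, each with zero error: 1 * 1 = (-1) * (-1) = 1.
W1, H1 = [[1.0]], [[1.0]]
W2, H2 = [[-1.0]], [[-1.0]]

# Reduce step of parameter mixing: average W and H across machines.
W_avg, H_avg = average(W1, W2), average(H1, H2)

err_1 = squared_error(V, W1, H1)
err_2 = squared_error(V, W2, H2)
err_avg = squared_error(V, W_avg, H_avg)
```

Both local solutions fit the entry exactly, while the averaged factors are all zero and predict 0 for an entry whose value is 1.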

• Simple Parallel SGD for MF (cont.)

• Iterative Parameter Mixing: ISGD

◦ entries of 𝑉 are distributed across multiple machines

◦ Repeat until convergence

▪ Map step: each machine runs SGD independently on the assigned entries for some time

▪ Reduce step: average results (i.e., 𝑾 and 𝑯), then broadcast the averaged results

• Problem: slow convergence

◦ still has the averaging step

• Interchangeability

• How can we avoid the averaging step?

◦ let different machines update different entries of 𝑊 and 𝐻

• Two entries 𝑉𝑖𝑗 and 𝑉𝑘𝑙 of 𝑽 are interchangeable if they share neither row nor column

[Figure: 𝑉𝑖𝑗 and 𝑉𝑖𝑙 lie in the same row 𝑖, so SGD steps on them both update 𝑊𝑖: (together with 𝐻:𝑗 and 𝐻:𝑙, respectively). Not interchangeable!]

[Figure: 𝑉𝑖𝑗 and 𝑉𝑘𝑙 share neither row nor column, so SGD steps on them update 𝑊𝑖:, 𝐻:𝑗 and 𝑊𝑘:, 𝐻:𝑙, respectively. Interchangeable!]
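The definition translates directly into a check on index pairs. A minimal sketch, where the function name `interchangeable` and the example indices are assumptions chosen for illustration:

```python
def interchangeable(entry_a, entry_b):
    """True if two entries of V share neither a row nor a column.

    Each entry is given by its (row, column) index pair.
    """
    (i, j), (k, l) = entry_a, entry_b
    return i != k and j != l


# V_ij and V_il lie in the same row i, so both SGD steps would update W_i:.
same_row = interchangeable((0, 1), (0, 3))

# V_ij and V_kl share neither row nor column.
disjoint = interchangeable((0, 1), (2, 3))
```

Here `same_row` is `False` and `disjoint` is `True`, matching the two cases in the figures.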

• Interchangeability (cont.)

• If 𝑉𝑖𝑗 and 𝑉𝑘𝑙 are interchangeable, SGD steps on 𝑉𝑖𝑗 and 𝑉𝑘𝑙 can be parallelized safely

◦ Step on 𝑉𝑖𝑗: read 𝑊𝑖: and 𝐻:𝑗, compute gradient of 𝐿′(𝑉𝑖𝑗, 𝑊𝑖:, 𝐻:𝑗), update 𝑊𝑖: and 𝐻:𝑗

◦ Step on 𝑉𝑘𝑙: read 𝑊𝑘: and 𝐻:𝑙, compute gradient of 𝐿′(𝑉𝑘𝑙, 𝑊𝑘:, 𝐻:𝑙), update 𝑊𝑘: and 𝐻:𝑙

◦ no conflicts!

[Figure: 𝑉𝑖𝑗 with 𝑊𝑖: and 𝐻:𝑗, and 𝑉𝑘𝑙 with 𝑊𝑘: and 𝐻:𝑙, occupy disjoint rows of 𝑾 and disjoint columns of 𝑯]

• Distributed SGD

• Block 𝑾, 𝑯, and 𝑽

[Figure: 𝑊 is split into row blocks 𝑊(1), 𝑊(2), 𝑊(3) and 𝐻 into column blocks 𝐻(1), 𝐻(2), 𝐻(3), which splits 𝑉 into a 3 by 3 grid of blocks:]

𝑉(1,1) 𝑉(1,2) 𝑉(1,3)

𝑉(2,1) 𝑉(2,2) 𝑉(2,3)

𝑉(3,1) 𝑉(3,2) 𝑉(3,3)

• Distributed SGD (cont.)

• Block 𝑾, 𝑯, and 𝑽

• repeat until convergence

◦ for a set of interchangeable blocks of 𝑉

▪ 1. run SGD on the blocks in parallel

◦ no conflict between machines

▪ 2. merge results (i.e., 𝑾 and 𝑯)

◦ averaging is not needed

[Figure: the blocks 𝑉(1,1), 𝑉(2,2), and 𝑉(3,3) are highlighted, together with 𝑊(1), 𝑊(2), 𝑊(3) and 𝐻(1), 𝐻(2), 𝐻(3). Blocks on a diagonal are interchangeable!]
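The loop above can be sketched on a single machine by letting threads stand in for machines. This is an illustrative sketch under several assumptions, not the DSGD implementation from the paper: blocks are contiguous index ranges, the sets of interchangeable blocks are taken as shifted diagonals, the regularization is applied per step without the 1/𝑁 scaling (a simplification), and a fixed epoch count replaces the convergence test. The names `sgd_on_block`, `dsgd`, and `squared_error` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sgd_on_block(entries, V, W, H, lam, lr):
    """Run SGD steps over the entries of one block, updating W and H in place."""
    r = len(H)
    for (i, j) in entries:
        err = V[(i, j)] - sum(W[i][k] * H[k][j] for k in range(r))
        for k in range(r):
            w, h = W[i][k], H[k][j]
            # Simplified update: lam is not divided by N_i: or N_:j here.
            W[i][k] = w + lr * (2.0 * err * h - 2.0 * lam * w)
            H[k][j] = h + lr * (2.0 * err * w - 2.0 * lam * h)


def dsgd(V, W, H, n, m, d, lam=0.01, lr=0.02, epochs=100):
    """Blocked SGD: d x d blocks of V, one worker thread per block in a set."""
    # Group the non-missing entries of V by block (p, q), 0-based.
    blocks = {}
    for (i, j) in V:
        blocks.setdefault((i * d // n, j * d // m), []).append((i, j))
    with ThreadPoolExecutor(max_workers=d) as pool:
        for _ in range(epochs):
            for s in range(d):
                # Blocks (p, (p + s) mod d) share no row block of W and no
                # column block of H, so workers never touch the same parameters.
                jobs = [pool.submit(sgd_on_block,
                                    blocks.get((p, (p + s) % d), []),
                                    V, W, H, lam, lr)
                        for p in range(d)]
                for job in jobs:
                    job.result()  # wait for the whole set; nothing is averaged


def squared_error(V, W, H):
    """Sum of squared errors over the non-missing entries."""
    r = len(H)
    return sum((v - sum(W[i][k] * H[k][j] for k in range(r))) ** 2
               for (i, j), v in V.items())


# Toy data (hypothetical): fully observed 3 x 3 rank-1 matrix, V_ij = (i+1)(j+1).
n = m = 3
V = {(i, j): float((i + 1) * (j + 1)) for i in range(n) for j in range(m)}
W = [[0.5] for _ in range(n)]
H = [[0.5] * m]
before = squared_error(V, W, H)
dsgd(V, W, H, n, m, d=3)
after = squared_error(V, W, H)
```

Because the workers in one set write to disjoint rows of 𝑾 and disjoint columns of 𝑯, the "merge" is just keeping each worker's updates; nothing has to be averaged.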

• Interchangeable Blocks

• Example of interchangeable blocks

[Figure: the blocks 𝑉(1,1), 𝑉(2,2), and 𝑉(3,3) are selected. Blocks are interchangeable!]

◦ Machine 1: 𝑊(1), 𝐻(1), 𝑉(1,1)

◦ Machine 2: 𝑊(2), 𝐻(2), 𝑉(2,2)

◦ Machine 3: 𝑊(3), 𝐻(3), 𝑉(3,3)

• Interchangeable Blocks (cont.)

• Example of interchangeable blocks

[Figure: the blocks 𝑉(1,2), 𝑉(2,3), and 𝑉(3,1) are selected. Blocks are interchangeable!]

◦ Machine 1: 𝑊(1), 𝐻(2), 𝑉(1,2)

◦ Machine 2: 𝑊(2), 𝐻(3), 𝑉(2,3)

◦ Machine 3: 𝑊(3), 𝐻(1), 𝑉(3,1)

• Interchangeable Blocks (cont.)

• Example of interchangeable blocks

[Figure: the blocks 𝑉(1,3), 𝑉(2,1), and 𝑉(3,2) are selected]

◦ 𝑊(1), 𝐻(3), 𝑉(1,3)

◦ 𝑊(2), 𝐻(1), 𝑉(2,1)
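The three example sets of blocks shown on these slides follow a cyclic-shift pattern. The sketch below enumerates them for a 𝑑 by 𝑑 blocking and checks that each set is interchangeable at the block level. The shift rule is one simple way to produce such sets and is an assumption of this sketch, not necessarily how DSGD selects them; the names `strata` and `is_interchangeable_set` are hypothetical.

```python
def strata(d):
    """Return d sets of interchangeable blocks for a d x d blocking of V.

    Set s pairs row block b with column block (b + s) mod d, so no two
    blocks in a set share a row block of W or a column block of H.
    Block indices are 1-based to match the slides.
    """
    return [[(b + 1, (b + s) % d + 1) for b in range(d)] for s in range(d)]


def is_interchangeable_set(blocks):
    """True if no two blocks share a row-block index or a column-block index."""
    rows = [p for p, _ in blocks]
    cols = [q for _, q in blocks]
    return len(set(rows)) == len(rows) and len(set(cols)) == len(cols)


schedule = strata(3)
all_ok = all(is_interchangeable_set(s) for s in schedule)
covered = {block for s in schedule for block in s}
```

For `d = 3` the schedule contains the diagonal blocks, then 𝑉(1,2), 𝑉(2,3), 𝑉(3,1), then 𝑉(1,3), 𝑉(2,1), 𝑉(3,2), so the three sets together cover all nine blocks of 𝑉.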
