
  • Large-scale Matrix Factorization

    Kijung Shin

    Ph.D. Student, CSD

  • Roadmap

    ◦ Matrix Factorization (review)
    ◦ Algorithms
      ▪ Distributed SGD: DSGD
      ▪ Alternating Least Squares: ALS
      ▪ Cyclic Coordinate Descent: CCD++
    ◦ Experiments
    ◦ Extension to Tensor Factorization
    ◦ Conclusions


  • Matrix Factorization: Problem

    ◦ Given:
      ▪ 𝑽: an 𝑛-by-𝑚 matrix, possibly with missing values
      ▪ 𝑟: target rank (a scalar), usually 𝑟 ≪ min(𝑛, 𝑚)


  • Matrix Factorization: Problem

    ◦ Goal: find matrices 𝑾 and 𝑯 such that 𝑾𝑯 ≈ 𝑽 (a small sketch of the setup follows below)

    [Figure: 𝑾 × 𝑯 ≈ 𝑽, with the missing entries of 𝑽 marked "?"]
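To ground the setup, here is a minimal NumPy sketch (the sizes, the seed, and the NaN encoding of missing values are illustrative assumptions, not from the slides): it builds a small 𝑉 with missing entries and shows the shapes of the factors we are looking for.

```python
import numpy as np

# A minimal sketch of the problem setup (sizes and names are illustrative):
# given V (n by m, possibly with missing values encoded as NaN) and a target
# rank r (usually r << min(n, m)), we seek W (n by r) and H (r by m) with
# W @ H ~= V on the non-missing entries.

n, m, r = 6, 5, 2
rng = np.random.default_rng(0)
V = rng.random((n, m))
V[1, 2] = V[4, 0] = np.nan      # missing entries (the "?" in the figure)
W = rng.random((n, r))          # the unknowns to be learned
H = rng.random((r, m))
print((W @ H).shape)            # (6, 5): same shape as V, with no missing entries
```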

  • Loss Function

      $$ L(V, W, H) = \sum_{(i,j) \in Z} \left( V_{ij} - (WH)_{ij} \right)^2 + \dots $$

    ◦ Goal: make 𝑾𝑯 similar to 𝑽
    ◦ 𝑍: the indices of the non-missing entries, i.e., (𝑖, 𝑗) ∈ 𝑍 ↔ 𝑉𝑖𝑗 is not missing
    ◦ 𝑉𝑖𝑗: the (𝑖, 𝑗)-th entry of 𝑉; (𝑊𝐻)𝑖𝑗: the (𝑖, 𝑗)-th entry of 𝑊𝐻

  • Loss Function (cont.)

      $$ L(V, W, H) = \sum_{(i,j) \in Z} \left( V_{ij} - (WH)_{ij} \right)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right) $$

    ◦ Goal of the second term: prevent overfitting (by keeping the entries of 𝑊 and 𝐻 close to zero); a small numeric sketch follows below
    ◦ 𝜆: regularization parameter
    ◦ Frobenius norm: $\|W\|_F^2 = \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2$
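As a concrete check of the formula above, here is a minimal NumPy sketch (the function and variable names, the NaN encoding of missing entries, and the toy sizes are illustrative assumptions): it evaluates the regularized loss over the non-missing entries only.

```python
import numpy as np

# A minimal sketch: the regularized MF loss
#   L(V, W, H) = sum over observed (i, j) of (V_ij - (WH)_ij)^2
#                + lambda * (||W||_F^2 + ||H||_F^2)
# Missing entries of V are represented as NaN.

def mf_loss(V, W, H, lam):
    mask = ~np.isnan(V)               # Z: positions of the non-missing entries
    resid = V[mask] - (W @ H)[mask]   # V_ij - (WH)_ij over observed entries only
    return np.sum(resid ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

rng = np.random.default_rng(0)
n, m, r = 4, 5, 2
V = rng.random((n, m))
V[0, 1] = V[2, 3] = np.nan            # a few missing entries
W = rng.random((n, r))
H = rng.random((r, m))
print(mf_loss(V, W, H, lam=0.1))
```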

  • Algorithms

    ◦ How can we minimize this loss function 𝐿?
      ▪ Stochastic Gradient Descent: SGD (covered in the last lecture)
      ▪ Alternating Least Squares: ALS (covered today)
      ▪ Cyclic Coordinate Descent: CCD++ (covered today)
    ◦ Are these algorithms parallelizable?
      ▪ Yes, all of them!


  • Distributed SGD (DSGD)

    ◦ "Large-scale matrix factorization with distributed stochastic gradient descent" (KDD 2011)
      by Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis

  • Stochastic GD for MF (review)

    ◦ Let 𝑊𝑖: denote the 𝑖-th row of 𝑊 and 𝐻:𝑗 denote the 𝑗-th column of 𝐻:

      $$ W = \begin{bmatrix} \text{---}\; W_{1:} \;\text{---} \\ \vdots \\ \text{---}\; W_{n:} \;\text{---} \end{bmatrix}, \qquad H = \begin{bmatrix} | & & | \\ H_{:1} & \cdots & H_{:m} \\ | & & | \end{bmatrix} $$

    ◦ 𝐿(𝑉, 𝑊, 𝐻): the sum of the losses of the non-missing entries,

      $$ L(V, W, H) = \sum_{(i,j) \in Z} L'(V_{ij}, W_{i:}, H_{:j}) $$

      where the loss at each non-missing entry 𝑉𝑖𝑗 is

      $$ L'(V_{ij}, W_{i:}, H_{:j}) := \left( V_{ij} - W_{i:} H_{:j} \right)^2 + \lambda \left( \frac{\|W_{i:}\|^2}{N_{i:}} + \frac{\|H_{:j}\|^2}{N_{:j}} \right) $$

    ◦ Note: $W_{i:} H_{:j} = (WH)_{ij}$
    ◦ 𝑁𝑖: (resp. 𝑁:𝑗): the number of non-missing entries in the 𝑖-th row (resp. 𝑗-th column) of 𝑉

  • Stochastic GD for MF (cont.)

    ◦ Stochastic Gradient Descent (SGD) for MF
      ▪ repeat until convergence:
        ◦ randomly shuffle the non-missing entries of 𝑉
        ◦ for each non-missing entry: perform an SGD step on it

  • Stochastic GD for MF (cont.)

    ◦ An SGD step on a non-missing entry 𝑉𝑖𝑗 (sketched below):
      ▪ Read 𝑊𝑖: and 𝐻:𝑗
      ▪ Compute the gradient of 𝐿′(𝑉𝑖𝑗, 𝑊𝑖:, 𝐻:𝑗)
      ▪ Update 𝑊𝑖: and 𝐻:𝑗 (the detailed update rules were covered in the last lecture)

    [Figure: the entry 𝑉𝑖𝑗 of 𝑽 together with the row 𝑊𝑖: of 𝑾 and the column 𝐻:𝑗 of 𝑯]
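The update rules themselves were covered in the last lecture; the sketch below is one plausible NumPy version of a single SGD step on 𝑉𝑖𝑗, derived from the per-entry loss 𝐿′ defined earlier (the learning rate `lr` and all names are assumptions, not the lecture's exact code). Note that the step reads and writes only 𝑊𝑖: and 𝐻:𝑗.

```python
import numpy as np

# A minimal sketch of one SGD step on a single observed entry V[i, j], using
# the per-entry loss L'(V_ij, W_i:, H_:j) defined above. N_row[i] = N_i: and
# N_col[j] = N_:j are the per-row / per-column counts of observed entries.

def sgd_step(V, W, H, i, j, N_row, N_col, lam, lr):
    err = V[i, j] - W[i, :] @ H[:, j]                            # V_ij - W_i: H_:j
    grad_W = -2 * err * H[:, j] + 2 * lam * W[i, :] / N_row[i]   # dL'/dW_i:
    grad_H = -2 * err * W[i, :] + 2 * lam * H[:, j] / N_col[j]   # dL'/dH_:j
    W[i, :] -= lr * grad_W          # only row i of W is touched
    H[:, j] -= lr * grad_H          # only column j of H is touched
```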

  • Simple Parallel SGD for MF

    ◦ Parameter Mixing: MSGD
      ▪ the entries of 𝑉 are distributed across multiple machines

    [Figure: the entries of 𝑽 partitioned across Machine 1, Machine 2, and Machine 3]

  • Simple Parallel SGD for MF (cont.)

    ◦ Parameter Mixing: MSGD
      ▪ the entries of 𝑉 are distributed across multiple machines
      ▪ Map step: each machine runs SGD independently on its assigned entries until convergence

    [Figure: each of Machine 1, Machine 2, and Machine 3 holds its share of 𝑽 and its own copy of 𝑾 and 𝑯]

  • Simple Parallel SGD for MF (cont.)

    ◦ Parameter Mixing: MSGD
      ▪ Map step: each machine runs an independent instance of SGD on its subset of the data until convergence
      ▪ Reduce step: average the results (i.e., 𝑾 and 𝑯)
    ◦ Problem: does not converge to correct solutions
      ▪ there is no guarantee that "the average" reduces the loss function 𝐿

    [Figure: 𝑯 = (𝑯 from Machine 1 + 𝑯 from Machine 2 + 𝑯 from Machine 3) / 3]

  • Simple Parallel SGD for MF (cont.)

    ◦ Iterative Parameter Mixing: ISGD
      ▪ the entries of 𝑉 are distributed across multiple machines
      ▪ repeat until convergence:
        ◦ Map step: each machine runs SGD independently on its assigned entries for some time
        ◦ Reduce step: average the results (i.e., 𝑾 and 𝑯) and broadcast the averaged results
    ◦ Problem: slow convergence, since the averaging step remains (a sketch of this scheme follows below)
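Here is a minimal single-process NumPy sketch of the ISGD scheme (the partitioning, round length, and all names are illustrative assumptions; `sgd_step` is the per-entry update sketched earlier). MSGD corresponds to a single map/reduce round in which each "machine" runs to convergence.

```python
import numpy as np

# A minimal single-process sketch of Iterative Parameter Mixing (ISGD):
# each "machine" runs SGD on its own subset of the observed entries for a
# while, then the local copies of W and H are averaged and broadcast.

def isgd(V, r, n_machines, n_rounds, lam, lr, steps_per_round=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    obs = np.argwhere(~np.isnan(V))                       # observed (i, j) pairs
    parts = np.array_split(rng.permutation(obs), n_machines)
    N_row = np.sum(~np.isnan(V), axis=1)                  # N_i:
    N_col = np.sum(~np.isnan(V), axis=0)                  # N_:j
    W, H = rng.random((n, r)), rng.random((r, m))
    for _ in range(n_rounds):
        local = [(W.copy(), H.copy()) for _ in range(n_machines)]
        for part, (Wl, Hl) in zip(parts, local):          # "map": independent SGD
            for i, j in part[rng.integers(len(part), size=steps_per_round)]:
                sgd_step(V, Wl, Hl, i, j, N_row, N_col, lam, lr)
        W = np.mean([Wl for Wl, _ in local], axis=0)      # "reduce": average W
        H = np.mean([Hl for _, Hl in local], axis=0)      # and H, then broadcast
    return W, H
```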

  • Interchangeability

    ◦ How can we avoid the averaging step?
      ▪ let different machines update different entries of 𝑊 and 𝐻
    ◦ Two entries 𝑉𝑖𝑗 and 𝑉𝑘𝑙 of 𝑽 are interchangeable if they share neither a row nor a column (a minimal check is sketched below)

    [Figure: 𝑉𝑖𝑗 and 𝑉𝑖𝑙 share the row 𝑊𝑖:, so they are not interchangeable; 𝑉𝑖𝑗 and 𝑉𝑘𝑙 share no row or column, so they are interchangeable]
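A minimal sketch of the check itself (the names are illustrative): two observed entries are interchangeable exactly when their row indices differ and their column indices differ, so their SGD steps touch disjoint parts of 𝑊 and 𝐻.

```python
# Two observed entries V[i, j] and V[k, l] are interchangeable iff they share
# neither a row nor a column; then their SGD steps read and write disjoint
# rows of W and columns of H.
def interchangeable(i, j, k, l):
    return i != k and j != l

assert interchangeable(0, 0, 1, 1)        # different row and different column
assert not interchangeable(0, 0, 0, 2)    # same row: both steps touch W[0, :]
```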

  • Interchangeability (cont.)

    ◦ If 𝑉𝑖𝑗 and 𝑉𝑘𝑙 are interchangeable, SGD steps on 𝑉𝑖𝑗 and 𝑉𝑘𝑙 can be parallelized safely; there are no conflicts:
      ▪ Step on 𝑉𝑖𝑗: read 𝑊𝑖: and 𝐻:𝑗, compute the gradient of 𝐿′(𝑉𝑖𝑗, 𝑊𝑖:, 𝐻:𝑗), update 𝑊𝑖: and 𝐻:𝑗
      ▪ Step on 𝑉𝑘𝑙: read 𝑊𝑘: and 𝐻:𝑙, compute the gradient of 𝐿′(𝑉𝑘𝑙, 𝑊𝑘:, 𝐻:𝑙), update 𝑊𝑘: and 𝐻:𝑙

    [Figure: 𝑉𝑖𝑗 and 𝑉𝑘𝑙 with their disjoint rows 𝑊𝑖:, 𝑊𝑘: and columns 𝐻:𝑗, 𝐻:𝑙]

  • Distributed SGD

    ◦ Block 𝑾, 𝑯, and 𝑽

    [Figure: 𝑾 split into row blocks 𝑊(1), 𝑊(2), 𝑊(3); 𝑯 split into column blocks 𝐻(1), 𝐻(2), 𝐻(3); 𝑽 split into the corresponding blocks 𝑉(1,1), 𝑉(1,2), …, 𝑉(3,3)]

  • Distributed SGD (cont.)

    ◦ Block 𝑾, 𝑯, and 𝑽
    ◦ repeat until convergence:
      ▪ for each set of interchangeable blocks of 𝑉:
        ◦ 1. run SGD on the blocks in parallel (no conflict between machines)
        ◦ 2. merge the results (i.e., 𝑾 and 𝑯); averaging is not needed
    ◦ Blocks on a diagonal are interchangeable! (a sketch of the full scheme follows below)

    [Figure: the diagonal blocks 𝑉(1,1), 𝑉(2,2), 𝑉(3,3) with their row blocks 𝑊(1), 𝑊(2), 𝑊(3) and column blocks 𝐻(1), 𝐻(2), 𝐻(3)]
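The following is a minimal single-process NumPy sketch of the DSGD scheduling idea, not the paper's implementation (the blocking, the stratum order, and all names are assumptions; `sgd_step` is the per-entry update sketched earlier). Each sub-epoch processes one set of 𝑑 interchangeable blocks, which 𝑑 machines could handle in parallel with no conflicts; because the blocks touch disjoint row blocks of 𝑊 and column blocks of 𝐻, merging their results needs no averaging.

```python
import numpy as np

# A minimal sketch of one DSGD epoch: split V into d x d blocks and, in each
# sub-epoch, process one "stratum" of d interchangeable blocks, i.e. blocks
# that share no row block of W and no column block of H. Here the blocks are
# processed sequentially on shared W and H, which gives the same result as
# running them in parallel on d machines and then merging the updated blocks.

def dsgd_epoch(V, W, H, d, N_row, N_col, lam, lr):
    n, m = V.shape
    row_ids = np.array_split(np.arange(n), d)    # row blocks  -> W(1), ..., W(d)
    col_ids = np.array_split(np.arange(m), d)    # column blocks -> H(1), ..., H(d)
    for s in range(d):                           # one stratum per sub-epoch
        # stratum s = blocks {(b, (b + s) mod d)}: pairwise interchangeable
        for b in range(d):                       # could run on d machines in parallel
            rows, cols = row_ids[b], col_ids[(b + s) % d]
            for i in rows:
                for j in cols:
                    if not np.isnan(V[i, j]):    # SGD step on each observed entry
                        sgd_step(V, W, H, i, j, N_row, N_col, lam, lr)
```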

  • Interchangeable Blocks

    ◦ Example of interchangeable blocks

    [Figure: 𝑉(1,1), 𝑉(2,2), 𝑉(3,3) are interchangeable; Machine 1 handles 𝑉(1,1) with 𝑊(1), 𝐻(1); Machine 2 handles 𝑉(2,2) with 𝑊(2), 𝐻(2); Machine 3 handles 𝑉(3,3) with 𝑊(3), 𝐻(3)]

  • Interchangeable Blocks (cont.)

    ◦ Example of interchangeable blocks

    [Figure: 𝑉(1,2), 𝑉(2,3), 𝑉(3,1) are interchangeable; Machine 1 handles 𝑉(1,2) with 𝑊(1), 𝐻(2); Machine 2 handles 𝑉(2,3) with 𝑊(2), 𝐻(3); Machine 3 handles 𝑉(3,1) with 𝑊(3), 𝐻(1)]

  • Interchangeable Blocks (cont.)

    ◦ Example of interchangeable blocks

    [Figure: 𝑉(1,3), 𝑉(2,1), 𝑉(3,2) are interchangeable; Machine 1 handles 𝑉(1,3) with 𝑊(1), 𝐻(3); Machine 2 handles 𝑉(2,1) with 𝑊(2), 𝐻(1); Machine 3 handles 𝑉(3,2) with 𝑊(3), 𝐻(2)]