NOMAD: A Distributed Framework for Latent Variable Models

Inderjit S. Dhillon

Department of Computer Science
University of Texas at Austin

Joint work with H.-F. Yu, C.-J. Hsieh, H. Yun, and S.V.N. Vishwanathan

NIPS 2014 Workshop: Distributed Machine Learning and Matrix Computations
Latent Variable Models: very useful in many applications
Latent factor models for recommender systems (e.g., MF)
Topic models for document corpora (e.g., LDA)
Fast growth of data
Almost 2.5 × 10^18 bytes of data are added each day
90% of the world's data today was generated in the past two years
Challenges
Challenges arise at the algorithmic as well as the hardware level.
Many effective algorithms involve fine-grained iterative computation ⇒ hard to parallelize
Many current parallel approaches:
  bulk synchronization ⇒ wasted CPU power while communicating
  complicated locking mechanisms ⇒ hard to scale to many machines
  asynchronous computation using a parameter server ⇒ not serializable, danger of stale parameters
Proposed NOMAD Framework
access-graph analysis to exploit parallelism
asynchronous computation, non-blocking communication, and lock-free implementation
serializable (or almost serializable) update sequences
successful applications: MF and LDA
Matrix Factorization:
Recommender Systems
Recommender Systems
Matrix Factorization Approach: A ≈ WH^T
min_{W ∈ R^{m×k}, H ∈ R^{n×k}}  ∑_{(i,j) ∈ Ω} (A_{ij} − w_i^T h_j)^2 + λ(‖W‖_F^2 + ‖H‖_F^2),

where Ω = {(i, j) : A_{ij} is observed}.

Regularization terms avoid over-fitting.
The factorization maps users/items to a latent feature space R^k:
  the i-th user ⇒ the i-th row of W, w_i^T
  the j-th item ⇒ the j-th column of H^T, h_j
w_i^T h_j measures the interaction between user i and item j.
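To make this concrete, here is a minimal numpy sketch (illustrative names, not from the slides) that evaluates the regularized objective over the observed set Ω:

```python
import numpy as np

def mf_objective(omega, W, H, lam):
    """Regularized squared error over the observed entries.

    omega: dict mapping (i, j) -> observed rating A_ij (the set Omega)
    W:     m x k user factors; H: n x k item factors; lam: lambda
    """
    loss = sum((a_ij - W[i] @ H[j]) ** 2 for (i, j), a_ij in omega.items())
    return loss + lam * (np.linalg.norm(W) ** 2 + np.linalg.norm(H) ** 2)
```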
SGM: Stochastic Gradient Method
SGM update: pick (i, j) ∈ Ω:
  R_{ij} ← A_{ij} − w_i^T h_j
  w_i ← w_i − η (λ/|Ω_i| · w_i − R_{ij} h_j)
  h_j ← h_j − η (λ/|Ω_j| · h_j − R_{ij} w_i)
Ω_i: observed ratings in the i-th row.
Ω_j: observed ratings in the j-th column.

[Figure: 3 × 3 rating matrix A with rows w_1^T, w_2^T, w_3^T of W and columns h_1, h_2, h_3 of H^T.]
One iteration: |Ω| updates (see the sketch below)
Time per update: O(k)
Time per iteration: O(|Ω|k), better than the O(|Ω|k^2) of ALS
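A minimal numpy sketch of one SGM iteration (illustrative names; it assumes the λ/|Ω_i| regularizer scaling shown above):

```python
import numpy as np

def sgm_iteration(omega, W, H, cnt_row, cnt_col, eta, lam, rng):
    """One iteration = |Omega| stochastic updates, each costing O(k).

    omega:   list of observed entries (i, j, A_ij)
    cnt_row: cnt_row[i] = |Omega_i|;  cnt_col: cnt_col[j] = |Omega_j|
    rng:     numpy Generator, e.g. np.random.default_rng()
    """
    for idx in rng.permutation(len(omega)):
        i, j, a_ij = omega[idx]
        r_ij = a_ij - W[i] @ H[j]                      # residual R_ij
        w_old = W[i].copy()                            # w_i before its update
        W[i] -= eta * (lam / cnt_row[i] * W[i] - r_ij * H[j])
        H[j] -= eta * (lam / cnt_col[j] * H[j] - r_ij * w_old)
```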
Parallel Stochastic Gradient Descent for MF
Challenge: direct parallel updates ⇒ memory conflicts.
Multi-core parallelization:
  Hogwild [Niu 2011]
  Jellyfish [Recht et al, 2011]
  FPSGD** [Zhuang et al, 2013]
Multi-machine parallelization:
  DSGD [Gemulla et al, 2011]
  DSGD++ [Teflioudi et al, 2013]
DSGD/JellyFish [Gemulla et al, 2011; Recht et al, 2011]
[Figure: DSGD/JellyFish partitions A into p × p blocks of observed ratings ("x"). In each sub-epoch, the p workers update a set of blocks that share no rows or columns; the workers then synchronize and communicate (exchanging blocks of H) before moving on to the next block pattern. A sketch of this schedule follows.]
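A rough sketch of the block-cyclic schedule behind DSGD (process_block and synchronize are hypothetical placeholders; in each sub-epoch the p active blocks share no rows or columns, so their SGD updates cannot conflict):

```python
def dsgd_epoch(p, process_block, synchronize):
    """One DSGD epoch over a p x p grid of blocks of A.

    process_block(q, b): worker q runs SGD over block (q, b) of A
    synchronize():       barrier where blocks of H are exchanged
    """
    for s in range(p):                     # p sub-epochs per epoch
        for q in range(p):                 # conceptually runs in parallel
            process_block(q, (q + s) % p)  # shifted-diagonal block pattern
        synchronize()                      # CPUs sit idle here -- the cost NOMAD avoids
```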
Proposed Asynchronous Approach:
NOMAD-MF [Yun et al, 2014]
Motivation

Most existing parallel approaches require:

Synchronization
  E.g., ALS, DSGD/JellyFish, DSGD++, CCD++
  Computing power is wasted, even with interleaved computation and communication
  Curse of the last reducer

Locking
  E.g., parallel SGD, FPSGD**
  A standard way to avoid conflicts and guarantee serializability
  But complicated remote locking slows down the computation
  Hard to implement efficient locking on a distributed system

Computation using stale values
  E.g., Hogwild, asynchronous SGD using a parameter server
  Lack of serializability

Q: Can we avoid both synchronization and locking, yet keep CPUs from being idle and guarantee serializability?
Our answer: NOMAD
A: Yes, NOMAD keeps CPU and network busy simultaneously
Stochastic gradient update rule:
  only a small set of variables is involved in each update
Nomadic token passing:
  widely used in the telecommunications area
  avoids conflicts without explicit remote locking
  idea: "owner computes"
  NOMAD: multiple "active tokens", passed nomadically (see the sketch after the feature list)
Features:
fully asynchronous computation
lock-free implementation
non-blocking communication
serializable update sequence
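A minimal single-machine sketch of nomadic token passing with Python threads and queues (illustrative; the real NOMAD implementation is a distributed C++ system, and updates_for / sgd_update are hypothetical placeholders):

```python
import queue
import random

def worker(my_queue, all_queues, updates_for, sgd_update):
    """Owner computes: a worker may update h_j only while holding token (j, h_j).

    my_queue:        this worker's concurrent token queue (queue.Queue)
    all_queues:      every worker's queue; tokens are forwarded to a random peer
    updates_for(j):  the (i, A_ij) ratings this worker stores for column j
    """
    while True:
        j, h_j = my_queue.get()                  # blocks until a token arrives
        if j is None:                            # shutdown signal
            return
        for i, a_ij in updates_for(j):
            sgd_update(i, j, a_ij, h_j)          # touches only w_i (owned) and h_j
        random.choice(all_queues).put((j, h_j))  # pass the token on -- no locks, no barriers
```

Because each token (j, h_j) sits in exactly one queue or one worker at a time, no two workers ever update the same h_j concurrently: the active edges always form a matching, which is what makes the update sequence serializable.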
Access Graph for Stochastic Gradient
Access graph G = (V, E):
  V = {w_i} ∪ {h_j}
  E = {e_ij : (i, j) ∈ Ω}
Connection to SG:
  each e_ij corresponds to one SG update
  the update accesses only w_i and h_j
Parallelism:
  edges without a common node can be updated in parallel
  i.e., identify a "matching" in the graph
Nomadic token passing:
  a mechanism ensuring the active edges always form a "matching"
  serializability guaranteed

[Figure: bipartite access graph with user nodes w_i on one side and item nodes h_j on the other.]
More Details
Nomadic tokens for {h_j}:
  n tokens
  token (j, h_j): O(k) space
Worker:
  p workers
  a computing unit + a concurrent token queue
  a block of W: O(mk/p) space
  a block row of A: O(|Ω|/p) space

[Figure: each worker holds one block row of A (observed ratings marked "x") and the matching block of W, while the tokens (j, h_j) circulate among the workers.]
Illustration of NOMAD communication

[Figure, animated over several frames: tokens (j, h_j) hop from worker to worker. At any instant each token is held by exactly one worker, which performs the SG updates for the ratings it stores in column j and then forwards the token to another worker, so computation and communication overlap and no two workers ever update the same h_j at the same time.]
Comparison on a Multi-core System
On a 32-core processor with enough RAM.
Comparison: NOMAD, FPSGD**, and CCD++.
[Plots: test RMSE vs. seconds.
Left: Netflix (100M ratings), machines=1, cores=30, λ = 0.05, k = 100.
Right: Yahoo! (250M ratings), machines=1, cores=30, λ = 1.00, k = 100.
Curves: NOMAD, FPSGD**, CCD++.]
Comparison on a Distributed System
On a distributed system with 32 machines.
Comparison: NOMAD, DSGD, DSGD++, and CCD++.
[Plots: test RMSE vs. seconds.
Left: Netflix (100M ratings), machines=32, cores=4, λ = 0.05, k = 100.
Right: Yahoo! (250M ratings), machines=32, cores=4, λ = 1.00, k = 100.
Curves: NOMAD, DSGD, DSGD++, CCD++.]
Super Linear Scaling of NOMAD-MF
[Plot: test RMSE vs. seconds × machines × cores on Yahoo! (cores=4 per machine, λ = 1.00, k = 100), with curves for 1, 2, 4, 8, 16, and 32 machines.]
Topic Modeling:
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA)
Each topic is a multinomial distribution over words
Each document is a multinomial distribution over topics
Each word is drawn from one of these topics
Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
Inference for LDA
Only documents are observed
θ_i, φ_t, z_{i,j} are latent
Goal: infer these latent structures
Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
Given:
  a corpus of documents {d_i : i = 1, …, N} and hyperparameters α, β
  each document d_i = {w_{i,j} : j = 1, …, n_i}
Exact inference for z_{i,j}, θ_i, φ_t:
  intractable
  the latent variables are dependent when conditioned on the data
Approximate inference approaches:
  Variational methods [Blei et al, 2003]
    an optimization approach
    runs faster, but produces biased results
  Gibbs sampling [Griffiths & Steyvers, 2004]
    an MCMC approach
    more accurate, but slower with a vanilla implementation
Goal: design a scalable Gibbs sampler for LDA
Gibbs Sampling for LDA [Griffiths & Steyvers, 2004]
Count matrices for the topic assignments z_{i,j}:
  n_{dt}: # words of document d assigned to topic t
  n_{wt}: # times word w is assigned to topic t
  n_t := ∑_w n_{wt} = ∑_d n_{dt}
Gibbs sampling step (transcribed in the sketch below):
  1. Choose w := w_{i,j}, with old assignment t_o := z_{i,j}, in document d := d_i
  2. Decrease n_{d t_o}, n_{w t_o}, n_{t_o} by 1
  3. Resample a new assignment t_n := z_{i,j} according to
       Pr(z_{i,j} = t) ∝ (n_{dt} + α)(n_{wt} + β) / (n_t + β̄),  ∀t = 1, …, T
  4. Increase n_{d t_n}, n_{w t_n}, n_{t_n} by 1
Constants:
  J: vocabulary size
  β̄ = β × J
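A direct numpy transcription of this step (a sketch with illustrative names, not the optimized sampler discussed later):

```python
import numpy as np

def gibbs_step(d, w, t_old, n_dt, n_wt, n_t, alpha, beta, rng):
    """One collapsed Gibbs step for a single token (word w in document d).

    n_dt: D x T doc-topic counts; n_wt: J x T word-topic counts;
    n_t:  length-T topic totals. beta_bar = beta * J (J = vocabulary size).
    """
    J, T = n_wt.shape
    # Steps 1-2: remove the old assignment from the counts
    n_dt[d, t_old] -= 1; n_wt[w, t_old] -= 1; n_t[t_old] -= 1
    # Step 3: Pr(z = t) proportional to (n_dt + alpha)(n_wt + beta) / (n_t + beta_bar)
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + beta * J)
    t_new = rng.choice(T, p=p / p.sum())
    # Step 4: add the new assignment back
    n_dt[d, t_new] += 1; n_wt[w, t_new] += 1; n_t[t_new] += 1
    return t_new
```

Note the O(T) cost of forming p on every step; the sampling techniques compared below exist precisely to avoid paying Θ(T) each time.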
Access Pattern for Gibbs Sampling
[Figure: one Gibbs step for z_{ij} touches the row n_{d·} of the Docs × Topics count matrix, the row n_{w·} of the Words × Topics count matrix, and the global topic-count vector n_t.]
Multinomial Sampling Techniques for p ∈ R_+^T
[Table: per-technique costs — initialization time and space, sample-generation time, and parameter-update time.]
F+LDA: word-by-word ordering is faster than doc-by-doc for large I
  |T_d| is bounded by n_i, but |T_w| approaches T
  per-Gibbs-step cost: ρ_F log T + ρ_B |T_d|  (see the tree sketch below)
SparseLDA:
  per-Gibbs-step cost: Θ(T + |T_d| + |T_w|)
  the Θ(T) term rarely occurs, but |T_w| → T for large I
AliasLDA:
  per-Gibbs-step cost: ρ_A |T_d| + #MH
  ρ_A ≈ 3 × ρ_B: construction overhead of the alias table
  if (ρ_A − ρ_B)|T_d| > ρ_F log T ⇒ AliasLDA is slower than F+LDA
  e.g., with |T_d| ≈ 100, F+LDA is still faster for T < 250
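The ρ_F log T term comes from drawing from the dense part of the distribution via a tree of partial sums (the F+ tree). A rough sketch of that idea using a plain binary sum tree (not the exact F+ tree layout from the paper):

```python
import random

class SumTree:
    """Binary tree of partial sums: O(log T) weight updates and O(log T)
    draws from an unnormalized distribution -- a sketch of the structure
    behind F+LDA's rho_F * log T sampling term (not the exact F+ tree)."""

    def __init__(self, weights):
        self.T = len(weights)
        self.tree = [0.0] * (2 * self.T)
        self.tree[self.T:] = list(weights)       # leaves hold topic weights
        for i in range(self.T - 1, 0, -1):       # internal nodes hold sums
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, t, w):
        """Set the weight of topic t to w, refreshing its ancestors."""
        i = t + self.T
        self.tree[i] = w
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self):
        """Draw topic t with probability weight[t] / total."""
        u = random.random() * self.tree[1]       # tree[1] is the total mass
        i = 1
        while i < self.T:                        # descend toward a leaf
            i *= 2                               # go to the left child
            if u >= self.tree[i]:                # skip the left subtree's mass
                u -= self.tree[i]
                i += 1
        return i - self.T
```

Usage: tree = SumTree(weights); t = tree.sample(); tree.update(t, new_weight). After a Gibbs step only the changed topics' weights need refreshing, each in O(log T).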
Comparison of various sampling methods
Single machine, single thread
y-axis: speedup over normal O(T ) multinomial sampling
Enron: 38K docs with 6M tokens
NYTimes: 0.3M docs with 100M tokens
Access Pattern for Gibbs Sampling (revisited)

[Figure: as before, each Gibbs step for z_{ij} touches n_{d·} (Docs × Topics), n_{w·} (Words × Topics), and the global topic-count vector n_t.]
Access Graph for Gibbs Sampling
G = (V, E): a hypergraph
  V = {d_i} ∪ {w_j} ∪ {s}
  E = {e_ij = (d_i, w_j, s)}
Connection to Gibbs sampling:
  (d_i)_t := n_{d_i t}, (w_j)_t := n_{w_j t}, (s)_t := n_t
  each e_ij is one Gibbs step for word w_j in document d_i
  the step accesses (d_i, w_j, s)
Parallelism: more challenging
  all edges are incident to the summation node s
  all (s)_t are large in general ⇒ a slightly stale s is fine for accuracy
  duplicate s for parallelism

[Figure: tripartite access graph with word nodes w_j, document nodes d_i, and a single summation node s shared by every hyperedge.]
Nomadic Tokens for w_j

Nomadic tokens for {w_j : j = 1, …, J}:
  J tokens
  token (j, w_j): O(T) space
Worker:
  p workers
  a computing unit + a concurrent token queue
  a subset of {d_i}: O(IT/p) space

[Figure: "x" marks an occurrence of a word; the bigger rectangle is one worker's subset of the corpus, the smaller rectangle a unit subtask.]
Nomadic Token for s: Circular Delta Update
A single global s travels among the machines as a messenger and broadcasts the local delta updates.
Every machine p keeps (s_p, s̄):
  s_p: local working copy
  s̄: snapshot version of the global s
When the token visits machine 3 (sketched below):
  s ← s + (s_3 − s̄);  s̄ ← s;  s_3 ← s

[Figure: machines holding (s_1, s̄), (s_2, s̄), (s_3, s̄), (s_4, s̄) arranged in a ring, with the global s circulating among them.]
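A numpy sketch of one lap of the circular delta update (illustrative names; s is the length-T vector of topic totals n_t):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Machine:
    s_p: np.ndarray    # local working copy of the topic totals
    s_bar: np.ndarray  # snapshot of the global s from the token's last visit

def circulate(s_global, machines):
    """The single global token s visits each machine once, exchanging deltas."""
    for m in machines:
        s_global = s_global + (m.s_p - m.s_bar)  # absorb this machine's local delta
        m.s_bar = s_global.copy()                # refresh the snapshot
        m.s_p = s_global.copy()                  # reset the local working copy
    return s_global
```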
Comparison on a single multi-core machine
On a machine with a 20-core processor
Comparison: F+NOMAD LDA, Yahoo! LDA
PubMed: 9M docs with 700M tokens
Amazon: 30M docs with 1.5B tokens
Comparison on a Multi-machine System
32 machines, each with a 20-core processor.
Comparison: F+NOMAD LDA, Yahoo! LDA
Amazon: 30M docs with 1.5B tokens
UMBC: 40M docs with 1.5B tokens
Conclusions
The NOMAD framework uses nomadic tokens to provide:
  asynchronous computation
  non-blocking communication
  lock-free implementation
  serializable (or near-serializable) update sequences