GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Fabio Petroni and Leonardo Querzoni
Cyber Intelligence and Information Security (CIS) Sapienza
Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome
RecSys 2014, Foster City, Silicon Valley, 6th-10th October 2014
Slides: midlab.diag.uniroma1.it/articoli/presentation-recsys.pdf
- Stochastic Gradient Descent works by taking steps proportional to the negative of the gradient of the loss.
- stochastic = P and Q are updated for each single training case by a small step that, on average, follows the true gradient descent direction (a minimal update sketch follows below).
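To make the update concrete, here is a minimal Python sketch of one SGD step for matrix completion; the learning rate lr, the regularization weight reg and all sizes are illustrative choices, not values from the talk.

```python
import numpy as np

def sgd_step(P, Q, u, i, r, lr=0.01, reg=0.05):
    """One stochastic update for a single rating r of user u on item i."""
    p, q = P[u].copy(), Q[i].copy()
    err = r - p.dot(q)                 # prediction error for this training case
    P[u] += lr * (err * q - reg * p)   # step against the gradient of the loss
    Q[i] += lr * (err * p - reg * q)
    return err

# toy usage: 3 users, 4 items, k = 2 latent features
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(sgd_step(P, Q, u=0, i=2, r=4.0))
```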
Scalability
✗ lengthy training stages;
✗ high computational costs;
✗ especially on large data sets;
✗ input data may not fit in main memory.
- goal = efficiently exploit computer cluster architectures.
...
Distributed Asynchronous SGD
[Figure: the rating matrix R is split into blocks R1..R4 over four computing nodes; node k stores Rk together with local copies Pk, Qk of the factor vectors it needs.]
- R is split;
- factor vectors are replicated;
- replicas are concurrently updated;
- replicas deviate inconsistently;
- synchronization reconciles them (see the replica-layout sketch below).
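The replicas come directly from how the ratings (edges) are assigned: every user/item vector must be copied to each node that owns at least one of its ratings. A minimal sketch, with a made-up assignment:

```python
from collections import defaultdict

# Hypothetical edge assignment: rating (user, item) -> computing node.
assigned = {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1, (2, "a"): 2, (2, "c"): 3}

user_nodes = defaultdict(set)   # user vector must be replicated on these nodes
item_nodes = defaultdict(set)   # item vector must be replicated on these nodes
for (u, i), node in assigned.items():
    user_nodes[u].add(node)
    item_nodes[i].add(node)

vertices = len(user_nodes) + len(item_nodes)
replicas = sum(map(len, user_nodes.values())) + sum(map(len, item_nodes.values()))
print("replication factor:", replicas / vertices)   # average copies per vector
```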
Bulk Synchronous Processing Model
[Figure: computing nodes proceed in lock-step supersteps.]
1. local computation
2. communication
3. barrier synchronization
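A bare-bones sketch of one BSP loop; the node object and its three methods are hypothetical stand-ins for the three phases above.

```python
def bsp_epoch(node, supersteps):
    """Skeleton of a BSP computation; node and its methods are hypothetical."""
    for _ in range(supersteps):
        node.compute_local()   # 1. local computation on the node's own ratings
        node.exchange()        # 2. communication: exchange replica updates
        node.barrier()         # 3. barrier: wait for all nodes before proceeding
```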
Challenges 1/2
[Figure: the rating matrix R partitioned into R1..R4 across four computing nodes.]
- 1. Load balance → ensure that computing nodes are fed with the same load.
- 2. Minimize communication → minimize vector replicas.
Challenges 2/2
...
- 3. Tune the synchronization frequency among computing nodes.
- Current implementations synchronize vector copies:
  → continuously during the epoch (waste of resources);
  → once after every epoch (slow convergence).
- epoch = a single iteration over the ratings.
Contributions
✓ We mitigate the load imbalance by proposing an input slicing solution based on graph partitioning algorithms;
✓ we show how to reduce the amount of shared data by properly leveraging known characteristics of the input dataset (its bipartite power-law nature);
✓ we show how to leverage the tradeoff between communication cost and algorithm convergence rate by tuning the frequency of the bulk synchronization phase during the computation.
Graph representation
- The rating matrix describes a bipartite graph.
[Figure: the rating matrix R and the equivalent bipartite graph: users on one side, items on the other, one edge per rating.]
- Real data: skewed power-law degree distribution (a small sketch of the graph construction follows below).
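A minimal sketch of how the bipartite view arises from rating triples; the data here is a toy stand-in for MovieLens/Netflix.

```python
from collections import Counter

# toy ratings as (user, item, value) triples
ratings = [(0, "a", 5), (0, "b", 3), (1, "a", 4), (2, "a", 2), (2, "c", 1)]

edges = [(u, i) for u, i, _ in ratings]        # one bipartite edge per rating
item_degree = Counter(i for _, i in edges)     # a few items attract most edges
user_degree = Counter(u for u, _ in edges)     # most users rate few items
print(item_degree.most_common(), user_degree.most_common())
```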
Input partitioner
- Vertex-cut performs better than edge-cut in power-law graphs.
- Assumption: the input data doesn't fit in main memory.
- Streaming algorithm: each edge is assigned to a partition as it arrives.
- Balanced k-way vertex-cut graph partitioning:
  → minimize replicas;
  → balance edge load.
Balanced Vertex-Cut Streaming Algorithms
- Hashing: pseudo-random edge assignment;
- Grid: shuffle and split the rating matrix into identical blocks;
- Greedy: [Gonzalez et al. 2012] and [Ahmed et al. 2013].

Bipartite-Aware Greedy Algorithm
- Real-world bipartite graphs are often significantly skewed: one of the two vertex sets is much bigger than the other.
- By perfectly splitting the bigger set it is possible to achieve a smaller replication factor.
- GIP (Greedy - Item Partitioned) and GUP (Greedy - User Partitioned); a simplified sketch of the greedy placement follows below.
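A minimal sketch of a greedy streaming vertex-cut in the spirit of [Gonzalez et al. 2012]; the tie-breaking and the GUP-style rule are simplified illustrations, not the exact rules from the paper.

```python
from collections import defaultdict

def greedy_place(edges, k, user_partitioned=False):
    """Stream (user, item) edges into k partitions; return replication factor.

    User and item ids must be distinct keys (here: int users, str items).
    """
    replicas = defaultdict(set)            # vertex -> partitions holding a copy
    load = [0] * k                         # edges assigned to each partition

    for u, i in edges:
        if user_partitioned:               # GUP-style: split the user set evenly
            p = hash(u) % k
        else:                              # greedy: reuse existing replicas,
            cand = (replicas[u] & replicas[i]) or \
                   (replicas[u] | replicas[i]) or set(range(k))
            p = min(cand, key=lambda c: load[c])   # tie-break on least load
        replicas[u].add(p)
        replicas[i].add(p)
        load[p] += 1

    return sum(map(len, replicas.values())) / len(replicas), load

# usage on a toy edge list
rf, load = greedy_place([(0, "a"), (0, "b"), (1, "b"), (2, "a"), (2, "c")], k=4)
print(rf, load)
```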
Evaluation: The Data Sets
Degree distributions:
[Figure: log-log degree distributions (number of vertices vs degree, 10^0 to 10^5) for items and users in MovieLens 10M and Netflix.]
Experiments: Partitioning quality
[Figure: replication factor and load relative standard deviation (%) vs number of partitions (8 to 256) for hashing, grid, greedy, GIP and GUP, on MovieLens and Netflix.]
- RF = Replication Factor
- RSD = Relative Standard Deviation (of the load across partitions)
Synchronization frequency
[Figure: training loss (×10^7) vs epoch on Netflix for f = 1 and f = 100, comparing grid, greedy, GUP and a centralized baseline.]
- f = synchronization frequency parameter → number of synchronization steps during an epoch.
- tradeoff between communication cost and convergence rate (a frequency-parameterized epoch is sketched below).
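A sketch of an epoch split into f bulk-synchronous supersteps. The nodes and their methods are hypothetical, and averaging replicas is one plausible reconciliation rule assumed here for illustration.

```python
import numpy as np

def epoch(nodes, f):
    """One ASGD epoch run as f bulk-synchronous supersteps."""
    for _ in range(f):
        for n in nodes:                    # 1. local computation
            n.run_sgd(fraction=1.0 / f)    #    process ~1/f of the local ratings
        sync_replicas(nodes)               # 2. communication + 3. barrier

def sync_replicas(nodes):
    """Reconcile diverged replicas by averaging (assumed rule)."""
    grouped = {}
    for n in nodes:
        for v, vec in n.replicas.items():  # v -> np.ndarray factor vector
            grouped.setdefault(v, []).append(vec)
    for v, vecs in grouped.items():
        avg = np.mean(vecs, axis=0)
        for n in nodes:
            if v in n.replicas:
                n.replicas[v] = avg.copy()
```

Larger f means replicas deviate less between synchronizations (faster convergence) at the price of more communication; f = 1 recovers the once-per-epoch scheme.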
Evaluation: SSE and Communication cost
[Figure: SSE vs synchronization frequency for grid, greedy and GUP with |C| = 8 and |C| = 32, and communication cost (×10^7 on MovieLens, ×10^9 on Netflix) vs frequency for hashing, grid, greedy and GUP, with zoomed insets around f = 10.]
- SSE = sum of squared errors between the loss curves of the ASGD variants and the centralized SGD curve.
- CC = communication cost.
Communication cost
- T = the training set
- U = users set
- I = items set
- V = U ∪ I
- C = set of processing nodes
- RF = replication factor
- RF_U = users' RF
- RF_I = items' RF

RF = (|U| RF_U + |I| RF_I) / |V|

f = 1  →  CC ≈ 2(|U| RF_U + |I| RF_I) = 2|V| RF

f = |T|/|C|  →  CC ≈ |T| (RF_U + RF_I)
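As a quick sanity check on these formulas, a toy computation; the sizes are merely illustrative (roughly Netflix-scale) and the replication factors are assumed, not figures from the talk.

```python
# Illustrative sizes only, not results from the talk.
U, I, T, C = 480_000, 18_000, 100_000_000, 32   # |U|, |I|, |T|, |C|
RF_U, RF_I = 4.0, 12.0                          # assumed replication factors
V = U + I                                       # |V| = |U| + |I|

RF = (U * RF_U + I * RF_I) / V
cc_low = 2 * (U * RF_U + I * RF_I)              # f = 1: sync once per epoch
cc_high = T * (RF_U + RF_I)                     # f = |T|/|C|: sync continuously
print(f"RF={RF:.2f}  CC(f=1)={cc_low:.2e}  CC(f=|T|/|C|)={cc_high:.2e}")
```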
Conclusions
- Three distinct contributions aimed at improving the efficiency and scalability of ASGD:
  1. we proposed an input slicing solution based on a graph partitioning approach that mitigates the load imbalance among SGD instances (i.e. better scalability);
  2. we further reduced the amount of shared data by exploiting specific characteristics of the training dataset, which lowers communication costs during the algorithm execution (i.e. better efficiency);
  3. we introduced a synchronization frequency parameter driving a tradeoff that can be accurately leveraged to further improve the algorithm efficiency.
Thank you!
Questions?
Fabio Petroni
Rome, Italy
Current position:
PhD Student in Computer Engineering, Sapienza University of Rome
Research Areas:
Recommendation Systems, Collaborative Filtering, Distributed Systems