GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

Fabio Petroni and Leonardo Querzoni
Cyber Intelligence and Information Security (CIS) Sapienza
Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome
RecSys 2014, Foster City, Silicon Valley, 6th-10th October 2014
Slides: midlab.diag.uniroma1.it/articoli/presentation-recsys.pdf
- Stochastic Gradient Descent works by taking steps proportional to the negative of the gradient of the loss.
- stochastic = P and Q are updated for each single training case by a small step that, on average, follows the true gradient descent direction (a minimal update sketch follows below).
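To make the update concrete, here is a minimal Python sketch of one SGD step for matrix completion; the learning rate lr, the regularization weight reg and all sizes are illustrative choices, not values from the talk.

```python
import numpy as np

def sgd_step(P, Q, u, i, r, lr=0.01, reg=0.05):
    """One stochastic update for a single rating r of user u on item i."""
    p, q = P[u].copy(), Q[i].copy()
    err = r - p.dot(q)                 # prediction error for this training case
    P[u] += lr * (err * q - reg * p)   # step against the gradient of the loss
    Q[i] += lr * (err * p - reg * q)
    return err

# toy usage: 3 users, 4 items, k = 2 latent features
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(sgd_step(P, Q, u=0, i=2, r=4.0))
```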
Scalability
✗ lengthy training stages;
✗ high computational costs;
✗ especially on large data sets;
✗ input data may not fit in main memory.
- goal = efficiently exploit computer cluster architectures.
...
Distributed Asynchronous SGD
[Figure: the rating matrix R is split into blocks R1..R4 over four computing nodes; node k stores Rk together with local copies Pk, Qk of the factor vectors it needs.]
- R is split;
- factor vectors are replicated;
- replicas are concurrently updated;
- replicas deviate inconsistently;
- synchronization reconciles them (see the replica-layout sketch below).
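The replicas come directly from how the ratings (edges) are assigned: every user/item vector must be copied to each node that owns at least one of its ratings. A minimal sketch, with a made-up assignment:

```python
from collections import defaultdict

# Hypothetical edge assignment: rating (user, item) -> computing node.
assigned = {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1, (2, "a"): 2, (2, "c"): 3}

user_nodes = defaultdict(set)   # user vector must be replicated on these nodes
item_nodes = defaultdict(set)   # item vector must be replicated on these nodes
for (u, i), node in assigned.items():
    user_nodes[u].add(node)
    item_nodes[i].add(node)

vertices = len(user_nodes) + len(item_nodes)
replicas = sum(map(len, user_nodes.values())) + sum(map(len, item_nodes.values()))
print("replication factor:", replicas / vertices)   # average copies per vector
```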
Bulk Synchronous Processing Model
[Figure: computing nodes proceed in lock-step supersteps.]
1. local computation
2. communication
3. barrier synchronization
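A bare-bones sketch of one BSP loop; the node object and its three methods are hypothetical stand-ins for the three phases above.

```python
def bsp_epoch(node, supersteps):
    """Skeleton of a BSP computation; node and its methods are hypothetical."""
    for _ in range(supersteps):
        node.compute_local()   # 1. local computation on the node's own ratings
        node.exchange()        # 2. communication: exchange replica updates
        node.barrier()         # 3. barrier: wait for all nodes before proceeding
```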
Challenges 1/2
[Figure: the rating matrix R partitioned into R1..R4 across four computing nodes.]
- 1. Load balance → ensure that computing nodes are fed with the same load.
- 2. Minimize communication → minimize vector replicas.
Challenges 2/2
...
- 3. Tune the synchronization frequency among computing nodes.
- Current implementations synchronize vector copies:
  → continuously during the epoch (waste of resources);
  → once after every epoch (slow convergence).
- epoch = a single iteration over the ratings.
Contributions
✓ We mitigate the load imbalance by proposing an input slicing solution based on graph partitioning algorithms;
✓ we show how to reduce the amount of shared data by properly leveraging known characteristics of the input dataset (its bipartite power-law nature);
✓ we show how to leverage the tradeoff between communication cost and algorithm convergence rate by tuning the frequency of the bulk synchronization phase during the computation.
Graph representation
- The rating matrix describes a bipartite graph.
[Figure: the rating matrix R and the equivalent bipartite graph: users on one side, items on the other, one edge per rating.]
- Real data: skewed power-law degree distribution (a small sketch of the graph construction follows below).
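A minimal sketch of how the bipartite view arises from rating triples; the data here is a toy stand-in for MovieLens/Netflix.

```python
from collections import Counter

# toy ratings as (user, item, value) triples
ratings = [(0, "a", 5), (0, "b", 3), (1, "a", 4), (2, "a", 2), (2, "c", 1)]

edges = [(u, i) for u, i, _ in ratings]        # one bipartite edge per rating
item_degree = Counter(i for _, i in edges)     # a few items attract most edges
user_degree = Counter(u for u, _ in edges)     # most users rate few items
print(item_degree.most_common(), user_degree.most_common())
```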
Input partitioner
- Vertex-cut performs better than edge-cut in power-law graphs.
- Assumption: the input data doesn't fit in main memory.
- Streaming algorithm: each edge is assigned to a partition as it arrives.
- Balanced k-way vertex-cut graph partitioning:
  → minimize replicas;
  → balance edge load.
Balanced Vertex-Cut Streaming Algorithms
- Hashing: pseudo-random edge assignment;
- Grid: shuffle and split the rating matrix into identical blocks;
- Greedy: [Gonzalez et al. 2012] and [Ahmed et al. 2013].

Bipartite-Aware Greedy Algorithm
- Real-world bipartite graphs are often significantly skewed: one of the two vertex sets is much bigger than the other.
- By perfectly splitting the bigger set it is possible to achieve a smaller replication factor.
- GIP (Greedy - Item Partitioned) and GUP (Greedy - User Partitioned); a simplified sketch of the greedy placement follows below.
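A minimal sketch of a greedy streaming vertex-cut in the spirit of [Gonzalez et al. 2012]; the tie-breaking and the GUP-style rule are simplified illustrations, not the exact rules from the paper.

```python
from collections import defaultdict

def greedy_place(edges, k, user_partitioned=False):
    """Stream (user, item) edges into k partitions; return replication factor.

    User and item ids must be distinct keys (here: int users, str items).
    """
    replicas = defaultdict(set)            # vertex -> partitions holding a copy
    load = [0] * k                         # edges assigned to each partition

    for u, i in edges:
        if user_partitioned:               # GUP-style: split the user set evenly
            p = hash(u) % k
        else:                              # greedy: reuse existing replicas,
            cand = (replicas[u] & replicas[i]) or \
                   (replicas[u] | replicas[i]) or set(range(k))
            p = min(cand, key=lambda c: load[c])   # tie-break on least load
        replicas[u].add(p)
        replicas[i].add(p)
        load[p] += 1

    return sum(map(len, replicas.values())) / len(replicas), load

# usage on a toy edge list
rf, load = greedy_place([(0, "a"), (0, "b"), (1, "b"), (2, "a"), (2, "c")], k=4)
print(rf, load)
```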
Evaluation: The Data Sets
Degree distributions:
[Figure: log-log degree distributions (number of vertices vs degree, 10^0 to 10^5) for items and users in MovieLens 10M and Netflix.]
Experiments: Partitioning quality
[Figure: replication factor and load relative standard deviation (%) vs number of partitions (8 to 256) for hashing, grid, greedy, GIP and GUP, on MovieLens and Netflix.]
- RF = Replication Factor
- RSD = Relative Standard Deviation (of the load across partitions)
Synchronization frequency
[Figure: training loss (×10^7) vs epoch on Netflix for f = 1 and f = 100, comparing grid, greedy, GUP and a centralized baseline.]
- f = synchronization frequency parameter → number of synchronization steps during an epoch.
- tradeoff between communication cost and convergence rate (a frequency-parameterized epoch is sketched below).
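A sketch of an epoch split into f bulk-synchronous supersteps. The nodes and their methods are hypothetical, and averaging replicas is one plausible reconciliation rule assumed here for illustration.

```python
import numpy as np

def epoch(nodes, f):
    """One ASGD epoch run as f bulk-synchronous supersteps."""
    for _ in range(f):
        for n in nodes:                    # 1. local computation
            n.run_sgd(fraction=1.0 / f)    #    process ~1/f of the local ratings
        sync_replicas(nodes)               # 2. communication + 3. barrier

def sync_replicas(nodes):
    """Reconcile diverged replicas by averaging (assumed rule)."""
    grouped = {}
    for n in nodes:
        for v, vec in n.replicas.items():  # v -> np.ndarray factor vector
            grouped.setdefault(v, []).append(vec)
    for v, vecs in grouped.items():
        avg = np.mean(vecs, axis=0)
        for n in nodes:
            if v in n.replicas:
                n.replicas[v] = avg.copy()
```

Larger f means replicas deviate less between synchronizations (faster convergence) at the price of more communication; f = 1 recovers the once-per-epoch scheme.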
Evaluation: SSE and Communication cost
[Figure: SSE vs synchronization frequency for grid, greedy and GUP with |C| = 8 and |C| = 32, and communication cost (×10^7 on MovieLens, ×10^9 on Netflix) vs frequency for hashing, grid, greedy and GUP, with zoomed insets around f = 10.]
- SSE = sum of squared errors between the loss curves of the ASGD variants and the centralized SGD curve.
- CC = communication cost.
Communication cost
- T = the training set
- U = users set
- I = items set
- V = U ∪ I
- C = set of processing nodes
- RF = replication factor
- RF_U = users' RF
- RF_I = items' RF

RF = (|U| RF_U + |I| RF_I) / |V|

f = 1  →  CC ≈ 2(|U| RF_U + |I| RF_I) = 2|V| RF

f = |T|/|C|  →  CC ≈ |T| (RF_U + RF_I)
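As a quick sanity check on these formulas, a toy computation; the sizes are merely illustrative (roughly Netflix-scale) and the replication factors are assumed, not figures from the talk.

```python
# Illustrative sizes only, not results from the talk.
U, I, T, C = 480_000, 18_000, 100_000_000, 32   # |U|, |I|, |T|, |C|
RF_U, RF_I = 4.0, 12.0                          # assumed replication factors
V = U + I                                       # |V| = |U| + |I|

RF = (U * RF_U + I * RF_I) / V
cc_low = 2 * (U * RF_U + I * RF_I)              # f = 1: sync once per epoch
cc_high = T * (RF_U + RF_I)                     # f = |T|/|C|: sync continuously
print(f"RF={RF:.2f}  CC(f=1)={cc_low:.2e}  CC(f=|T|/|C|)={cc_high:.2e}")
```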
Conclusions
- Three distinct contributions aimed at improving the efficiency and scalability of ASGD:
  1. we proposed an input slicing solution based on a graph partitioning approach that mitigates the load imbalance among SGD instances (i.e. better scalability);
  2. we further reduced the amount of shared data by exploiting specific characteristics of the training dataset, which lowers communication costs during the algorithm execution (i.e. better efficiency);
  3. we introduced a synchronization frequency parameter driving a tradeoff that can be accurately leveraged to further improve the algorithm efficiency.
Thank you!
Questions?
Fabio Petroni
Rome, Italy
Current position:
PhD Student in Computer Engineering, Sapienza University of Rome
Research Areas:
Recommendation Systems, Collaborative Filtering, Distributed Systems