Recent progress on distributing deep learning
Viet-Trung Tran, KDE lab
Department of Information Systems, School of Information and Communication Technology
Outline
• State of the art
• Overview of neural networks and deep learning
• Deep learning driving factors
• Scaling deep learning
Perceptron
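As a reference point, a minimal perceptron in Python; the update rule below is a generic textbook sketch, not code from the talk:

import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    # Train a single perceptron: predict sign(w.x + b), nudge weights on mistakes.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):              # yi is +1 or -1
            pred = 1 if xi @ w + b > 0 else -1
            if pred != yi:                    # update only on misclassified examples
                w += lr * yi * xi
                b += lr * yi
    return w, b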
Feed forward neural network
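A small sketch of what a feed-forward pass looks like, assuming fully connected layers with a sigmoid nonlinearity (purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Propagate an input through successive fully connected layers.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)                # affine map followed by a nonlinearity
    return a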
Training algorithm
• while not done yet – pick a random training case (x, y) – run neural network on input x – modify connections to make prediction closer to
y, follow the gradient of the error w.r.t. the connections
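The loop above, written as a rough Python sketch; the model interface (forward, backward, params) is a placeholder I'm assuming, not an API from the talk:

import random

def train(model, data, lr=0.01, steps=10000):
    # Pick a random case, run the network, step along the error gradient.
    for _ in range(steps):
        x, y = random.choice(data)                 # pick a random training case (x, y)
        prediction = model.forward(x)              # run the neural network on input x
        grads = model.backward(prediction, y)      # gradient of the error w.r.t. the connections
        for p, g in zip(model.params, grads):      # nudge connections to reduce the error
            p -= lr * g
    return model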
Parameter learning: back propagation of error
• Calculate the total error at the top
• Calculate contributions to the error at each step, going backwards
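A compact sketch of these two steps for a two-layer sigmoid network with squared error; weight names and shapes are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, W2):
    # Forward pass, total error at the top, then error contributions going backwards.
    h = sigmoid(W1 @ x)                            # hidden activations
    out = sigmoid(W2 @ h)                          # network output
    err = out - y                                  # total error at the top (squared-error gradient)
    delta2 = err * out * (1 - out)                 # error at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)         # error pushed back to the hidden layer
    return np.outer(delta1, x), np.outer(delta2, h)   # gradients for W1 and W2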
Stochastic gradient descent (SGD)
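A minimal sketch of one SGD step, using a mini-batch gradient estimate; grad_fn and the mini-batch format are assumptions for illustration:

import numpy as np

def sgd_step(w, minibatch, grad_fn, lr=0.01):
    # Estimate the gradient on a small random mini-batch rather than the full
    # dataset, then move the parameters a small step against it.
    g = np.mean([grad_fn(w, x, y) for x, y in minibatch], axis=0)
    return w - lr * g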
Fact
Anything humans can do in 0.1 sec, the right big 10-layer network can do too
DEEP LEARNING DRIVING FACTORS
Big Data
Source: Eric P. Xing
Computing resources
"Modern" neural networks
• Deeper but faster-to-train models
  – Deep belief networks
  – ConvNets
  – RNNs (LSTM, GRU)
SCALING DISTRIBUTED DEEP LEARNING
Growing Model Complexity
Source: Eric P. Xing
Objective: minimizing time to results
• Shorten experiment turnaround time
• Make it fast rather than optimizing resource usage
Objective: improving results
• Fact: increasing the number of training examples, model parameters, or both can drastically improve ultimate classification accuracy
  – D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
  – R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
Scaling deep learning
• Leverage GPUs
• Exploit many kinds of parallelism
  – Model parallelism
  – Data parallelism
Why scale out?
• We can use a cluster of machines to train a modestly sized speech model to the same classification accuracy in less than 1/10th the time required on a GPU
Model parallelism
• Parallelism in DistBelief
Model parallelism [cont'd]
• Message passing during the upward and downward phases
• Distributed computation
• Performance gains are limited by communication costs
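A toy sketch of this scheme: layers are split across workers, activations flow up and gradients flow back down. The send/recv helpers and the layer interface stand in for whatever transport is actually used and are assumptions of mine, not DistBelief's API:

def model_parallel_worker(rank, num_workers, my_layers, send, recv,
                          read_input_batch, loss_gradient):
    # One worker's share of a model-parallel forward/backward pass.
    # Upward (forward) phase: activations flow from worker 0 upward.
    acts = read_input_batch() if rank == 0 else recv(rank - 1)
    for layer in my_layers:
        acts = layer.forward(acts)
    if rank < num_workers - 1:
        send(rank + 1, acts)

    # Downward (backward) phase: gradients flow back down.
    grad = loss_gradient(acts) if rank == num_workers - 1 else recv(rank + 1)
    for layer in reversed(my_layers):
        grad = layer.backward(grad)       # also accumulates this worker's weight gradients
    if rank > 0:
        send(rank - 1, grad)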
Source: Jeff Dean
Data parallelism: Downpour SGD
• Divide the training data into a number of subsets
• Run a copy of the model on each of these subsets
• Before processing each mini-batch, a model replica
  – asks the parameter servers for up-to-date parameters
  – processes the mini-batch
  – sends back the gradients
• To reduce communication overhead
  – fetch from the parameter servers only every n_fetch steps, and push updates only every n_push steps
• A model replica is almost certainly working on a set of out-of-date parameters
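A sketch of one replica's loop; ps.pull()/ps.push() are a hypothetical parameter-server interface, and n_fetch/n_push mirror the counters described above:

def downpour_replica(model, data_shard, ps, lr=0.01, n_fetch=5, n_push=5):
    # One model replica training asynchronously on its own subset of the data.
    accumulated = None
    for step, minibatch in enumerate(data_shard, start=1):
        if (step - 1) % n_fetch == 0:
            model.set_params(ps.pull())               # ask the parameter servers for fresh parameters
        grads = model.compute_gradients(minibatch)    # process the mini-batch
        model.apply_gradients(grads, lr)              # keep updating the (possibly stale) local copy
        accumulated = grads if accumulated is None else \
            [a + g for a, g in zip(accumulated, grads)]
        if step % n_push == 0:
            ps.push(accumulated)                      # send the accumulated gradients back
            accumulated = None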
Sandblaster
• The coordinator assigns each of the N model replicas a small portion of work, much smaller than 1/Nth of the total size of a batch
• It assigns replicas new portions whenever they are free
• It schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first
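A rough sketch of the coordinator logic; the replica interface (is_free, assign, poll_result) is hypothetical:

from collections import deque

def sandblaster_coordinator(portion_ids, replicas):
    # Hand out portions much smaller than 1/Nth of a batch, keep replicas busy,
    # and near the end run duplicate copies, keeping whichever finishes first.
    todo = deque(portion_ids)
    outstanding, done = set(), set()
    while len(done) < len(portion_ids):
        for r in replicas:
            if not r.is_free():
                continue
            if todo:
                pid = todo.popleft()
            elif outstanding:
                pid = next(iter(outstanding))      # schedule an extra copy of a straggler
            else:
                break
            outstanding.add(pid)
            r.assign(pid)
        for r in replicas:
            pid = r.poll_result()                  # None if this replica has nothing finished
            if pid is not None and pid not in done:
                done.add(pid)
                outstanding.discard(pid)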
AllReduce – Baidu DeepImage, 2015
• Each worker computes gradients and maintains a subset of the parameters
• Every node fetches up-to-date parameters from all other nodes
• Optimization: butterfly synchronization
  – Requires log(N) steps
  – The last step performs the broadcast
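A sketch of a butterfly (recursive-doubling) all-reduce over gradients: with a power-of-two number of workers, each node exchanges and sums partial results with a partner whose rank differs in one bit, reaching the full sum in log2(N) steps. The exchange helper stands in for the real transport and is an assumption:

import numpy as np

def butterfly_allreduce(rank, num_workers, local_grad, exchange):
    # Recursive-doubling all-reduce: log2(num_workers) exchange-and-sum steps.
    assert num_workers & (num_workers - 1) == 0, "sketch assumes a power-of-two worker count"
    result = np.asarray(local_grad, dtype=float).copy()
    step = 1
    while step < num_workers:
        partner = rank ^ step                          # partner's rank differs in exactly one bit
        result = result + exchange(partner, result)    # send ours, receive the partner's partial sum
        step <<= 1
    return result                                      # every node now holds the sum of all gradients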
Butterfly barrier
Distributed Hogwild
• Used by Caffe
• Each node maintains a local replica of all parameters
• In each iteration, a node computes gradients and applies the updates locally
• Nodes exchange updates periodically
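A sketch of one node's loop under this scheme; exchange_with_peers and the model interface are placeholders, and the simple parameter averaging is my own illustrative choice:

def hogwild_node(model, data_shard, exchange_with_peers, lr=0.01, sync_every=100):
    # Each node trains its own full replica and loosely syncs with peers now and then.
    for step, minibatch in enumerate(data_shard, start=1):
        grads = model.compute_gradients(minibatch)    # compute gradients locally
        model.apply_gradients(grads, lr)              # update only the local replica
        if step % sync_every == 0:                    # exchange updates periodically
            peer_params = exchange_with_peers(model.get_params())
            merged = [(mine + theirs) / 2
                      for mine, theirs in zip(model.get_params(), peer_params)]
            model.set_params(merged)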
DISTRIBUTED DEEP LEARNING FRAMEWORKS
Parameter server [OSDI 2014]
Apache Singa [2015]
• National University of Singapore
Petuum CMU [ACML 2015]
Stale Synchronous Parallel (SSP)
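The key idea of SSP is a bounded-staleness barrier: a fast worker may run ahead of the slowest one by at most a fixed number of clocks. A minimal sketch of that check, assuming a shared table of per-worker clocks (all names here are illustrative):

import time

def ssp_barrier(my_clock, my_id, worker_clocks, staleness=3, poll=0.01):
    # Block until the slowest worker is within `staleness` clocks of this one.
    worker_clocks[my_id] = my_clock                   # publish this worker's progress
    while my_clock - min(worker_clocks.values()) > staleness:
        time.sleep(poll)                              # fast workers wait for stragglers
    # Within the bound: safe to read (possibly slightly stale) parameters and continue.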
Structure-Aware Parallelization (Strads engine)
TensorFlow
• Data flow graph
• The distributed version has just been released (based on gRPC)
Deep learning on Spark
• Deeplearning4j
• Adatao/Arimo scaling TensorFlow on Spark
• Yahoo released CaffeOnSpark
• Data parallelism
DEMO APPLICATIONS
Vietnamese OCR
• Recognizes whole text lines rather than individual words or characters
• Very good results with just a ~20 MB model and ~30 pages
Vietnamese predictive text model
• ~20 MB plain-text corpus
• Chú hoài linh đẹp trai. Chú hoài linh
• Chào buổi sáng
• chị hát hay wa!! nghe thick a.
• chị khởi my ơi e rất la hâm mộ
• làm gì bây giờ khi
• chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá
• chú hoài linh thật đẹp zai và chú Phánh
• ~14 MB plain-text corpus
• lịch sử ghi nhớ năm 1979
• tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt
• tại hội nghị, đồng chí Hồ Chí Minh nói
• tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí Minh đã ngồi ở
• tại đại hội Đảng lần thứ nhất vào năm 1945,
• Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư Nguyễn Văn Linh
CONCLUSION
Principles of ML System Design
• ACML 2015. How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning
  – How to distribute?
  – How to bridge computation and communication?
  – How to communicate?
  – What to communicate?
Thank you!