Recent progress on distributing deep learning
Viet-Trung Tran, KDE lab
Department of Information Systems, School of Information and Communication Technology
Outline
• State of the art
• Overview of neural networks and deep learning
• Deep learning driving factors
• Scaling deep learning
Perceptron
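As a reference point, a minimal perceptron in Python; the update rule below is a generic textbook sketch, not code from the talk:

import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    # Train a single perceptron: predict sign(w.x + b), nudge weights on mistakes.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):              # yi is +1 or -1
            pred = 1 if xi @ w + b > 0 else -1
            if pred != yi:                    # update only on misclassified examples
                w += lr * yi * xi
                b += lr * yi
    return w, b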
Feed forward neural network
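A small sketch of what a feed-forward pass looks like, assuming fully connected layers with a sigmoid nonlinearity (purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Propagate an input through successive fully connected layers.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)                # affine map followed by a nonlinearity
    return a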
Training algorithm
• while not done yet – pick a random training case (x, y) – run neural network on input x – modify connections to make prediction closer to
y, follow the gradient of the error w.r.t. the connections
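The loop above, written as a rough Python sketch; the model interface (forward, backward, params) is a placeholder I'm assuming, not an API from the talk:

import random

def train(model, data, lr=0.01, steps=10000):
    # Pick a random case, run the network, step along the error gradient.
    for _ in range(steps):
        x, y = random.choice(data)                 # pick a random training case (x, y)
        prediction = model.forward(x)              # run the neural network on input x
        grads = model.backward(prediction, y)      # gradient of the error w.r.t. the connections
        for p, g in zip(model.params, grads):      # nudge connections to reduce the error
            p -= lr * g
    return model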
Parameter learning: back propagation of error
• Calculate the total error at the top
• Calculate contributions to the error at each step, going backwards
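A compact sketch of these two steps for a two-layer sigmoid network with squared error; weight names and shapes are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, W2):
    # Forward pass, total error at the top, then error contributions going backwards.
    h = sigmoid(W1 @ x)                            # hidden activations
    out = sigmoid(W2 @ h)                          # network output
    err = out - y                                  # total error at the top (squared-error gradient)
    delta2 = err * out * (1 - out)                 # error at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)         # error pushed back to the hidden layer
    return np.outer(delta1, x), np.outer(delta2, h)   # gradients for W1 and W2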
Stochastic gradient descent (SGD)
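A minimal sketch of one SGD step, using a mini-batch gradient estimate; grad_fn and the mini-batch format are assumptions for illustration:

import numpy as np

def sgd_step(w, minibatch, grad_fn, lr=0.01):
    # Estimate the gradient on a small random mini-batch rather than the full
    # dataset, then move the parameters a small step against it.
    g = np.mean([grad_fn(w, x, y) for x, y in minibatch], axis=0)
    return w - lr * g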
Fact
Anything humans can do in 0.1 sec, the right big 10-layer network can do too
DEEP LEARNING DRIVING FACTORS
Big Data
Source: Eric P. Xing
Computing resources
"Modern" neural networks
• Deeper but faster-to-train models
  – Deep belief networks
  – ConvNets
  – RNNs (LSTM, GRU)
SCALING DISTRIBUTED DEEP LEARNING
Growing Model Complexity
Source: Eric P. Xing
Objective: minimizing time to results
• Shorten experiment turnaround time
• Make it fast rather than optimizing resource usage
Objective: improving results
• Fact: increasing the number of training examples, model parameters, or both can drastically improve ultimate classification accuracy
  – D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
  – R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
Scaling deep learning
• Leverage GPUs
• Exploit many kinds of parallelism
  – Model parallelism
  – Data parallelism
Why scale out?
• We can use a cluster of machines to train a modestly sized speech model to the same classification accuracy in less than 1/10th the time required on a GPU
Model parallelism
• Parallelism in DistBelief
Model parallelism [cont'd]
• Message passing during the upward and downward phases
• Distributed computation
• Performance gains are limited by communication costs
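A toy sketch of this scheme: layers are split across workers, activations flow up and gradients flow back down. The send/recv helpers and the layer interface stand in for whatever transport is actually used and are assumptions of mine, not DistBelief's API:

def model_parallel_worker(rank, num_workers, my_layers, send, recv,
                          read_input_batch, loss_gradient):
    # One worker's share of a model-parallel forward/backward pass.
    # Upward (forward) phase: activations flow from worker 0 upward.
    acts = read_input_batch() if rank == 0 else recv(rank - 1)
    for layer in my_layers:
        acts = layer.forward(acts)
    if rank < num_workers - 1:
        send(rank + 1, acts)

    # Downward (backward) phase: gradients flow back down.
    grad = loss_gradient(acts) if rank == num_workers - 1 else recv(rank + 1)
    for layer in reversed(my_layers):
        grad = layer.backward(grad)       # also accumulates this worker's weight gradients
    if rank > 0:
        send(rank - 1, grad)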
Source: Jeff Dean
Data parallelism: Downpour SGD
• Divide the training data into a number of subsets
• Run a copy of the model on each of these subsets
• Before processing each mini-batch, a model replica
  – asks the parameter servers for up-to-date parameters
  – processes the mini-batch
  – sends back the gradients
• To reduce communication overhead
  – fetch from the parameter servers only every n_fetch steps, and push updates only every n_push steps
• A model replica is almost certainly working on a set of out-of-date parameters
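A sketch of one replica's loop; ps.pull()/ps.push() are a hypothetical parameter-server interface, and n_fetch/n_push mirror the counters described above:

def downpour_replica(model, data_shard, ps, lr=0.01, n_fetch=5, n_push=5):
    # One model replica training asynchronously on its own subset of the data.
    accumulated = None
    for step, minibatch in enumerate(data_shard, start=1):
        if (step - 1) % n_fetch == 0:
            model.set_params(ps.pull())               # ask the parameter servers for fresh parameters
        grads = model.compute_gradients(minibatch)    # process the mini-batch
        model.apply_gradients(grads, lr)              # keep updating the (possibly stale) local copy
        accumulated = grads if accumulated is None else \
            [a + g for a, g in zip(accumulated, grads)]
        if step % n_push == 0:
            ps.push(accumulated)                      # send the accumulated gradients back
            accumulated = None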
Sandblaster
• The coordinator assigns each of the N model replicas a small portion of work, much smaller than 1/Nth of the total size of a batch
• It assigns replicas new portions whenever they are free
• It schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first
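A rough sketch of the coordinator logic; the replica interface (is_free, assign, poll_result) is hypothetical:

from collections import deque

def sandblaster_coordinator(portion_ids, replicas):
    # Hand out portions much smaller than 1/Nth of a batch, keep replicas busy,
    # and near the end run duplicate copies, keeping whichever finishes first.
    todo = deque(portion_ids)
    outstanding, done = set(), set()
    while len(done) < len(portion_ids):
        for r in replicas:
            if not r.is_free():
                continue
            if todo:
                pid = todo.popleft()
            elif outstanding:
                pid = next(iter(outstanding))      # schedule an extra copy of a straggler
            else:
                break
            outstanding.add(pid)
            r.assign(pid)
        for r in replicas:
            pid = r.poll_result()                  # None if this replica has nothing finished
            if pid is not None and pid not in done:
                done.add(pid)
                outstanding.discard(pid)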
AllReduce – Baidu DeepImage, 2015
• Each worker computes gradients and maintains a subset of the parameters
• Every node fetches up-to-date parameters from all other nodes
• Optimization: butterfly synchronization
  – Requires log(N) steps
  – The last step performs the broadcast
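A sketch of a butterfly (recursive-doubling) all-reduce over gradients: with a power-of-two number of workers, each node exchanges and sums partial results with a partner whose rank differs in one bit, reaching the full sum in log2(N) steps. The exchange helper stands in for the real transport and is an assumption:

import numpy as np

def butterfly_allreduce(rank, num_workers, local_grad, exchange):
    # Recursive-doubling all-reduce: log2(num_workers) exchange-and-sum steps.
    assert num_workers & (num_workers - 1) == 0, "sketch assumes a power-of-two worker count"
    result = np.asarray(local_grad, dtype=float).copy()
    step = 1
    while step < num_workers:
        partner = rank ^ step                          # partner's rank differs in exactly one bit
        result = result + exchange(partner, result)    # send ours, receive the partner's partial sum
        step <<= 1
    return result                                      # every node now holds the sum of all gradients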
Butterfly barrier
Distributed Hogwild
• Used by Caffe
• Each node maintains a local replica of all parameters
• In each iteration, a node computes gradients and applies the updates locally
• Nodes exchange updates periodically
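A sketch of one node's loop under this scheme; exchange_with_peers and the model interface are placeholders, and the simple parameter averaging is my own illustrative choice:

def hogwild_node(model, data_shard, exchange_with_peers, lr=0.01, sync_every=100):
    # Each node trains its own full replica and loosely syncs with peers now and then.
    for step, minibatch in enumerate(data_shard, start=1):
        grads = model.compute_gradients(minibatch)    # compute gradients locally
        model.apply_gradients(grads, lr)              # update only the local replica
        if step % sync_every == 0:                    # exchange updates periodically
            peer_params = exchange_with_peers(model.get_params())
            merged = [(mine + theirs) / 2
                      for mine, theirs in zip(model.get_params(), peer_params)]
            model.set_params(merged)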
DISTRIBUTED DEEP LEARNING FRAMEWORKS
Parameter server [OSDI 2014]
Apache Singa [2015]
• National University of Singapore
Petuum CMU [ACML 2015]
Stale Synchronous Parallel (SSP)
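The key idea of SSP is a bounded-staleness barrier: a fast worker may run ahead of the slowest one by at most a fixed number of clocks. A minimal sketch of that check, assuming a shared table of per-worker clocks (all names here are illustrative):

import time

def ssp_barrier(my_clock, my_id, worker_clocks, staleness=3, poll=0.01):
    # Block until the slowest worker is within `staleness` clocks of this one.
    worker_clocks[my_id] = my_clock                   # publish this worker's progress
    while my_clock - min(worker_clocks.values()) > staleness:
        time.sleep(poll)                              # fast workers wait for stragglers
    # Within the bound: safe to read (possibly slightly stale) parameters and continue.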
Structure-Aware Parallelization (Strads engine)
TensorFlow
• Data flow graph
• The distributed version has just been released (based on gRPC)
Deep learning on Spark
• Deeplearning4j
• Adatao/Arimo scaling TensorFlow on Spark
• Yahoo released CaffeOnSpark
• Data parallelism
DEMO APPLICATIONS
Vietnamese OCR
• Recognizes whole text lines rather than individual words or characters
• Very good results with just a ~20 MB model and ~30 pages
Vietnamese predictive text model
• ~20 MB plain-text corpus
• Chú hoài linh đẹp trai. Chú hoài linh
• Chào buổi sáng
• chị hát hay wa!! nghe thick a.
• chị khởi my ơi e rất la hâm mộ
• làm gì bây giờ khi
• chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá
• chú hoài linh thật đẹp zai và chú Phánh
• ~14 MB plain-text corpus
• lịch sử ghi nhớ năm 1979
• tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt
• tại hội nghị, đồng chí Hồ Chí Minh nói
• tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí Minh đã ngồi ở
• tại đại hội Đảng lần thứ nhất vào năm 1945,
• Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư Nguyễn Văn Linh
CONCLUSION
Principles of ML System Design
• ACML 2015. How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning
  – How to distribute?
  – How to bridge computation and communication?
  – How to communicate?
  – What to communicate?
Thank you!