Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Florin Rusu
Yujing Ma, Martin Torres (Ph.D. students)
University of California Merced

Source: faculty.ucmerced.edu/frusu/Talks/2017-06-google-gd-gpu.pdf

Jul 27, 2018

Transcript
Page 1:

Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?

Florin Rusu

Yujing Ma, Martin Torres (Ph.D. students)

University of California Merced

Page 2:

Machine Learning (ML) Boom

• Two SIGMOD 2017 tutorials

Page 3:

ML Systems

General purpose (databases)
• BIDMach
• Bismarck
• Cumulon
• DeepDive
• DimmWitted
• GLADE
• GraphLab
• MADlib
• Mahout
• MLlib (MLbase)
• SimSQL (BUDS)
• SystemML
• Vowpal Wabbit
• …

Deep learning

• Caffe (con Troll)

• CNTK

• DL4J

• Keras

• MXNet

• SINGA

• TensorFlow

• Theano

• Torch

• …

Page 4:

ML Hardware Accelerators

Page 5:

ML Systems with GPU Acceleration

General purpose
• BIDMach
• Bismarck
• Cumulon
• DeepDive
• DimmWitted
• GLADE
• GraphLab
• MADlib
• Mahout
• MLlib (MLbase)
• SimSQL (BUDS)
• SystemML
• Vowpal Wabbit
• …

Deep learning

• Caffe

• CNTK

• DL4J

• Keras

• MXNet

• SINGA

• TensorFlow

• Theano

• Torch

• …

Page 6:

ML in Databases

• It is not so much about deep learning
  – Regression (linear, logistic)
  – Classification (SVM)
  – Recommendation (LMF)

• Mostly about training
  – Inside the DB, close to the data
  – Over joins or factorized databases
  – Compressed data, (compressed) large models

• Selection of optimization algorithm and hyper-parameters
  – BGD vs. SGD vs. SCD

Page 7:

Classification Tasks

• Logistic regression (LR)

• Support Vector Machines (SVM)
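The transcript omits the loss formulas behind these two classifiers, but their per-example gradients can be sketched as follows. This is a minimal pure-Python illustration with names of my own choosing, not the talk's code; real systems vectorize these computations:

```python
import math

def lr_gradient(w, x, y):
    """Per-example logistic regression gradient, labels y in {-1, +1}."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    # sigmoid(-y * w.x) scales this example's contribution
    scale = -y / (1.0 + math.exp(y * dot))
    return [scale * xi for xi in x]

def svm_gradient(w, x, y):
    """Per-example (sub)gradient of the SVM hinge loss, labels y in {-1, +1}."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    if y * dot < 1.0:          # inside the margin: non-zero gradient
        return [-y * xi for xi in x]
    return [0.0] * len(w)      # classified with margin: zero gradient
```

Both gradients depend on the data only through dot products with the example, which is why sparse examples yield sparse updates, a property Hogwild (later in the talk) relies on.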

Page 8:

Datasets and Platforms

S. Sallinen et al: “High Performance Parallel Stochastic Gradient Descent in Shared Memory” in IPDPS 2016.

• CPU: Intel Xeon E5-2660 (14 cores, 28 threads)

• GPU: Tesla K80 (use only one multiprocessor)

Page 9:

Experiments
• Stochastic gradient descent (SGD) optimizer: mini-batch with batch size 4096
• Average time per iteration over 100 iterations (only the iteration time is measured)
• TensorFlow and MXNet support only dense data: covtype and w8a are “densified”; the other datasets do not fit in GPU memory

LR

SVM

Page 10:

Research Questions

• Why is GPU not significantly better than CPU on LR and SVM models?

– The gain in deep nets seems to come mostly from convolutions, not gradient computations

– Sparse matrix-vector (SpMV) and sparse matrix-matrix (SpMM) multiplication are harder to optimize

• Can we improve the GPU performance?

Page 11:

Gradient Descent

Page 12:

(Mini-)Batch Gradient Descent (BGD)
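The slide itself is a figure; as a reference point, one (mini-)batch step averages the per-example gradients over the batch and applies a single model update. A minimal sketch, with placeholder names (`bgd_step`, `grad_fn` are mine, not the talk's):

```python
def bgd_step(w, batch, grad_fn, lr=0.1):
    """One (mini-)batch gradient descent step: sum per-example
    gradients over the batch, average, then update the model once."""
    d = len(w)
    g = [0.0] * d
    for x, y in batch:
        gx = grad_fn(w, x, y)          # per-example gradient
        for j in range(d):
            g[j] += gx[j]
    n = len(batch)
    # single synchronous update per batch
    return [wj - lr * gj / n for wj, gj in zip(w, g)]
```

Because every example in the batch is evaluated against the same model snapshot, the per-example gradient computations are trivially parallel, which is what makes BGD a natural fit for GPUs.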

Page 13:

Parallel BGD
• Parallel execution on CPU and GPU
• Synchronous execution on CPU

Page 14:

Stochastic Gradient Descent (SGD)

Page 15:

Parallel SGD (Hogwild)

• No synchronization or locks
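Hogwild's key idea is that worker threads read and write a shared model with no locks at all, tolerating occasional races because sparse updates rarely touch the same coordinates. An illustrative sketch only, with my own function names; in CPython the GIL serializes much of the work, and the implementations discussed in the talk run as native CPU or GPU code:

```python
import threading

def hogwild_sgd(w, examples, grad_fn, lr=0.01, n_threads=4):
    """Lock-free parallel SGD (Hogwild): every thread updates the
    shared model w in place, with no synchronization between updates."""
    def worker(shard):
        for x, y in shard:
            g = grad_fn(w, x, y)          # read the (possibly stale) model
            for j, gj in enumerate(g):
                if gj != 0.0:             # sparse updates rarely collide
                    w[j] -= lr * gj       # unsynchronized in-place write
    shards = [examples[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

With more than one thread the result is non-deterministic: threads may compute gradients against a model that another thread is concurrently overwriting, which trades statistical efficiency for hardware efficiency.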

Page 16:

BGD vs. SGD

Page 17:

GPU Architecture Tesla K80 (GK210)
• # MP = 13
• # cores/MP = 192
• # warps/MP = 64
• # blocks/MP = 16
• # threads/MP = 2048
• # threads/warp (SIMD) = 32
• # threads/block = 1024
• # registers/MP = 2^17
• # registers/block = 2^16
• # registers/thread = 255
• Shared mem/MP = 112KB
• Shared mem/block = 48KB
• L1 cache = 48KB
• Read-only texture = 48KB
• L2 cache = 1.5MB
• Global mem = 12GB

Page 18:

Map Hogwild to GPU

Algorithm:
1. Copy data and model to GPU
2. While not converged do
   1. Execute kernel update_model that implements Hogwild
3. End while
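In host-side driver form, the loop on this slide can be sketched as below. These Python stand-ins are assumptions for illustration: the real `update_model` is a CUDA kernel and the copies are host-to-device transfers, here simulated with plain list copies:

```python
def train_on_gpu(data, model, launch_update_kernel, converged, max_iters=1000):
    """Driver loop for the slide's algorithm: copy data and model to the
    device once, then repeatedly launch a Hogwild-style update kernel
    until convergence. Device transfers are simulated by list copies."""
    d_data = list(data)      # stand-in for host-to-device copy of the data
    d_model = list(model)    # stand-in for host-to-device copy of the model
    it = 0
    while not converged(d_model) and it < max_iters:
        launch_update_kernel(d_data, d_model)   # one Hogwild pass on the device
        it += 1
    return d_model
```

The important property is that data and model cross the PCIe bus once, before the loop; every kernel launch then works entirely out of GPU global memory.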

Page 19:

Design Space

Data access
• Storage scheme
  • Row-store
  • Column-store
• Partitioning
  • Round-robin
  • Chunking

Data replication
• Number of threads accessing an example
  • 1-way
  • K-way

Model replication
• Where the model is stored in the GPU memory hierarchy
  • Per thread (registers)
  • Per block (shared memory)
  • Per kernel (global memory)

Page 20:

Evaluation Metrics

• Metrics from DimmWitted by Zhang and Ré in PVLDB 2014

• Hardware efficiency

– Time to convergence

• Statistical efficiency

– Number of iterations to convergence

Page 21:

Data Access – Storage Scheme

Page 22:

Data Access – Partitioning

Page 23:

Evaluation – Dense Data (covtype)

[Plots: hardware efficiency and statistical efficiency]

Page 24:

Evaluation – Sparse Data (news)

[Plots: hardware efficiency and statistical efficiency]

Page 25:

Data Replication

Page 26:

Evaluation – Dense Data (covtype)

[Plots: hardware efficiency and statistical efficiency]

Page 27:

Evaluation – Sparse Data (news)

[Plots: hardware efficiency and statistical efficiency]

Page 28:

Model Replication – PerThread

Page 29:

Model Replication – PerBlock

Page 30:

Model Replication – PerKernel

Page 31:

Evaluation – Dense Data (covtype)

[Plots: hardware efficiency and statistical efficiency]

Page 32:

Evaluation – Sparse Data (news)

[Plots: hardware efficiency and statistical efficiency]

Page 33:

Comparison with Synchronous SGD

Page 34:

CPU vs. GPU

Page 35:

Conclusions

• Synchronous mini-batch in deep learning systems is rarely faster in convergence on GPU than on CPU

• Asynchronous SGD on GPU is always faster in time per iteration than synchronous mini-batch on GPU

• Asynchronous SGD on GPU is sometimes faster in convergence than asynchronous SGD on CPU

Page 36:

Thank you.

Questions?