Training Recurrent Neural Networks at Scale

Erich Elsen, Research Scientist, Baidu Research at MLconf NYC - 4/15/16

Transcript

Page 1: Training Recurrent Neural Networks at Scale

Erich Elsen, Research Scientist, Baidu Research

Page 2: Natural User Interfaces

• Goal: Make interacting with computers as natural as interacting with humans

• AI problems:

– Speech recognition

– Emotional recognition

– Semantic understanding

– Dialog systems

– Speech synthesis

Page 3: Deep Speech Applications

• Voice controlled apps

• Peel Partnership

• English and Mandarin APIs in the US

• Integration into Baidu’s products in China

Page 4: Deep Speech: End-to-end learning

• Deep neural network predicts probability of characters directly from audio

[Diagram: stacked network layers mapping audio input to per-timestep character outputs "T H _ E … D O G"]

Page 5: Connectionist Temporal Classification

Page 6: Deep Speech: CTC

Character   t=1    t=2    t=3    t=4    t=5    t=6    (Time →)
E           .01    .05    .1     .1     .8     .05
H           .01    .1     .1     .6     .05    .05
T           .01    .8     .75    .2     .05    .1
BLANK       .97    .05    .05    .1     .1     .8

• Simplified sequence of network outputs (probabilities)

• Generally many more timesteps than letters

• Need to look at all the ways we can write “the”

• Adjacent characters collapse: TTTHEE, TTTTHE, TTHHEE, THEEEE, …

• Solve with dynamic programming (see the sketch below)

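A minimal sketch (not the warp-ctc code) of the collapsing rule behind the table above: repeated characters merge and blanks drop out, so many frame-level alignments score the same transcript. The brute-force enumeration at the end is only for illustration; the real loss sums these alignment probabilities with dynamic programming.

    import itertools

    BLANK = "_"

    def collapse(alignment):
        """Map a frame-level alignment to its transcript: merge repeats, drop blanks."""
        out, prev = [], None
        for ch in alignment:
            if ch != prev and ch != BLANK:
                out.append(ch)
            prev = ch
        return "".join(out)

    assert collapse("TTTHEE") == "THE"
    assert collapse("THEEEE") == "THE"
    assert collapse("_T_H_E_") == "THE"

    # CTC sums the probability of every alignment that collapses to "THE".
    # Enumerating them is exponential in the number of timesteps, which is
    # why the loss is computed with a dynamic program instead.
    alignments = ["".join(a) for a in itertools.product("THE" + BLANK, repeat=6)
                  if collapse(a) == "THE"]
    print(len(alignments), "length-6 alignments collapse to 'THE'")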

Page 7: warp-ctc

• Recently open sourced our CTC implementation

• Efficient, parallel CPU and GPU backend

• 100-400X faster than other implementations

• Apache license, C interface: https://github.com/baidu-research/warp-ctc

Page 8: Accuracy scales with Data

[Plot: performance vs. data & model size; deep learning algorithms keep improving with more data and bigger models, while many previous methods plateau]

• 40% error reduction for each 10x increase in dataset size
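To make that trend concrete, here is a toy calculation (the starting error rate is hypothetical): a 40% relative reduction per 10x more data means err(10n) = 0.6 × err(n), i.e. error falls roughly as n to the power log10(0.6) ≈ -0.22.

    # Illustrative only: assume a hypothetical 30% starting error rate.
    err = 30.0
    for decade in range(1, 4):              # 10x, 100x, 1000x more data
        err *= 0.6                          # 40% relative reduction per decade
        print(f"{10 ** decade:>5}x data -> {err:.1f}% error")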

Page 9: Training sets

• Train on ~1½ years of data (and growing)

• English and Mandarin

• End-to-end deep learning is key to assembling large datasets

• Datasets drive accuracy

Page 10: Large Datasets = Large Models

[Plot: accuracy vs. dataset size; a big model keeps improving as data grows, while a small model saturates]

• Models require over 20 Exa-flops to train (exa = 10^18)

• Trained on 4+ Terabytes of audio
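Combining the 20 exaflops on this slide with the roughly 50 teraflop/s sustained rate quoted on the conclusion slide gives a back-of-the-envelope wall-clock time for one training run:

    total_work = 20e18          # ~20 exaflops to train one model (this slide)
    sustained = 50e12           # ~50 teraflop/s sustained on one model (conclusion slide)
    seconds = total_work / sustained
    print(f"about {seconds / 86400:.1f} days per training run")   # ≈ 4.6 days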

Page 11: Virtuous Cycle of Innovation

[Cycle diagram: Perform Experiment → Learn → Design New Experiment → Iterate]

Page 12: Experiment Scaling

• Batch Norm impact with deeper networks

• Sequence-wise normalization (see the sketch below)
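A minimal sketch of one common reading of sequence-wise normalization (an illustration, not Baidu's implementation): for a recurrent layer, the statistics are computed over both the minibatch and the time dimension, then a learned per-feature scale and shift are applied.

    import numpy as np

    def sequence_wise_norm(x, gamma, beta, eps=1e-5):
        """Normalize pre-activations of shape (time, batch, features)
        using statistics taken over the time and batch axes together."""
        mean = x.mean(axis=(0, 1), keepdims=True)
        var = x.var(axis=(0, 1), keepdims=True)
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    x = np.random.randn(50, 32, 256)          # 50 timesteps, batch of 32, 256 features
    gamma = np.ones((1, 1, 256))              # learned scale (initialized to 1)
    beta = np.zeros((1, 1, 256))              # learned shift (initialized to 0)
    y = sequence_wise_norm(x, gamma, beta)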

Page 13: Parallelism across GPUs

[Diagram: model parallelism splits one model across GPUs; data parallelism runs a full replica per GPU, each on its own shard of the training data, combining gradients with MPI_Allreduce()]

For these models, Data Parallelism works best
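A hedged sketch of the data-parallel pattern in the diagram (using mpi4py for illustration; the production system operates on raw GPU buffers): every worker holds a full model replica, computes gradients on its own shard of the data, and the gradients are summed across workers with an all-reduce before each weight update.

    # Run with e.g.: mpirun -np 8 python data_parallel_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nworkers = comm.Get_size()

    weights = np.zeros(1000, dtype=np.float32)   # every worker holds the same replica
    lr = 1e-3

    for step in range(100):
        # Stand-in for the forward/backward pass on this worker's minibatch.
        local_grad = np.random.randn(*weights.shape).astype(np.float32)
        # Sum gradients across all workers in place (MPI_Allreduce), then average.
        comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
        weights -= lr * (local_grad / nworkers)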

Page 14: Performance for RNN training

• 55% of GPU FMA peak using a single GPU

• ~48% of peak using 8 GPUs in one node

• Weak scaling is very efficient, although the larger batches it implies are algorithmically challenging

[Plot: sustained TFLOP/s (log scale, 1 to 512) vs. number of GPUs (1 to 128), covering single-node and multi-node runs; the typical training run regime is marked]

Page 15: All-reduce

• We implemented our own all-reduce out of send and receive

• Several algorithm choices based on size

• Careful attention to affinity and topology
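The slide doesn't spell out which algorithms are chosen, so the following is only a toy simulation of one standard choice for large messages, a ring-style all-reduce built from pairwise sends and receives: a reduce-scatter pass leaves each worker with one fully summed chunk, and an all-gather pass circulates those chunks until everyone holds the complete result.

    import numpy as np

    def ring_allreduce(grads):
        """Simulate P workers summing their gradient vectors around a ring."""
        P = len(grads)
        chunks = [np.array_split(g.astype(np.float64), P) for g in grads]

        # Reduce-scatter: after P-1 steps, worker i holds the full sum of chunk (i+1) % P.
        for s in range(P - 1):
            for i in range(P):
                c = (i - s - 1) % P                       # chunk arriving at worker i this step
                chunks[i][c] = chunks[i][c] + chunks[(i - 1) % P][c]

        # All-gather: circulate the reduced chunks, overwriting (no further sums).
        for s in range(P - 1):
            for i in range(P):
                c = (i - s) % P
                chunks[i][c] = chunks[(i - 1) % P][c].copy()

        return [np.concatenate(c) for c in chunks]

    grads = [np.random.randn(10).astype(np.float32) for _ in range(4)]
    result = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in result)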

Page 16: Scalability

• Batch size is hard to increase – algorithm, memory limits

• Performance at small batch sizes (32, 64) leads to scalability limits
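Illustrative arithmetic only (the 512 global batch cap is an assumption, not a number from the slides): once the total batch stops growing, adding GPUs under data parallelism shrinks each GPU's share toward the 32 to 64 range where per-GPU efficiency falls off.

    GLOBAL_BATCH = 512                      # assumed cap, for illustration
    for gpus in (1, 2, 4, 8, 16):
        per_gpu = GLOBAL_BATCH // gpus
        print(f"{gpus:>2} GPUs -> per-GPU batch {per_gpu}")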

Page 17: Precision

• FP16 also mostly works (see the sketch below)

– Use FP32 for softmax and weight updates

• More sensitive to labeling error

[Histogram: weight distribution; count (log scale) vs. magnitude (log scale, exponents -31 to 0)]
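A minimal sketch (an illustration, not Baidu's training code) of the precision split described above: the bulk of the compute can run in FP16, but the weight update is applied to an FP32 master copy (the slides also keep the softmax in FP32) so small updates are not rounded away.

    import numpy as np

    master_w = np.random.randn(1024, 1024).astype(np.float32)   # FP32 master weights
    lr = 1e-3

    for step in range(10):
        w_fp16 = master_w.astype(np.float16)                    # FP16 copy used for compute
        # Stand-in for an FP16 backward pass producing small gradients.
        grad_fp16 = (np.random.randn(*w_fp16.shape) * 1e-4).astype(np.float16)
        # Apply the update in FP32; many of these small updates would round to
        # zero if accumulated directly into FP16 weights.
        master_w -= lr * grad_fp16.astype(np.float32)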

Page 18: Conclusion

• We have to do experiments at scale

• Pushing compute scaling for end-to-end deep learning

• Efficient training for large datasets

– 50 Teraflops/second sustained on one model

– 20 Exaflops to train each model

• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides
