Training Recurrent Neural Networks at Scale

Erich Elsen, Research Scientist, Baidu Research at MLconf NYC - 4/15/16

Transcript

Page 1: Training Recurrent Neural Networks at Scale

Erich Elsen, Research Scientist, Baidu Research

Page 2: Natural User Interfaces

• Goal: Make interacting with computers as natural as interacting with humans

• AI problems:

– Speech recognition

– Emotional recognition

– Semantic understanding

– Dialog systems

– Speech synthesis

Page 3: Deep Speech Applications

• Voice controlled apps

• Peel Partnership

• English and Mandarin APIs in the US

• Integration into Baidu’s products in China

Page 4: Deep Speech: End-to-end learning

• Deep neural network predicts probability of characters directly from audio

[Diagram: stacked network layers mapping audio input to per-timestep character outputs "T H _ E … D O G"]

Page 5: Connectionist Temporal Classification

Page 6: Deep Speech: CTC

Character   t=1    t=2    t=3    t=4    t=5    t=6    (Time →)
E           .01    .05    .1     .1     .8     .05
H           .01    .1     .1     .6     .05    .05
T           .01    .8     .75    .2     .05    .1
BLANK       .97    .05    .05    .1     .1     .8

• Simplified sequence of network outputs (probabilities)

• Generally many more timesteps than letters

• Need to look at all the ways we can write “the”

• Adjacent characters collapse: TTTHEE, TTTTHE, TTHHEE, THEEEE, …

• Solve with dynamic programming (see the sketch below)

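A minimal sketch (not the warp-ctc code) of the collapsing rule behind the table above: repeated characters merge and blanks drop out, so many frame-level alignments score the same transcript. The brute-force enumeration at the end is only for illustration; the real loss sums these alignment probabilities with dynamic programming.

    import itertools

    BLANK = "_"

    def collapse(alignment):
        """Map a frame-level alignment to its transcript: merge repeats, drop blanks."""
        out, prev = [], None
        for ch in alignment:
            if ch != prev and ch != BLANK:
                out.append(ch)
            prev = ch
        return "".join(out)

    assert collapse("TTTHEE") == "THE"
    assert collapse("THEEEE") == "THE"
    assert collapse("_T_H_E_") == "THE"

    # CTC sums the probability of every alignment that collapses to "THE".
    # Enumerating them is exponential in the number of timesteps, which is
    # why the loss is computed with a dynamic program instead.
    alignments = ["".join(a) for a in itertools.product("THE" + BLANK, repeat=6)
                  if collapse(a) == "THE"]
    print(len(alignments), "length-6 alignments collapse to 'THE'")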

Page 7: warp-ctc

• Recently open sourced our CTC implementation

• Efficient, parallel CPU and GPU backend

• 100-400X faster than other implementations

• Apache license, C interface: https://github.com/baidu-research/warp-ctc

Page 8: Accuracy scales with Data

[Plot: performance vs. data & model size; deep learning algorithms keep improving with more data and bigger models, while many previous methods plateau]

• 40% error reduction for each 10x increase in dataset size
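To make that trend concrete, here is a toy calculation (the starting error rate is hypothetical): a 40% relative reduction per 10x more data means err(10n) = 0.6 × err(n), i.e. error falls roughly as n to the power log10(0.6) ≈ -0.22.

    # Illustrative only: assume a hypothetical 30% starting error rate.
    err = 30.0
    for decade in range(1, 4):              # 10x, 100x, 1000x more data
        err *= 0.6                          # 40% relative reduction per decade
        print(f"{10 ** decade:>5}x data -> {err:.1f}% error")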

Page 9: Training sets

• Train on ~1½ years of data (and growing)

• English and Mandarin

• End-to-end deep learning is key to assembling large datasets

• Datasets drive accuracy

Page 10: Large Datasets = Large Models

[Plot: accuracy vs. dataset size; a big model keeps improving as data grows, while a small model saturates]

• Models require over 20 Exa-flops to train (exa = 10^18)

• Trained on 4+ Terabytes of audio
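Combining the 20 exaflops on this slide with the roughly 50 teraflop/s sustained rate quoted on the conclusion slide gives a back-of-the-envelope wall-clock time for one training run:

    total_work = 20e18          # ~20 exaflops to train one model (this slide)
    sustained = 50e12           # ~50 teraflop/s sustained on one model (conclusion slide)
    seconds = total_work / sustained
    print(f"about {seconds / 86400:.1f} days per training run")   # ≈ 4.6 days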

Page 11: Virtuous Cycle of Innovation

[Cycle diagram: Perform Experiment → Learn → Design New Experiment → Iterate]

Page 12: Experiment Scaling

• Batch Norm impact with deeper networks

• Sequence-wise normalization (see the sketch below)
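A minimal sketch of one common reading of sequence-wise normalization (an illustration, not Baidu's implementation): for a recurrent layer, the statistics are computed over both the minibatch and the time dimension, then a learned per-feature scale and shift are applied.

    import numpy as np

    def sequence_wise_norm(x, gamma, beta, eps=1e-5):
        """Normalize pre-activations of shape (time, batch, features)
        using statistics taken over the time and batch axes together."""
        mean = x.mean(axis=(0, 1), keepdims=True)
        var = x.var(axis=(0, 1), keepdims=True)
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    x = np.random.randn(50, 32, 256)          # 50 timesteps, batch of 32, 256 features
    gamma = np.ones((1, 1, 256))              # learned scale (initialized to 1)
    beta = np.zeros((1, 1, 256))              # learned shift (initialized to 0)
    y = sequence_wise_norm(x, gamma, beta)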

Page 13: Parallelism across GPUs

[Diagram: model parallelism splits one model across GPUs; data parallelism runs a full replica per GPU, each on its own shard of the training data, combining gradients with MPI_Allreduce()]

For these models, Data Parallelism works best
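A hedged sketch of the data-parallel pattern in the diagram (using mpi4py for illustration; the production system operates on raw GPU buffers): every worker holds a full model replica, computes gradients on its own shard of the data, and the gradients are summed across workers with an all-reduce before each weight update.

    # Run with e.g.: mpirun -np 8 python data_parallel_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nworkers = comm.Get_size()

    weights = np.zeros(1000, dtype=np.float32)   # every worker holds the same replica
    lr = 1e-3

    for step in range(100):
        # Stand-in for the forward/backward pass on this worker's minibatch.
        local_grad = np.random.randn(*weights.shape).astype(np.float32)
        # Sum gradients across all workers in place (MPI_Allreduce), then average.
        comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
        weights -= lr * (local_grad / nworkers)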

Page 14: Performance for RNN training

• 55% of GPU FMA peak using a single GPU

• ~48% of peak using 8 GPUs in one node

• Weak scaling is very efficient, although the larger batches it implies are algorithmically challenging

[Plot: sustained TFLOP/s (log scale, 1 to 512) vs. number of GPUs (1 to 128), covering single-node and multi-node runs; the typical training run regime is marked]

Page 15: All-reduce

• We implemented our own all-reduce out of send and receive

• Several algorithm choices based on size

• Careful attention to affinity and topology
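The slide doesn't spell out which algorithms are chosen, so the following is only a toy simulation of one standard choice for large messages, a ring-style all-reduce built from pairwise sends and receives: a reduce-scatter pass leaves each worker with one fully summed chunk, and an all-gather pass circulates those chunks until everyone holds the complete result.

    import numpy as np

    def ring_allreduce(grads):
        """Simulate P workers summing their gradient vectors around a ring."""
        P = len(grads)
        chunks = [np.array_split(g.astype(np.float64), P) for g in grads]

        # Reduce-scatter: after P-1 steps, worker i holds the full sum of chunk (i+1) % P.
        for s in range(P - 1):
            for i in range(P):
                c = (i - s - 1) % P                       # chunk arriving at worker i this step
                chunks[i][c] = chunks[i][c] + chunks[(i - 1) % P][c]

        # All-gather: circulate the reduced chunks, overwriting (no further sums).
        for s in range(P - 1):
            for i in range(P):
                c = (i - s) % P
                chunks[i][c] = chunks[(i - 1) % P][c].copy()

        return [np.concatenate(c) for c in chunks]

    grads = [np.random.randn(10).astype(np.float32) for _ in range(4)]
    result = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in result)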

Page 16: Scalability

• Batch size is hard to increase – algorithm, memory limits

• Performance at small batch sizes (32, 64) leads to scalability limits
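Illustrative arithmetic only (the 512 global batch cap is an assumption, not a number from the slides): once the total batch stops growing, adding GPUs under data parallelism shrinks each GPU's share toward the 32 to 64 range where per-GPU efficiency falls off.

    GLOBAL_BATCH = 512                      # assumed cap, for illustration
    for gpus in (1, 2, 4, 8, 16):
        per_gpu = GLOBAL_BATCH // gpus
        print(f"{gpus:>2} GPUs -> per-GPU batch {per_gpu}")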

Page 17: Precision

• FP16 also mostly works (see the sketch below)

– Use FP32 for softmax and weight updates

• More sensitive to labeling error

[Histogram: weight distribution; count (log scale) vs. magnitude (log scale, exponents -31 to 0)]
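A minimal sketch (an illustration, not Baidu's training code) of the precision split described above: the bulk of the compute can run in FP16, but the weight update is applied to an FP32 master copy (the slides also keep the softmax in FP32) so small updates are not rounded away.

    import numpy as np

    master_w = np.random.randn(1024, 1024).astype(np.float32)   # FP32 master weights
    lr = 1e-3

    for step in range(10):
        w_fp16 = master_w.astype(np.float16)                    # FP16 copy used for compute
        # Stand-in for an FP16 backward pass producing small gradients.
        grad_fp16 = (np.random.randn(*w_fp16.shape) * 1e-4).astype(np.float16)
        # Apply the update in FP32; many of these small updates would round to
        # zero if accumulated directly into FP16 weights.
        master_w -= lr * grad_fp16.astype(np.float32)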

Page 18: Conclusion

• We have to do experiments at scale

• Pushing compute scaling for end-to-end deep learning

• Efficient training for large datasets

– 50 Teraflops/second sustained on one model

– 20 Exaflops to train each model

• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides
