Jinyu Li, Microsoft

Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

Apr 11, 2017

Transcript
Page 1: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

Jinyu Li, Microsoft

Page 2: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 3: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Review the deep learning trends for automatic speech recognition (ASR) in industry
◦ Deep Neural Network (DNN)
◦ Long Short-Term Memory (LSTM)
◦ Connectionist Temporal Classification (CTC)

} Describe selected key technologies that make deep learning models more effective in a production environment

Page 4: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 5: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: ASR pipeline — input speech s(n) carrying word sequence W ("Hey Cortana") goes through Feature Analysis (Spectral Analysis) to produce features X_n; Pattern Classification (Decoding, Search) combines the Acoustic Model (HMM), Word Lexicon, and Language Model to hypothesize the word sequence W, followed by Confidence Scoring (e.g., 0.9, 0.8).]
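The diagram realizes the standard probabilistic formulation of ASR (the usual textbook decomposition; it is not spelled out on the slide):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \; \underbrace{p(X \mid W)}_{\text{acoustic model}}
                           \; \underbrace{P(W)}_{\text{language model}}
```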

Page 6: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: same ASR pipeline as on the previous slide.]

Page 7: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Word sequence: Hey Cortana
} Phone sequence: hh ey k ao r t ae n ax
} Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r ao-r+t r-t+ae t-ae+n ae-n+ax n-ax+sil
} Every triphone is then modeled by a three-state HMM: sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ..., n-ax+sil[3]. The key problem is how to evaluate the state likelihood given the speech signal.
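As a small illustration of the expansion above, a sketch in Python (`to_triphones` is a hypothetical helper, not code from the talk):

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-right context triphones,
    padding with silence at the utterance boundaries."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# "Hey Cortana" -> hh ey k ao r t ae n ax
print(to_triphones("hh ey k ao r t ae n ax".split()))
# ['sil-hh+ey', 'hh-ey+k', 'ey-k+ao', 'k-ao+r', 'ao-r+t',
#  'r-t+ae', 't-ae+n', 'ae-n+ax', 'n-ax+sil']
```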

Page 8: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 9: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: HMM state sequence sil-hh+ey[1] → sil-hh+ey[2] → sil-hh+ey[3] → hh-ey+k[1] → … → n-ax+sil[3].]

Page 10: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


Page 11: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


Page 12: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


Page 13: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} ZH-CN (Mandarin Chinese) is improved by a 32% relative character error rate reduction within one year!

[Bar chart: ZH-CN relative improvement in character error rate (CERR, %; y-axis 0–35) for GMM, MFCC CE DNN, LFB CE DNN, and LFB SE DNN systems.]

CE: Cross-Entropy training. SE: SEquence training.

Page 14: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 15: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

DNNs process speech frames independently:

$h_t = \sigma(W_{hx}\, x_t + b)$

Page 16: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

RNNs consider the temporal relation over speech frames:

$h_t = \sigma(W_{hx}\, x_t + W_{hh}\, h_{t-1} + b)$

Vulnerable to vanishing and exploding gradients.

Page 17: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

Memory cells store the history information.

Various gates control the information flow inside the LSTM.

Advantageous in learning both long- and short-term temporal dependencies.
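For reference, the gate equations of a standard LSTM cell (the common peephole-free formulation; the slide shows only the diagram):

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) && \text{(memory cell)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```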

Page 18: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Bar chart: relative WER reduction of LSTM over DNN (y-axis 0–20%) on the SMD2015, VS2015, MobileC, Mobile, and Win10C test sets.]

Page 19: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 20: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 21: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

The HMM/GMM or HMM/DNN pipeline is highly complex:
◦ Multiple training stages: CI phone, CD senones, …
◦ Various resources: lexicon, decision-tree questions, …
◦ Many hyper-parameters: number of senones, number of Gaussians, …

[Diagram: training pipeline — CI phone → CD senone → GMM → hybrid DNN/LSTM.]

Page 22: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: the ASR pipeline from Page 5, shown again.]

Page 23: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

The HMM/GMM or HMM/DNN pipeline is highly complex:
◦ Multiple training stages: CI phone, CD senones, …
◦ Various resources: lexicon, decision-tree questions, …
◦ Many hyper-parameters: number of senones, number of Gaussians, …

LM building also requires a huge amount of data and a complicated process.

Writing an efficient decoder requires experts with years of experience.

Page 24: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


Page 25: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: speech → End-to-End Model → "Hey Cortana".]

} ASR is a sequence-to-sequence learning problem.
} A simpler paradigm with a single model (and training stage) is desired.

Page 26: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} CTC is a sequence-to-sequence learning method used to map speech waveforms directly to characters, phonemes, or even words.

} CTC paths differ from label sequences in that they:
◦ allow repetitions of non-blank labels;
◦ add the blank ∅ as an additional label, meaning no (actual) label is emitted.

[Diagram: the label sequence z = A B C (over observation frames X) expands to CTC paths such as "A A ∅ ∅ B C ∅", "∅ A A B ∅ C C", and "∅ ∅ ∅ A B C ∅"; collapsing any of these paths recovers A B C.]
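A minimal sketch of the collapse operation (a hypothetical helper, not code from the talk): merge consecutive repeats first, then drop blanks.

```python
def ctc_collapse(path, blank="∅"):
    """Collapse a CTC path: merge consecutive repeats, drop blanks.
    A repeat separated by a blank (e.g. A ∅ A) is kept as two labels."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out

# The three expanded paths from the slide all collapse back to A B C:
for path in ("AA∅∅BC∅", "∅AAB∅CC", "∅∅∅ABC∅"):
    assert ctc_collapse(list(path)) == ["A", "B", "C"]
```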

Page 27: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: CTC model — an LSTM stack unrolled over frames t-1, t, t+1, with a softmax output layer over words plus the blank label ∅.]

Page 28: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Directly from speech to text: no language model, no decoder, no lexicon, …

Page 29: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 30: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Reduce runtime cost without accuracy loss

} Adapt to speakers with low footprints

} Reduce the accuracy gap between large and small deep networks

} Enable languages with limited training data

Page 31: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Xue13]

Page 32: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} The runtime cost of a DNN is much larger than that of a GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of the DNN in order to ship it.

Page 33: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} The runtime cost of a DNN is much larger than that of a GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of the DNN in order to ship it.

} We propose a new DNN structure that takes advantage of the low-rank property of the DNN model to compress it.

Page 34: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} How to reduce the runtime cost of the DNN? SVD!

} It also enables speaker personalization & AM modularization.

Page 35: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

$$A_{m\times n} = U_{m\times n}\,\Sigma_{n\times n}\,V_{n\times n}^{T} =
\begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{pmatrix}
\begin{pmatrix} \epsilon_{11} & & 0 \\ & \ddots & \\ 0 & & \epsilon_{nn} \end{pmatrix}
\begin{pmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{n1} & \cdots & v_{nn} \end{pmatrix}$$

$\Sigma$ is diagonal, with the singular values $\epsilon_{11} \ge \dots \ge \epsilon_{kk} \ge \dots \ge \epsilon_{nn}$ in decreasing order; keeping only the largest $k$ gives a low-rank approximation of $A$.

Page 36: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Number of parameters: mn → (m+n)k.
} Runtime cost: O(mn) → O((m+n)k).
} E.g., m = 2048, n = 2048, k = 192: ~80% runtime cost reduction.
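A NumPy sketch of this low-rank restructuring (a generic illustration of the technique, using the sizes from the slide):

```python
import numpy as np

def svd_compress(W, k):
    """Replace one m x n weight matrix by two low-rank factors,
    keeping only the k largest singular values: W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # m x k (singular values folded in)
    B = Vt[:k, :]          # k x n
    return A, B

m, n, k = 2048, 2048, 192
A, B = svd_compress(np.random.randn(m, n), k)
# parameters: m*n ~= 4.2M  ->  (m+n)*k ~= 0.8M, i.e. ~80% fewer
```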

Page 37: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 38: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 39: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Singular Value Decomposition

Page 40: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: frame skipping at runtime — for both the DNN model and the LSTM model, the network is evaluated on frames x_{t-1} and x_{t+1}, and the output for the skipped frame x_t is copied from the neighboring evaluation.]

Page 41: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

Split training utterances through frame skipping:

[Diagram: x1 x2 x3 x4 x5 x6 → (x1 x3 x5) and (x2 x4 x6).]

When skipping 1 frame, odd and even frames are picked as separate utterances; frame labels are selected accordingly.
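A sketch of the training-data split (hypothetical helper names):

```python
def split_by_frame_skipping(frames, labels, skip=1):
    """Split one utterance into skip+1 sub-utterances by taking every
    (skip+1)-th frame; labels travel with their frames. With skip=1
    this yields the odd-frame and even-frame streams from the slide."""
    step = skip + 1
    return [(frames[off::step], labels[off::step]) for off in range(step)]

frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["s1", "s2", "s3", "s4", "s5", "s6"]
print(split_by_frame_skipping(frames, labels))
# [(['x1','x3','x5'], ['s1','s3','s5']), (['x2','x4','x6'], ['s2','s4','s6'])]
```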

Page 42: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Xue 14]

Page 43: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Speaker personalization with a deep model creates a storage-size issue: it is not practical to store an entire deep model for each individual speaker during deployment.

Page 44: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Speaker personalization with a DNN model creates a storage-size issue: it is not practical to store an entire DNN model for each individual speaker during deployment.

} We propose a low-footprint DNN personalization method based on the SVD structure.
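One common low-footprint realization, consistent with the SVD structure above (a sketch under that assumption, not necessarily the exact production recipe): insert a small k×k matrix between the two low-rank factors, initialize it to identity, and adapt only that matrix per speaker.

```python
import numpy as np

def personalized_layer(x, A, B, S):
    """One SVD-restructured layer with a per-speaker matrix S.
    A (m x k) and B (k x n) are shared and frozen; only the
    k x k matrix S is trained on the speaker's data."""
    return A @ (S @ (B @ x))

k = 192
S = np.eye(k)  # speaker-independent start: identity (no change to the layer)
# per-speaker storage: k*k = 36,864 values per layer, instead of m*n
```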

Page 45: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 46: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

Adapting with 100 utterances:

                      Relative WER reduction (%)   Number of parameters (M)
Full-size DNN          0                            30
SVD DNN                0.36                         7.4
Standard adaptation   18.64                         7.4
SVD adaptation        20.86                         0.26

Page 47: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 48: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} SVD matrices are used to reduce the number of DNN parameters and the CPU cost.

} Quantization is used so that SSE evaluation can do single-instruction, multiple-data processing.

} Frame skipping is used to remove the evaluation of some frames.
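A generic sketch of weight quantization for SIMD evaluation (the exact production scheme is not given in the talk; this is a plain symmetric linear quantizer):

```python
import numpy as np

def quantize_int8(W):
    """Map float weights to int8 with one scale per matrix, so SSE-style
    SIMD instructions can multiply many weights per instruction."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def dequantize(Wq, scale):
    return Wq.astype(np.float32) * scale

Wq, scale = quantize_int8(np.random.randn(4, 4).astype(np.float32))
```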

Page 49: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} The industry has a strong interest in having DNN systems on devices due to the increasingly popular mobile scenarios.

} Even with the technologies mentioned above, the large computational cost is still very challenging given the limited processing power of devices.

} A common way to fit a CD-DNN-HMM on devices is to reduce the DNN model size by
◦ reducing the number of nodes in hidden layers
◦ reducing the number of targets in the output layer

Page 50: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation.

} The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy.

} The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN.

[Diagram: a large-size DNN next to a small-size DNN.]

Page 51: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

◦ Use the standard DNN training method to train a large-size teacher DNN with transcribed data.

◦ Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN with a large amount of un-transcribed data.
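A NumPy sketch of the second step: the KL divergence between the teacher's and student's frame-level senone posteriors (shapes are illustrative assumptions; logits are [frames, senones]):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, eps=1e-12):
    """Mean KL(teacher || student) over frames: the student learns to
    match the teacher's output distribution, so no transcription is
    needed for this step."""
    p = softmax(teacher_logits)   # teacher posteriors (soft targets)
    q = softmax(student_logits)   # student posteriors
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```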

Page 52: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} 2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN.

} The footprint is further reduced to 0.5 million parameters when combined with SVD.

[Chart: accuracy of the teacher DNN trained with standard sequence training, the small-size DNN trained with standard sequence training, and the student DNN trained with output-distribution learning.]

Page 53: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Huang 13]

Page 54: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Develop a new language in a new scenario with a small amount of training data.

Page 55: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Develop a new language in a new scenario with a small amount of training data.

} Leverage resource-rich languages to develop high-quality ASR for resource-limited languages.

Page 56: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
Page 57: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Diagram: multilingual transfer DNN — input layer: a window of acoustic feature frames; many hidden layers acting as a shared feature transformation; a new output layer of new-language senones, trained on new-language training or testing samples.]
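A PyTorch sketch of the setup in the diagram (layer sizes and the choice to freeze the shared layers are illustrative assumptions, not the production configuration):

```python
import torch.nn as nn

# Hidden layers trained on resource-rich languages act as the shared
# feature transformation; a fresh softmax layer over the new language's
# senones is trained on the new-language data.
shared = nn.Sequential(
    nn.Linear(440, 2048), nn.Sigmoid(),    # input: window of feature frames
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),   # "many hidden layers"
)
for p in shared.parameters():
    p.requires_grad = False                # freeze the shared transformation

new_output = nn.Linear(2048, 6000)         # new-language senones
model = nn.Sequential(shared, new_output)  # train only new_output
```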

Page 58: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

[Bar chart: relative error reduction (y-axis 0–25%) as a function of the amount of new-language training data: 3 hrs, 9 hrs, 36 hrs, 139 hrs.]

Page 59: Deep Learning for Speech Recognition in Cortana at AI NEXT Conference