Speech Synthesis: WaveNet (D4L3 Deep Learning for Speech and Language UPC 2017)

Feb 07, 2017

Xavier Giro
Transcript
Page 1: Speech Synthesis: WaveNet (D4L3 Deep Learning for Speech and Language UPC 2017)

[course site]

Day 4 Lecture 3

Speech Synthesis: WaveNet

Antonio Bonafonte

Page 3

Deep architectures … but not deep (yet)

Text to Speech: Textual features → Spectrum of speech (many coefficients)

[Figure: TXT → designed feature extraction → ft 1, ft 2, ft 3 → regression module → s1, s2, s3 → wavegen; two parts of the pipeline are labelled “Hand-crafted” features]

Page 4

Text-to-Speech using WaveNet

[Figure: TXT → designed feature extraction → ft 1, ft 2, ft 3 → W; the extracted features are labelled “Hand-crafted” features]

Page 5

Introduction

● Based on PixelCNN
● Generative model operating directly on audio samples
● Objective: factorised joint probability

● Stack of convolutional networks
● Output: categorical distribution → softmax
● Hyperparameters & overfitting controlled on validation set
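
The factorised joint probability named in the objective can be written out as below. This is the form given in the WaveNet paper (van den Oord et al., 2016); the symbols x_t (audio sample at time t) and T (number of samples) are notation introduced here, since the slide's own equation did not survive in the transcript.

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Each factor is the categorical (softmax) distribution over the next sample given all previous ones.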

Page 6

High resolution signal and long term dependencies

Page 7

Autoregressive model

DPCM decoder: next sample is (almost) reconstructed from linear causal convolution of past samples
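
As an illustration of the DPCM analogy, here is a minimal sketch of a linear causal predictor with residual coding. The function names and the coefficient values are hypothetical, chosen only for the example. Without quantisation of the residual the reconstruction is exact; a real DPCM codec quantises the residual, which is why the slide says "almost".

```python
def dpcm_encode(samples, coeffs):
    """Return prediction residuals e[t] = x[t] - sum_k a_k * x[t-k]."""
    residuals = []
    for t, x in enumerate(samples):
        # Linear causal prediction from past samples only (zeros before the start).
        pred = sum(a * samples[t - k]
                   for k, a in enumerate(coeffs, start=1) if t - k >= 0)
        residuals.append(x - pred)
    return residuals


def dpcm_decode(residuals, coeffs):
    """Rebuild samples one at a time: x[t] = prediction from the past + e[t]."""
    samples = []
    for e in residuals:
        pred = sum(a * samples[-k]
                   for k, a in enumerate(coeffs, start=1) if len(samples) >= k)
        samples.append(pred + e)
    return samples


coeffs = [2.0, -1.0]            # hypothetical second-order predictor
signal = [1.0, 3.0, 4.0, 2.0]   # toy signal
residuals = dpcm_encode(signal, coeffs)
restored = dpcm_decode(residuals, coeffs)
```

WaveNet keeps the same autoregressive structure but replaces the fixed linear predictor with a deep non-linear network that outputs a full distribution over the next sample.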

Page 8

Dilated causal convolutions

Stacked dilated convolutions, e.g. dilations 1, 2, 4, . . . , 512, 1, 2, 4, . . . , 512, 1, 2, 4, . . . , 512

Receptive field: ≈ 1024 × 3 = 3072 samples → 192 ms (at 16 kHz)
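
The receptive-field figure can be checked with a few lines of arithmetic. This sketch assumes a filter width of 2 per layer (an assumption, not stated on the slide), in which case the exact count is slightly below the rounded 1024 × 3.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    # A layer with dilation d reaches (kernel_size - 1) * d samples further back.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, three times
n_samples = receptive_field(dilations)
duration_ms = 1000.0 * n_samples / 16000      # 16 kHz sampling rate
print(n_samples, duration_ms)                 # 3070 191.875
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, whereas undilated layers of the same width would only add one sample each.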

Page 9

Dilated causal convolutions

In training: all convolutions can be done in parallel
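
A minimal sketch of one dilated causal convolution layer with filter width 2 shows why training parallelises. The function name and weights are hypothetical. Every output depends only on input samples that are already known during training, so no output has to wait for another.

```python
def causal_dilated_conv(x, w_past, w_now, dilation):
    """y[t] = w_past * x[t - dilation] + w_now * x[t], with zeros before the start."""
    # Each y[t] reads only the input x, never another y, so all time steps
    # could be computed at once on parallel hardware.
    return [
        w_past * (x[t - dilation] if t >= dilation else 0.0) + w_now * x[t]
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = causal_dilated_conv(x, w_past=1.0, w_now=1.0, dilation=2)

# Causality check: changing a later input leaves earlier outputs untouched.
x_changed = x[:5] + [100.0]
y_changed = causal_dilated_conv(x_changed, w_past=1.0, w_now=1.0, dilation=2)
```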

Page 10

Dilated causal convolutions

Generating: predictions are sequential (~2 min of computation per second of audio)
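
The sequential bottleneck can be sketched as a loop in which every new sample needs all earlier outputs. Here `toy_predictor` is a hypothetical stand-in for a full network forward pass, and the per-sample cost assumes the 16 kHz rate used earlier in the slides.

```python
def generate(predict_next, seed, n_samples):
    """Autoregressive generation: each step consumes the samples produced so far."""
    samples = list(seed)
    for _ in range(n_samples):
        # This call cannot start until the previous sample exists.
        samples.append(predict_next(samples))
    return samples[len(seed):]


def toy_predictor(history):
    # Hypothetical stand-in for the network: returns one of 256 classes.
    return (history[-1] + history[-2]) % 256


generated = generate(toy_predictor, seed=[1, 1], n_samples=5)

# About 2 minutes of computation for 16000 samples implies this cost per sample:
ms_per_sample = 1000.0 * 120 / 16000   # 7.5 ms
```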

Page 11

Modeling pdf

● Not MSE
● Not Mixture Density Networks (MDN)
● But categorical distribution, softmax (classification problem)
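
Treating the next sample as a class turns training into ordinary classification. The sketch below, with hypothetical function names and a toy 3-class example instead of 256 classes, shows the softmax over network outputs and the negative log-likelihood (cross-entropy) of the observed class.

```python
import math


def softmax(logits):
    """Turn unnormalised scores into a categorical distribution."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, target):
    """Negative log-likelihood of the observed class (the training loss per sample)."""
    return -math.log(softmax(logits)[target])


probs = softmax([0.0, 1.0, 2.0])
loss = nll([0.0, 1.0, 2.0], target=2)
```

Unlike an MSE regression, the categorical output makes no assumption about the shape of the distribution, so it can represent multimodal predictions.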

Page 12

Modeling pdf

“A softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values)” (Van den Oord et al. 2016)

Signal represented using μ-law companding: 16 bits → 8 bits (256 categories)
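
A minimal sketch of μ-law companding followed by 8-bit quantisation, using the standard compression formula sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255. The function names and the exact rounding used to map to class indices are assumptions made for the example; input samples are assumed to be scaled to [-1, 1].

```python
import math

MU = 255  # 256 quantisation levels


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class index.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Map a class index in [0, mu] back to an approximate sample in [-1, 1]."""
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


quiet = mu_law_decode(mu_law_encode(0.01))   # small amplitudes keep fine resolution
```

The logarithmic mapping spends more of the 256 levels on low amplitudes, where most speech samples lie, than a uniform 8-bit quantiser would.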

Page 13

Gated Activation Units

Residual Learning
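
The slide's equations did not survive in the transcript. For reference, the gated activation unit as defined in the WaveNet paper (van den Oord et al., 2016) is given below, together with the generic form of a residual connection. Here * is convolution, ⊙ is element-wise multiplication, σ is the sigmoid, k is the layer index, and f and g denote filter and gate; F stands for whatever transformation the layer applies and is notation introduced here.

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x}\right) \odot \sigma\left(W_{g,k} * \mathbf{x}\right)

\mathbf{y} = \mathbf{x} + F(\mathbf{x})
```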

Page 14

Architecture

Page 15

Conditional WaveNet

They show results with h:
● Speaker ID
● Music genre, instrument
● TTS: linguistic features + F0 (a duration model is needed to switch the condition from phoneme to phoneme)
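
For reference, conditioning on h changes the model as follows in the WaveNet paper (van den Oord et al., 2016). The second equation is the global-conditioning form, where h is constant over the utterance (for example a speaker ID) and V denotes learned projection weights. For time-varying conditions such as linguistic features, the paper replaces the projection of h with a convolution over a series upsampled to the audio rate.

```latex
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}, \mathbf{h}\right)

\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{\top} \mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{\top} \mathbf{h}\right)
```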

Page 16

Results

Page 18

Discussion

● WaveNet: deep generative model of audio samples
● Convolutional nets: faster to train than RNNs
● Outperforms the best baseline TTS systems in the reported listening tests
● Autoregressive model: generation is sequential

“GANs were designed to be able to generate all of x in parallel, yielding greater generation speed”

Ian Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks