Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI National Institute of Informatics, Japan & Aalto University, Finland 2018-04-17 1 contact: [email protected] we welcome critical comments, suggestions, and discussion ICASSP 2018 Calgary, Canada
Page 1: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...

Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI
National Institute of Informatics, Japan & Aalto University, Finland
2018-04-17
Contact: [email protected]
We welcome critical comments, suggestions, and discussion.
ICASSP 2018, Calgary, Canada

Page 2
OVERVIEW
❑ Motivation
  • Better modules for the statistical parametric speech synthesis (SPSS) framework?
❑ Method
  • Plug in and test new acoustic models and waveform generators
❑ Results
  • Best combination: autoregressive (AR) acoustic models + WaveNet-based vocoder
  • Quality: as good as vocoded speech (at 16 kHz)

Page 3
CONTENTS
❑ Introduction
❑ Models and modules
❑ Experiments
❑ Summary

Page 4
INTRODUCTION
Background
❑ Conventional TTS pipeline [1]
  Text → Front-end → Linguistic features → Back-end → Speech
❑ SPSS back-end [2, 3]
  Linguistic features → Acoustic models → Spectral features (MGC & BAP) and F0 → Waveform generators → Speech
  • MGC: Mel-generalized cepstral coefficients [4]
  • BAP: band aperiodicity

[1] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda, et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
[3] H. Zen, et al. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.

Page 5
INTRODUCTION
Topic of this work
❑ Better modules for the SPSS back-end?
  Linguistic features → Acoustic models → MGC & BAP and F0 → Waveform generators → Speech
  • Acoustic models: recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial network (GAN)
  • Waveform generators: WORLD vocoder, WORLD + phase recovery, WaveNet-based vocoder
  • MGC: Mel-generalized cepstral coefficients
  • BAP: band aperiodicity

Page 6
CONTENTS
❑ Introduction
❑ Models and modules
❑ Experiments
❑ Summary

Page 7
MODELS & METHODS
[Diagram: Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms]

Page 8
MODELS & METHODS
Acoustic models
❑ Baseline RNN
  • Input: sequence of linguistic features $\{x_1, \cdots, x_t, \cdots\}$
  • Output: sequence of generated acoustic features $\{\hat{o}_1, \cdots, \hat{o}_t, \cdots\}$
[Diagram: $x_1, \dots, x_5$ mapped to $\hat{o}_1, \dots, \hat{o}_5$]

Page 9
MODELS & METHODS
Acoustic models
❑ Baseline RNN
  • Neural network: $\mathcal{H}^{(\mathrm{RNN})}_{\Theta}(\cdot)$
  • Probabilistic model [5]:

$$\mathcal{M}_t = \{\mu_t\}, \quad \text{where } \mu_t = \mathcal{H}^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$$

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$

$$\hat{o}_t = \mu_t$$

[Diagram: $x_1, \dots, x_5$ → $\mathcal{M}_1, \dots, \mathcal{M}_5$ → $o_1, \dots, o_5$]

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
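
With an identity covariance, the negative log-likelihood of the model above differs from half the summed squared error only by a constant, so maximum-likelihood training of the baseline RNN amounts to least-squares regression. The following is a minimal sketch of that identity; the function names and the toy numbers are illustrative assumptions, not taken from the slides.

```python
import math

def gaussian_nll(o, mu):
    """Negative log-likelihood of a sequence under prod_t N(o_t; mu_t, I)."""
    nll = 0.0
    for o_t, mu_t in zip(o, mu):
        dim = len(o_t)
        sq = sum((a - b) ** 2 for a, b in zip(o_t, mu_t))
        # 0.5 * squared distance plus the Gaussian normalizer for unit covariance
        nll += 0.5 * sq + 0.5 * dim * math.log(2.0 * math.pi)
    return nll

def half_squared_error(o, mu):
    """Half of the summed squared error between targets and predicted means."""
    return 0.5 * sum((a - b) ** 2
                     for o_t, mu_t in zip(o, mu)
                     for a, b in zip(o_t, mu_t))

# Toy data (hypothetical): T = 3 frames, D = 2 dimensions.
o = [[0.2, -1.0], [0.5, 0.3], [1.1, 0.0]]
mu = [[0.0, -0.8], [0.4, 0.1], [1.0, 0.2]]

# The gap between the two criteria does not depend on mu.
gap = gaussian_nll(o, mu) - half_squared_error(o, mu)
const = 0.5 * 2 * 3 * math.log(2.0 * math.pi)
```

Because the gap is a constant, both criteria share the same minimizer, which is why the generated frame is simply the predicted mean.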

Page 10
MODELS & METHODS
Acoustic models
❑ Baseline RNN

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$

  • Limitations
    1. Conditional independence → AR models
    2. Maximum-likelihood training → GAN

Page 11
MODELS & METHODS
Acoustic models
❑ Shallow AR (SAR) [6]

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f(o_{t-K:t-1}), I)$$

  • Alternative interpretation: trainable filter + RNN
[Diagram: $x_1, \dots, x_5$ → $\mathcal{M}_1, \dots, \mathcal{M}_5$ → $o_1, \dots, o_5$, with each $o_t$ also depending on the previous $K = 2$ outputs]

[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.
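
The SAR recursion can be illustrated with a short generation loop. This sketch assumes that f is linear with fixed coefficients a_k (the form given in the appendix on normalizing flows), that features are scalar, and that generation outputs the mean at every step; `sar_generate` and the numbers are hypothetical.

```python
def sar_generate(mu, a):
    """Mean-based SAR generation: o_hat[t] = mu[t] + sum_k a[k-1] * o_hat[t-k].

    mu: per-frame means from the RNN (scalars here for simplicity).
    a:  AR coefficients a_1..a_K (assumed linear and time-invariant).
    """
    order = len(a)
    o_hat = []
    for t, mu_t in enumerate(mu):
        # Feedback from previously generated frames; frames before t = 0 are skipped.
        feedback = sum(a[k - 1] * o_hat[t - k]
                       for k in range(1, order + 1) if t - k >= 0)
        o_hat.append(mu_t + feedback)
    return o_hat

# An impulse in mu decays through the AR feedback, like an all-pole filter response.
mu = [1.0, 0.0, 0.0, 0.0]
a = [0.5, 0.25]  # K = 2, as in the slide's diagram
o_hat = sar_generate(mu, a)
```

The impulse response shows the "trainable filter + RNN" reading: the RNN provides the excitation and the AR part shapes it over time.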

Page 12
MODELS & METHODS
Acoustic models
❑ Deep AR (DAR) [7]

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$

  • Only for (quantized) F0 modeling
[Diagram: $x_1, \dots, x_5$ → $\mathcal{M}_1, \dots, \mathcal{M}_5$ → $o_1, \dots, o_5$, with feedback from all previous outputs]

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)

Page 13
MODELS & METHODS
Acoustic models
❑ GAN
  • GAN-based post-filter [8]
[Diagram: Linguistic features → Acoustic model → generated acoustic features; a residual generator (with noise input) adds a residual to the generated features; a discriminator compares the result with natural acoustic features and outputs 0/1]

[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, pages 4910–4914, 2017.

Page 14
MODELS & METHODS
Acoustic models
[Diagram: Linguistic features → acoustic models (SAR, RNN, DAR, GAN) producing F0 and MGC → waveform generator (WORLD) → Speech waveforms]
  • BAP is not shown

Page 15
MODELS & METHODS
Acoustic models
[Diagram: same pipeline with the WORLD vocoder; the resulting systems are labeled SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo]
  • BAP is not shown

Page 16
MODELS & METHODS
Waveform generators
❑ Deterministic approaches
  • WORLD [9]
    o Binary voicing decision
    o Minimum phase
  • A log domain pulse model (PML) [10]
    o Source-filter model, additive in log-domain
    o Binary noise mask
  • WORLD + phase recovery [11]
    Generated waveform → STFT amplitude → phase recovery → inverse STFT → "phase-recovered" waveform

[9] M. Morise, et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[10] G. Degottex, et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2):236–243, 1984.

Page 17
MODELS & METHODS
Waveform generators
❑ WaveNet-based vocoder [12, 13]
  • How to generate (search) a waveform:
    1. Exploration: sampling
    2. Exploitation: picking one-best

[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, pages 1118–1122, 2017.

Page 18
MODELS & METHODS
Waveform generators
❑ WaveNet-based vocoder
[Figure: output probability over waveform levels at consecutive sampling points, one panel for an unvoiced segment and one for a voiced region]
  • Sampling in unvoiced regions
  • Picking one-best in voiced regions
  • Less distortion of harmonics ☛ appendix & paper
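
The mixed generation strategy can be sketched as a per-sample choice between the most probable level and a random draw. This is a minimal illustration only: `pick_level`, the three-level distribution, and the seeded generator are assumptions for the example, and a real WaveNet output layer would produce one distribution per sampling point.

```python
import random

def pick_level(probs, voiced, rng):
    """Choose a waveform level from a categorical distribution.

    Voiced region: exploitation, take the one-best (argmax) level.
    Unvoiced region: exploration, draw a random sample from probs.
    """
    levels = range(len(probs))
    if voiced:
        return max(levels, key=probs.__getitem__)
    return rng.choices(levels, weights=probs, k=1)[0]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
probs = [0.1, 0.6, 0.3]     # hypothetical softmax output over 3 levels

voiced_pick = pick_level(probs, True, rng)
unvoiced_picks = [pick_level(probs, False, rng) for _ in range(1000)]
```

One-best selection always returns the same level for a given distribution, while sampling keeps the noise-like variability that unvoiced speech needs.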

Page 19
MODELS & METHODS
Acoustic models and waveform generators
[Diagram: Linguistic features → acoustic models (SAR, RNN, DAR, GAN) producing F0 and MGC → waveform generators: WORLD (minimum phase), WORLD + phase recovery, PML, WaveNet. Systems: SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa]

Page 20
CONTENTS
❑ Introduction
❑ Models and modules
❑ Experiments
❑ Summary

Page 21
EXPERIMENTS
Configuration
❑ Data

  Corpus                       Size                            Note
  ATR Ximera F009 voice [14]   ~30,000 utterances, 48 hours    Sampling rate: 48 kHz; Japanese, neutral style, reading

  • Recording period: over 1 year

❑ Front-end: OpenJTalk [15]

  Feature                                              Dimension
  Linguistic   Phone identity, prosodic tags, ...      ~390
  Acoustic     MGC                                     60
               BAP                                     25
               F0                                      1

[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184, 2004.
[15] HTS Working Group. The Japanese TTS System 'Open JTalk', 2015.

Page 22
EXPERIMENTS
Configuration
❑ Listening test
  • Quality: MOS (1-5)
  • Similarity: rated 1-5 against a natural reference at 48 kHz
  • Participants: 235 native Japanese listeners, 1500 sets of results
❑ Systems
  • Common network configuration (cf. the paper)
  • Without Δ, Δ², nor formant enhancement
  • Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz (10 bits, μ-law)
[Table, layout lost in extraction: for each system (SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa, Abs-Wo, Abs-Pm, Natural) the waveform generator (WORLD, WORLD + phase recovery, PML, WaveNet) and the source of the acoustic features (RNN, GAN, SAR, or natural)]
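
The "10 bits, μ-law" setting can be illustrated with the standard μ-law companding formula. The slides do not give the exact quantizer, so μ = 1023 (1024 levels) and the rounding scheme below are assumptions; the function names are hypothetical.

```python
import math

MU = 1023  # assumed: 10 bits -> 1024 levels, indexed 0..1023

def mulaw_encode(x, mu=MU):
    """Compress x in [-1, 1] with mu-law and quantize to an integer level."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * mu))

def mulaw_decode(level, mu=MU):
    """Map an integer level back to an amplitude in [-1, 1]."""
    y = 2.0 * level / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# Round trip for a mid-range amplitude and for the extremes.
x = 0.3
x_rec = mulaw_decode(mulaw_encode(x))
top = mulaw_encode(1.0)
bottom = mulaw_encode(-1.0)
```

The logarithmic compression spends more levels on small amplitudes, which is why a 10-bit categorical output can still represent speech waveforms reasonably well.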

Page 23
EXPERIMENTS
[Table and figure: the systems SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa, Abs-Wo, Abs-Pm, and Natural, evaluated at 16 kHz and 48 kHz; cell layout and plot not recoverable]

Page 24
EXPERIMENTS
Quality scores
[Figure: quality scores for the same systems; plot not recoverable]

Page 25
EXPERIMENTS
Similarity scores
[Figure: similarity scores for the same systems; plot not recoverable]

Page 26
CONTENTS
❑ Introduction
❑ Models and methods
❑ Experiments
❑ Summary

Page 27
SUMMARY
Plug and test
❑ SAR
  • Avoids the conditional independence assumption
  • Alleviates the over-smoothing effect ☛ appendix & paper
❑ WaveNet
  • Generation method: one-best generation + random sampling
  • Less distortion of harmonics ☛ appendix & paper
❑ WaveNet vocoder + SAR & DAR
  • Better than other combinations
  • Worse than natural speech
  • Close to vocoded speech

Page 28
FURTHER IMPROVEMENT?
Recent work ☛ appendix
❑ SAR
  • Special case of volume-preserving normalizing flow [16]
  • Extended SAR = time-variant filter + RNN
❑ WaveNet vocoders
  • Training based on generated conditional features
❑ Annotated linguistic features
  • Reduce the gap between natural & synthetic speech
Future work?
❑ Reduce the variability of recordings
❑ Use complex-valued neural models

[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.

Page 29
MESSAGE
Code, recipes, slides
❑ Acoustic models & WaveNet (CUDA/C++)
  https://github.com/TonyWangX/CURRENNT_MODIFIED
  https://github.com/TonyWangX/CURRENNT_Recipes
❑ Simple explanation on WaveNet and acoustic models
  http://tonywangx.github.io/slides.html
Contact: [email protected]

Page 30
Thank you for your attention
Q&A
Contact: [email protected]

Page 31
APPENDIX - WAVENET
  • Conditional network
  • WaveNet-backend
  • Learning curve
  • Generation method
  • Generation with other acoustic models
  • Training based on generated features
Tutorial slides: http://tonywangx.github.io/pdfs/wavenet.pdf
  • No cherry picking
  • All samples based on generated acoustic features or automatically inferred linguistic features

Page 32
APPENDIX - WAVENET
Structure
[Diagram: WaveNet vocoder modeling $P(o_t \mid o_{t-R:t-1}, c_{1:N})$
  • Conditional feature network, clock rate 200 Hz (frame shift = 5 ms): input $c_{1:N}$ and F0; components Bi-LSTM, CNN, Linear, concatenation; output $l_t$ after up-sampling
  • WaveNet blocks 1 to M, clock rate 16 kHz: input $o_{t-1}$ → Linear → $e_{t-1}$; block outputs $s^{(1)}_t, s^{(2)}_t, \dots, s^{(M)}_t$ are summed
  • Post-processing network: Linear + tanh → Linear + tanh → Softmax]
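
The two clock rates imply that each 5 ms conditional frame must cover 16000 / 200 = 80 waveform samples. The slide only says "up-sampling", so the repetition scheme below is an assumption chosen for simplicity, and `upsample_by_repetition` is a hypothetical helper.

```python
def upsample_by_repetition(frames, frame_rate_hz=200, sample_rate_hz=16000):
    """Repeat each frame-level feature vector so there is one per waveform sample."""
    if sample_rate_hz % frame_rate_hz != 0:
        raise ValueError("sample rate must be a multiple of the frame rate")
    factor = sample_rate_hz // frame_rate_hz  # 80 samples per 5 ms frame
    return [frame for frame in frames for _ in range(factor)]

# Two hypothetical conditional frames become 160 sample-level vectors.
frames = [[0.1, 0.2], [0.3, 0.4]]
upsampled = upsample_by_repetition(frames)
```

Other choices such as interpolation or transposed convolution would fit the same diagram; repetition is just the simplest one to verify.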

Page 33
APPENDIX - WAVENET
Conditional network
❑ Architecture of the conditional network
  • Trial 1
  • Trial 2
  • Trial 3
[Diagrams: three mappings from the input features $c_{1:N}$ to the conditional features $l_{1:N}$, built from Bi-LSTM, CNN, Linear, and concatenation blocks; one variant handles F0 and MGC on separate paths]

Page 34
APPENDIX - WAVENET
Conditional network
❑ Experiments on WaveNet vocoder
  • Given generated MGC/F0
[Audio samples 1-3 for: Natural; Trial 1 (None); Trial 2 (LSTM + CNN); Trial 3 (LSTM + CNN + skip-F0)]

Page 35
APPENDIX - WAVENET
WaveNet backend
❑ Architecture
[Diagram: Text → Text analyzer → Textual features → WaveNet backend → Waveform, with an F0 model (DAR)]
❑ WaveNet-backend only uses random sampling
[Audio samples 1-5 for: Natural; WaveNet-vocoder; WaveNet-backend]

Page 36
APPENDIX - WAVENET
Learning curve
  • We trained the WaveNet backend for more than 100 epochs
[Figure: negative log-likelihood versus epoch on the training and validation sets, one panel for the WaveNet vocoder and one for the WaveNet backend]

Page 37
APPENDIX - WAVENET
Generation method
[Figure: generated waveform (μ-law levels) and output probability over waveform levels (0-1024) versus sampling point]

Page 38
APPENDIX - WAVENET
Generation method
❑ Experiments on WaveNet vocoder
[Figure: output probability over waveform levels at consecutive sampling points, two panels]

Page 39
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
  • Given generated MGC/F0
  • Mix: voiced region: one-best; unvoiced region: random sampling
  • Random: random sampling all the time
[Audio samples 1-3 for: Natural; Mix; Random]

Page 40
[Figure: comparison of Natural; voiced: one-best, unvoiced: sampling; voiced: sampling, unvoiced: sampling]

Page 41
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
  • Mix: voiced region: one-best; unvoiced region: random sampling
  • Mix2: 75% of voiced frames: one-best; else: random sampling
  • …
  • Mix4: 25% of voiced frames: one-best; else: random sampling

Page 42
INVESTIGATION
Generation method
❑ One-best + random sampling
[Audio samples 1-3 for: Natural; Mix; Mix2; Mix3; Mix4; Random. Labels beside the rows, as printed: 100%, 75%, 50%, 20%, 00%]

Page 43
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
[Audio samples 1-3 for: Natural; WaveNet backend (Mix); WaveNet backend (Random); WaveNet vocoder (Mix)]

Page 44
[Figure: comparison of Natural; Mix; Random; WaveNet-backend Mix; WaveNet-backend Random]

Page 45
APPENDIX - WAVENET
WaveNet-vocoder + other acoustic models
[Audio samples 1-3 for: Natural; SAR + DAR; SGA + DAR; RGA + DAR; RNN + DAR; extended SAR + DAR]

Page 46
APPENDIX - WAVENET
WaveNet-vocoder: training using generated MGC
[Audio samples 1-3 for: Natural; WaveNet-Backend; WaveNet-Vocoder trained on natural MGC; WaveNet-Vocoder trained on generated MGC at epochs 35, 45, and 55]

Page 47
APPENDIX – ACOUSTIC MODELS
  • General comparison
  • SAR & DAR
  • SAR extension
More details: http://tonywangx.github.io/pdfs/talk.pdf

Page 48
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
[Figure: MGC 1st and 5th dimensions versus frame index (utterance ATR Ximera F009 AOZORAR 03372 T01) for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]

Page 49
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
[Figure: MGC 15th and 31st dimensions versus frame index (same utterance) for Natural, RNN, SAR]

Page 50
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
[Figure: MGC 15th and 31st dimensions versus frame index (same utterance) for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]

Page 51
APPENDIX – ACOUSTIC MODELS
General comparison
❑ GV
[Figure: global variance: utterance-level GV of MGC versus dimension index of MGC, for Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA)]
[Figure: modulation spectrum (MGC 31st dimension)]
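
Utterance-level global variance (GV) is the per-dimension variance of a feature trajectory over the frames of one utterance; over-smoothed trajectories show up as a lower GV than natural speech. The sketch below uses made-up two-dimensional trajectories, and `global_variance` is a hypothetical helper. Any scaling used for the plot (for example a log scale) is not stated on the slide and is not reproduced here.

```python
def global_variance(features):
    """Per-dimension variance over the frames of one utterance.

    features: list of frames, each a list of D feature values.
    """
    n_frames = len(features)
    n_dims = len(features[0])
    means = [sum(f[d] for f in features) / n_frames for d in range(n_dims)]
    return [sum((f[d] - means[d]) ** 2 for f in features) / n_frames
            for d in range(n_dims)]

# Hypothetical trajectories: the "smoothed" one varies less around the same mean.
natural = [[0.0, 1.0], [2.0, -1.0], [4.0, 1.0], [2.0, -1.0]]
smoothed = [[1.5, 0.2], [2.0, -0.2], [2.5, 0.2], [2.0, -0.2]]

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```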

Page 52
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • SAR versus RMDN with a recurrent output layer [11]
  • Assume $o_t \in \mathbb{R}$ and $\Sigma_t = 1$, linear activation function
[Diagram: RMDN and SAR networks for two time steps, showing inputs $x_1, x_2$, hidden layer $h_1, h_2$, output layer $\mu_1, \mu_2$, observations $o_1, o_2$, and the SAR feedback weight $a$]

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.

Page 53
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • RMDN (recurrent weight $w_\mu$ from $\mu_1$ to $\mu_2$):

$$\mu_1 = w^\top h_1 + b$$
$$\mu_2 = w^\top h_2 + b + w_\mu \mu_1 = \tilde{\mu}_2 + w_\mu \mu_1$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1)$$

  • SAR (feedback weight $a$ from $o_1$ to $o_2$):

$$\mu_1 = w^\top h_1 + b$$
$$\mu_2 = w^\top h_2 + b$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \mu_2 + a o_1, 1)$$

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.

Page 54
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • RMDN:

$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$
$$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \tilde{\mu}_2 + w_\mu \mu_1]^\top, \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

  • SAR:

$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$
$$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \Sigma = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$

  • Dependency between $\mu_t$ or $o_t$?

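Page 55
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • Transformed features:

$$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix} \begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = A o, \quad o = [o_1, o_2]^\top$$

  • Model on $c$ (same form as the RMDN without feedback):

$$p(o) = p(c) = \mathcal{N}(c; \mu_c, \Sigma_c), \quad \mu_c = [\mu_1, \mu_2]^\top, \quad \Sigma_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

  • Equivalent model on $o$ (SAR):

$$p(o) = \mathcal{N}(o; \mu_o, \Sigma_o), \quad \mu_o = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \Sigma_o = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$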
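
The toy result can be checked numerically: since c = A o with c distributed as N(μ_c, I), the observation o = A⁻¹c has mean A⁻¹μ_c and covariance A⁻¹A⁻ᵀ. The sketch below verifies this for one arbitrary value of a; the helper functions and the numbers are illustrative assumptions.

```python
def matmul(x, y):
    """Product of two matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def transpose(x):
    """Transpose of a matrix given as nested lists."""
    return [list(row) for row in zip(*x)]

a = 0.5                          # arbitrary AR coefficient for the check
A = [[1.0, 0.0], [-a, 1.0]]      # c = A o
A_inv = [[1.0, 0.0], [a, 1.0]]   # o = A^{-1} c

identity = matmul(A, A_inv)

# With Sigma_c = I, the covariance of o is A^{-1} A^{-T}.
sigma_o = matmul(A_inv, transpose(A_inv))

# The mean of o is A^{-1} mu_c = [mu_1, mu_2 + a * mu_1].
mu_c = [0.3, -0.2]
mu_o = [sum(A_inv[i][j] * mu_c[j] for j in range(2)) for i in range(2)]
```

The determinant of A is 1, which is why the density keeps the same 1/(2π) normalizer in both parameterizations.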

Page 56
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • For $o_{1:T} \in \mathbb{R}^{D \times T}$ with rows $o_{1:T,1}, \cdots, o_{1:T,D}$, and $c_{1:T}$ with rows $c_{1:T,1}, \cdots, c_{1:T,D}$
  • SAR is equivalent to:
    Training: $o_{1:T}$ → $A^{(1)}, \dots, A^{(D)}$ (one matrix per dimension) → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
    Generation: $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → $A^{(1)-1}, \dots, A^{(D)-1}$ → $\hat{o}_{1:T}$

Page 57
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • For $o_{1:T} \in \mathbb{R}^{D \times T}$
  • SAR is equivalent to:
    Training: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ (filter 1 to filter D) → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
    Generation: $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$

Page 58
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • Only due to ?
  • Due to , less mismatch between and RMDN
  (Symbols that filled the gaps on the slide, positions lost in extraction: $\{A_d(z), 1/A_d(z)\}$, $c_{1:T}$)
[Diagram: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $c_{1:T}$; $\hat{c}_{1:T}$ → filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$]
[Figure: magnitude (dB) versus frequency bin (π/1024), two panels, with curves labeled $H_1(z)$, $A_1(z)$, and $1/A_1(z)$]

Page 59
APPENDIX – ACOUSTIC MODELS
SAR extension: normalizing flow [13, 14]
❑ Basic idea

$$c_{1:T} = f(o_{1:T}), \quad \hat{o}_{1:T} = f^{-1}(\hat{c}_{1:T})$$

$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$

  • The Jacobian matrix must be simple
  • $f(\cdot)$ must be invertible
[Diagram: training: $o_{1:T}$ → $f$ → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$; generation: $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → $f^{-1}$ → $\hat{o}_{1:T}$]

[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.

Page 60
APPENDIX – ACOUSTIC MODELS
SAR extension: normalizing flow
❑ Basic idea

$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$

  • Simple for SAR:

$$\left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right| = 1$$

  • SAR
    Transform: $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$
    De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$
  • AR flow, with $\mu_t = \mathrm{RNN}(o_{1:t-1})$
    Transform: $c_t = o_t - \sum_{k=1}^{K} f^{(k)}(o_{1:t-k}) \odot o_{t-k}$
    De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} f^{(k)}(\hat{o}_{1:t-k}) \odot \hat{o}_{t-k}$
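
The SAR transform and de-transform can be checked with a small round trip. The sketch assumes scalar features and fixed coefficients a_k, with missing history before the first frame treated as absent; the function names and data are hypothetical. It also builds the Jacobian column by column (the transform is linear) to show that it is lower-triangular with a unit diagonal, hence has determinant 1.

```python
def sar_transform(o, a):
    """c_t = o_t - sum_k a_k * o_{t-k} (an FIR filter A(z) applied to o)."""
    order = len(a)
    return [o[t] - sum(a[k - 1] * o[t - k]
                       for k in range(1, order + 1) if t - k >= 0)
            for t in range(len(o))]

def sar_detransform(c, a):
    """o_hat_t = c_t + sum_k a_k * o_hat_{t-k} (the inverse filter 1/A(z))."""
    order = len(a)
    o_hat = []
    for t, c_t in enumerate(c):
        o_hat.append(c_t + sum(a[k - 1] * o_hat[t - k]
                               for k in range(1, order + 1) if t - k >= 0))
    return o_hat

o = [1.0, 2.0, 0.5, -1.0, 0.25]
a = [0.5, 0.25]  # K = 2

c = sar_transform(o, a)
o_rec = sar_detransform(c, a)

# Column j of the Jacobian dc/do is the transform of the j-th unit vector.
n = len(o)
jacobian_cols = [sar_transform([1.0 if s == j else 0.0 for s in range(n)], a)
                 for j in range(n)]
```

Because the determinant is 1, the likelihood of o equals the likelihood of c with no correction term, which keeps training as simple as for the baseline model.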

Page 61
APPENDIX – ACOUSTIC MODELS
SAR extension
❑ SAR can be extended
More details: http://tonywangx.github.io/pdfs/talk.pdf

Page 62
APPENDIX – ACOUSTIC MODELS
DAR
❑ Same autoregressive principle
❑ But non-invertible and nonlinear

p-value

          NAT        DAR        SAR        RMDN       RNN
  NAT                <1e-30     <1.0e-30   <1.0e-30   <1.0e-30
  DAR     <1e-30                1.6e-28    6.3e-19    2.4e-30
  SAR     <1e-30     1.6e-28               0.015      0.949
  RMDN    <1e-30     6.3e-19    0.015                 0.014
  RNN     <1e-30     2.4e-30    0.949      0.014

[Figure: MOS score for NAT, RNN, RMDN, SAR, DAR]
[Figure: F0 GV: utterance-level GV of F0 (Hz) for NAT, DAR, SAR, RMDN, RNN]

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)