
Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis

Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI

National Institute of Informatics, Japan & Aalto University, Finland

2018-04-17

Contact: [email protected]; suggestions and discussion are welcome

ICASSP 2018, Calgary, Canada

OVERVIEW

Motivation
• Better modules for the statistical parametric speech synthesis (SPSS) framework?

Method
• Plug in and test new acoustic models and waveform generators

Results
• Best combination: autoregressive (AR) acoustic models + WaveNet-based vocoder
• Quality: as good as vocoded speech (at 16 kHz)

CONTENTS

• Introduction
• Models and modules
• Experiments
• Summary

INTRODUCTION

Background

Conventional TTS pipeline [1]
Text → Front-end → Linguistic features → Back-end → Speech

SPSS back-end [2, 3]
Linguistic features → Acoustic models → Spectral features (MGC & BAP) and F0 → Waveform generators → Speech
• MGC: Mel-generalized cepstral coefficients [4]
• BAP: band aperiodicity

[1] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda, et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
[3] H. Zen, et al. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.

Topic of this work

Better modules for the SPSS back-end?
Linguistic features → Acoustic models → MGC & BAP and F0 → Waveform generators → Speech

Acoustic models
• Recurrent neural networks (RNNs)
• Autoregressive (AR) models
• Generative adversarial network (GAN)
• …

Waveform generators
• WORLD vocoder
• WORLD vocoder + phase recovery
• WaveNet-based vocoder
• …


MODELS & METHODS

Baseline system: Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms

Acoustic models: baseline RNN
• Sequence of linguistic features: $\{x_1, \cdots, x_t, \cdots\}$
• Sequence of generated acoustic features: $\{\hat{o}_1, \cdots, \hat{o}_t, \cdots\}$


Baseline RNN as a probabilistic model [5]

• Neural network: $\mathcal{M}_t = \{\mu_t\}$, where $\mu_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$ and $H^{(\mathrm{RNN})}_{\Theta}(\cdot)$ is the RNN
• Probabilistic model:

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$

• Generation: $\hat{o}_t = \mu_t$

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Limitations
1. Conditional independence (motivates AR models)
2. Maximum-likelihood training (motivates GAN)
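With the identity covariance used in the baseline model, maximizing the likelihood amounts to minimizing the squared error between $o_t$ and $\mu_t$. The following is a minimal NumPy sketch of that relation; the function names and the random test data are illustrative assumptions, not part of the original system.

```python
import numpy as np

def gaussian_nll_identity(o, mu):
    """Negative log-likelihood of o under prod_t N(o_t; mu_t, I).

    o, mu: arrays of shape (T, D). Hypothetical helper for illustration.
    """
    T, D = o.shape
    const = 0.5 * T * D * np.log(2.0 * np.pi)
    return 0.5 * np.sum((o - mu) ** 2) + const

def sum_squared_error(o, mu):
    """Plain sum of squared errors over all frames and dimensions."""
    return np.sum((o - mu) ** 2)

rng = np.random.default_rng(0)
o = rng.normal(size=(5, 3))      # stand-in for natural acoustic features
mu_a = rng.normal(size=(5, 3))   # stand-in for one network's output
mu_b = rng.normal(size=(5, 3))   # stand-in for another network's output

# The NLL difference between two candidate outputs is half the SSE difference,
# so both criteria rank the candidates identically.
nll_gap = gaussian_nll_identity(o, mu_a) - gaussian_nll_identity(o, mu_b)
sse_gap = sum_squared_error(o, mu_a) - sum_squared_error(o, mu_b)
```

Because each frame is scored independently given $x_{1:T}$, nothing in this criterion ties $o_t$ to its neighbours, which is the conditional-independence limitation listed above.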

Acoustic models: shallow AR (SAR) [6]

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f(o_{t-K:t-1}), I)$$

• The figure shows the case K = 2: each $o_t$ depends on the two previous frames
• Alternative interpretation: trainable filter + RNN

[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.
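The SAR likelihood can be evaluated by first removing the autoregressive part and then scoring the residual with the same unit-covariance Gaussian as the baseline. The sketch below assumes a linear, per-dimension $f(\cdot)$ with coefficients `a` (an assumption for illustration; `sar_residual` and `sar_nll` are hypothetical names) and zero history before the first frame.

```python
import numpy as np

def sar_residual(o, a):
    """c_t = o_t - sum_k a[k-1] * o_{t-k}, with o_t = 0 for t < 1.

    o: (T, D) acoustic features; a: (K, D) per-dimension AR coefficients.
    """
    T = o.shape[0]
    c = o.copy()
    for k in range(1, a.shape[0] + 1):
        c[k:] -= a[k - 1] * o[:T - k]
    return c

def sar_nll(o, mu, a):
    """-log prod_t N(o_t; mu_t + f(o_{t-K:t-1}), I) for the linear f above."""
    T, D = o.shape
    c = sar_residual(o, a)
    return 0.5 * np.sum((c - mu) ** 2) + 0.5 * T * D * np.log(2.0 * np.pi)

o = np.array([[1.0], [2.0], [3.0]])   # toy 1-dimensional sequence
a = np.array([[0.5]])                 # K = 1
c = sar_residual(o, a)                # residual after removing the AR part
```

Setting `a` to zeros recovers the baseline RNN criterion, which is one way to read the "trainable filter + RNN" interpretation.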

Acoustic models: deep AR (DAR) [7]

• Only for (quantized) F0 modeling

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)

Acoustic models: GAN

• GAN-based post-filter [8]

[Diagram: the acoustic model maps linguistic features to generated acoustic features; a residual generator, driven by noise, produces a residual that is added (+) to the generated features; a discriminator labels acoustic features as natural or generated (0/1)]

[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, pages 4910–4914, 2017.

Acoustic models in the experiments (BAP is not shown)

Linguistic features → Acoustic models → F0 and MGC → Waveform generators (WORLD) → Speech waveforms
• F0: DAR
• MGC: SAR or RNN, each optionally followed by a GAN

Systems with the WORLD vocoder
• SAR-Wo: SAR
• SGA-Wo: SAR + GAN
• RGA-Wo: RNN + GAN
• RNN-Wo: RNN

Waveform generators: deterministic approaches

• WORLD [9]
  o Binary voicing decision
  o Minimum phase
• A log domain pulse model (PML) [10]
  o Source-filter model, additive in the log domain
  o Binary noise mask
• WORLD + phase recovery
  o Phase recovery [11]: generated waveform → STFT amplitude → phase recovery → inverse STFT → "phase-recovered" waveform

[9] M. Morise, et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[10] G. Degottex, et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2):236–243, 1984.
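The phase-recovery step keeps only the STFT amplitude of the generated waveform and iteratively estimates a phase that is consistent with it, in the spirit of [11]. The sketch below is a NumPy-only illustration; the window (Hann), frame length, hop size, iteration count, and all function names are assumptions chosen for the example, not the settings used in the experiments.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT by windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30, seed=0):
    """Estimate a waveform whose STFT amplitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        x = istft(magnitude * phase)                  # back to a waveform
        spec = stft(x)                                # re-analyse it
        phase = spec / np.maximum(np.abs(spec), 1e-8)  # keep only the phase
    return istft(magnitude * phase)

# Toy "generated waveform": two sinusoids at an assumed 16 kHz sampling rate.
t = np.arange(2048) / 16000.0
target = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
magnitude = np.abs(stft(target))
recovered = griffin_lim(magnitude, n_iter=30)
```

In the WORLD + phase recovery system the input would be the vocoder output rather than a synthetic tone; only the amplitude is trusted and the phase is re-estimated.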

Waveform generators: WaveNet-based vocoder [12, 13]

• How to generate (search for) a waveform:
  1. Exploration: sampling
  2. Exploitation: picking the one-best
• Sampling in unvoiced regions
• Picking the one-best in voiced regions
• Less distortion of harmonics (see appendix & paper)

[Figure: output probability over waveform levels at successive sampling points, for an unvoiced segment (sampling points 9300-9320) and a voiced region (sampling points 8900-8915)]

[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, pages 1118–1122, 2017.
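The mixed generation strategy can be sketched for a single time step as follows: take the most probable waveform level in voiced regions and draw a random level in unvoiced regions. This is a simplified illustration; the function name `generate_sample` is hypothetical, and the WaveNet itself (which would produce `probs` autoregressively) is omitted.

```python
import numpy as np

def generate_sample(probs, voiced, rng):
    """Choose one quantized waveform level from a softmax output.

    probs: (L,) probabilities over waveform levels for the current time step.
    voiced: True if the current frame is voiced.
    """
    if voiced:
        return int(np.argmax(probs))              # exploitation: one-best
    return int(rng.choice(len(probs), p=probs))   # exploration: sampling

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.7, 0.2])                 # toy 3-level distribution
voiced_level = generate_sample(probs, True, rng)
unvoiced_levels = [generate_sample(probs, False, rng) for _ in range(1000)]
```

In a voiced frame the choice is deterministic, which is consistent with the goal of reducing harmonic distortion; in an unvoiced frame the draws follow the predicted distribution and keep the noise-like character.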


Systems in the comparison

Linguistic features → Acoustic models (F0: DAR; MGC: SAR or RNN, optionally with GAN) → Waveform generators

• WORLD (minimum phase): SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo
• PML: SAR-Pm
• WORLD + phase recovery: SAR-Pr
• WaveNet: SAR-Wa


EXPERIMENTS

Configuration: data

Corpus                       Size                          Note
ATR Ximera F009 voice [14]   ~30,000 utterances, 48 hours  Sampling rate: 48 kHz; Japanese, neutral style, reading

• Recording period: over 1 year

Front-end: OpenJTalk [15]

Feature                                             Dimension
Linguistic (phone identity, prosodic tags, ...)     ~390
Acoustic: MGC                                       60
Acoustic: BAP                                       25
Acoustic: F0                                        1

[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184, 2004.
[15] HTS Working Group. The Japanese TTS System 'Open JTalk', 2015.

Configuration: listening test

• Quality: MOS (1-5)
• Similarity: rated 1-5, with a natural 48 kHz reference
• Participants: 235 native Japanese listeners, 1500 sets of results

Systems
• Common network configuration (cf. the paper)
• Without …, nor formant enhancement
• Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz only (10 bits, μ-law)
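The "10 bits, μ-law" setting for SAR-Wa means that the waveform is companded and quantized into 1024 discrete levels before it is modeled by the WaveNet. A minimal sketch of the standard μ-law mapping with μ = 1023 follows; the function names and the rounding convention are assumptions for illustration.

```python
import numpy as np

MU = 1023  # 10 bits -> 1024 levels (0 .. 1023)

def mulaw_encode(x, mu=MU):
    """Map a waveform in [-1, 1] to integer levels in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mulaw_decode(levels, mu=MU):
    """Map integer levels in [0, mu] back to a waveform in [-1, 1]."""
    y = 2.0 * levels.astype(np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu  # expand

x = np.linspace(-1.0, 1.0, 101)
levels = mulaw_encode(x)
x_hat = mulaw_decode(levels)
```

The quantization is finer near zero amplitude and coarser near full scale, which is why the round-trip error is largest for samples close to ±1.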

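System    Acoustic model (MGC)   Waveform generator
SAR-Wo    SAR                    WORLD
SGA-Wo    SAR + GAN              WORLD
RGA-Wo    RNN + GAN              WORLD
RNN-Wo    RNN                    WORLD
SAR-Pm    SAR                    PML
SAR-Pr    SAR                    WORLD + phase recovery
SAR-Wa    SAR                    WaveNet
Abs-Wo    natural features       WORLD
Abs-Pm    natural features       PML
Natural   -                      -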

[Figure: quality scores (MOS) of all systems, at 16 kHz and 48 kHz]

[Figure: similarity scores of all systems]


SUMMARY

Plug and test

SAR
• Avoids the conditional independence assumption
• Alleviates the over-smoothing effect (see appendix & paper)

WaveNet
• Generation method: one-best generation + random sampling
• Less distortion of harmonics (see appendix & paper)

WaveNet vocoder + SAR & DAR
• Better than other combinations
• Worse than natural speech
• Close to vocoded speech

FURTHER IMPROVEMENT?

Recent work (see appendix)

SAR
• Special case of volume-preserving normalizing flow [16]
• Extended SAR = time-variant filter + RNN

WaveNet vocoders
• Training based on generated conditional features

Annotated linguistic features
• Reduce the gap between natural & synthetic speech

Future work?
• Reduce the variability of recordings
• Use complex-valued neural models

[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.

MESSAGE

Code, recipes, slides
• Acoustic models & WaveNet (CUDA/C++):
  https://github.com/TonyWangX/CURRENNT_MODIFIED
  https://github.com/TonyWangX/CURRENNT_Recipes
• Simple explanation of WaveNet and acoustic models:
  http://tonywangx.github.io/slides.html

Contact: [email protected]

Thank you for your attention

Q & A

APPENDIX - WAVENET

• Conditional network
• WaveNet backend
• Learning curve
• Generation method
• Generation with other acoustic models
• Training based on generated features

Tutorial slides: http://tonywangx.github.io/pdfs/wavenet.pdf

• No cherry picking
• All samples are based on generated acoustic features or automatically inferred linguistic features

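Structure

$$P(o_t \mid o_{t-R:t-1}, c_{1:N})$$

• Conditional feature network (clock rate: 200 Hz, frame shift = 5 ms): takes $c_{1:N}$ and F0; built from Bi-LSTM, CNN, linear, and concatenation layers; its output is up-sampled to the waveform rate to give $l_t$
• WaveNet blocks 1 to M (clock rate: 16 kHz): the previous sample $o_{t-1}$ passes through a linear layer to give $e_{t-1}$; block $m$ outputs $s^{(m)}_t$, and the block outputs are summed (+)
• Post-processing network: Linear + tanh → Linear + tanh → Softmax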
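The two clock rates imply an up-sampling factor of 16000 / 200 = 80: each 5 ms frame of conditional features has to cover 80 waveform samples. The sketch below uses plain repetition of each frame, which is an assumption for illustration (the slide only states that up-sampling takes place); the function name is hypothetical.

```python
import numpy as np

FRAME_RATE_HZ = 200        # frame shift = 5 ms
WAVEFORM_RATE_HZ = 16000   # waveform sampling rate

def upsample_by_repetition(frames):
    """Repeat each frame so that there is one feature vector per sample.

    frames: (N, D) frame-level conditional features.
    Returns an array of shape (N * 80, D).
    """
    factor = WAVEFORM_RATE_HZ // FRAME_RATE_HZ   # 80 samples per frame
    return np.repeat(frames, factor, axis=0)

frames = np.arange(12, dtype=np.float64).reshape(3, 4)   # 3 toy frames
sample_level = upsample_by_repetition(frames)
```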

Conditional network: architecture

• Trial 1: none ($c_{1:N}$ is used directly as $l_{1:N}$)
• Trial 2: Bi-LSTM + CNN
• Trial 3: Bi-LSTM + CNN + skip connection for F0

Conditional network: experiments on the WaveNet vocoder

• Given generated MGC/F0

[Audio samples 1-3: Natural; Trial 1 (none); Trial 2 (LSTM + CNN); Trial 3 (LSTM + CNN + skip-F0)]

WaveNet backend

Architecture
• Text → Text analyzer → Textual features → WaveNet backend → Waveform
• An F0 model (DAR) is part of the pipeline
• The WaveNet backend only uses random sampling

[Audio samples 1-5: Natural; WaveNet vocoder; WaveNet backend]

Learning curve

• We trained the WaveNet backend for more than 100 epochs

[Figure: negative log-likelihood against epoch (1-43) on the training and validation sets, for the WaveNet vocoder and the WaveNet backend]

Generation method

[Figure: waveform (μ-law) and output probability over waveform levels (0-1024) against sampling point (8900-9900)]

Generation method: experiments on the WaveNet vocoder

[Figure: probability over waveform levels at sampling points 9300-9315 and 8900-8915]

Generation method: one-best + random sampling

• Given generated MGC/F0
• Mix: one-best in voiced regions, random sampling in unvoiced regions
• Random: random sampling all the time

[Audio samples: Natural; Mix (voiced: one-best, unvoiced: sampling); Random (voiced: sampling, unvoiced: sampling)]

Investigation

• Mix: one-best in voiced regions, random sampling in unvoiced regions
• Mix2: one-best for 75% of voiced frames, random sampling elsewhere
• …
• Mix4: one-best for 25% of voiced frames, random sampling elsewhere

[Audio samples: Natural; Mix (100%); Mix2 (75%); Mix3 (50%); Mix4 (25%); Random (0%)]

[Audio samples: Natural; WaveNet vocoder with Mix and Random; WaveNet backend with Mix and Random]

WaveNet vocoder + other acoustic models

[Audio samples: Natural; SAR + DAR; SGA + DAR; RGA + DAR; RNN + DAR; extended SAR + DAR]

WaveNet vocoder: training using generated MGC

[Audio samples: Natural; WaveNet backend; WaveNet vocoder trained on natural MGC; WaveNet vocoder trained on generated MGC (epoch 35)]

APPENDIX - ACOUSTIC MODELS

• General comparison
• SAR & DAR
• SAR extension

More details: http://tonywangx.github.io/pdfs/talk.pdf

General comparison: generated trajectories

[Figure: trajectories of the 1st and 5th MGC dimensions against frame index (utterance ATR Ximera F009 AOZORAR 03372 T01) for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA)]

[Figure: trajectories of the 15th and 31st MGC dimensions for Natural, RNN, and SAR]

General comparison: GV

[Figure: utterance-level global variance (GV) of MGC against MGC dimension index for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA)]

[Figure: modulation spectrum of the 31st MGC dimension]
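Utterance-level global variance is the variance of each feature dimension over the frames of one utterance; over-smoothed trajectories show up as a lower GV than natural ones. The sketch below is illustrative only: the function name is hypothetical, and the "natural" and "over-smoothed" trajectories are synthetic stand-ins (random noise and its moving average), not MGC data.

```python
import numpy as np

def utterance_gv(features):
    """Global variance per dimension for one utterance.

    features: (T, D) trajectory; returns (D,) variances over the T frames.
    """
    return np.var(features, axis=0)

rng = np.random.default_rng(0)
natural_like = rng.normal(size=(500, 4))     # stand-in for natural MGC
kernel = np.ones(5) / 5.0                    # 5-frame moving average
smoothed = np.stack(
    [np.convolve(natural_like[:, d], kernel, mode="same") for d in range(4)],
    axis=1,
)                                            # stand-in for over-smoothed MGC

gv_natural = utterance_gv(natural_like)
gv_smoothed = utterance_gv(smoothed)
```

Plotting GV on a logarithmic axis per MGC dimension gives a comparison of the kind shown in the figure.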

SAR & DAR: toy SAR example

• SAR versus RMDN with a recurrent output layer [11]
• Assume $o_t \in \mathbb{R}$ and $\Sigma_t = 1$, with a linear activation function
• Two time steps: inputs $x_1, x_2$, hidden layer $h_1, h_2$, output layer $\mu_1, \mu_2$, targets $o_1, o_2$

RMDN (recurrent weight $w_\mu$ from $\mu_1$ to $\mu_2$ in the output layer)

$$\mu_1 = w^\top h_1 + b$$
$$\mu_2 = w^\top h_2 + b + w_\mu \mu_1 = \tilde{\mu}_2 + w_\mu \mu_1$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1)$$

SAR (feedback weight $a$ from $o_1$ to the distribution of $o_2$)

$$\mu_1 = w^\top h_1 + b$$
$$\mu_2 = w^\top h_2 + b$$
$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1)$$

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.

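Dependency between $\mu_t$ or between $o_t$?

RMDN

$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$

$$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \tilde{\mu}_2 + w_\mu \mu_1]^\top, \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

SAR

$$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$

$$o = [o_1, o_2]^\top, \quad \mu = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \Sigma = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$

SAR as a change of variable

$$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix} \begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = A o$$

$$p(o) = p(c) = \mathcal{N}(c; \mu_c, \Sigma_c), \quad \mu_c = [\mu_1, \mu_2]^\top, \quad \Sigma_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$p(o) = \mathcal{N}(o; \mu_o, \Sigma_o), \quad \mu_o = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \Sigma_o = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$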
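The mean and covariance of the toy SAR model follow from the change of variable $c = A o$ with $c \sim \mathcal{N}(\mu_c, I)$: $\mu_o = A^{-1}\mu_c$ and $\Sigma_o = A^{-1} A^{-\top}$. The short NumPy check below reproduces them numerically; the values of `a`, `mu_1`, and `mu_2` are arbitrary choices for illustration.

```python
import numpy as np

a = 0.7                                   # arbitrary AR coefficient
mu_1, mu_2 = 0.3, -1.2                    # arbitrary network outputs

A = np.array([[1.0, 0.0],
              [-a, 1.0]])                 # c = A o
A_inv = np.linalg.inv(A)                  # o = A^{-1} c

mu_c = np.array([mu_1, mu_2])
sigma_c = np.eye(2)

mu_o = A_inv @ mu_c                       # mean of o
sigma_o = A_inv @ sigma_c @ A_inv.T       # covariance of o
det_A = np.linalg.det(A)                  # Jacobian determinant of c = A o
```

Since the determinant of $A$ is 1, $p(o) = p(c)$ holds without any extra normalization term, and the off-diagonal entry $a$ in $\Sigma_o$ is the dependency between $o_1$ and $o_2$ that the RMDN with identity covariance lacks.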

SAR & DAR: SAR as an invertible linear feature/model transformation

• For $o_{1:T} \in \mathbb{R}^{D \times T}$ with rows $o_{1:T,1}, \cdots, o_{1:T,D}$
• SAR is equivalent to:
  o Training: $c_{1:T,d} = A^{(d)} o_{1:T,d}$ for $d = 1, \cdots, D$, and $c_{1:T}$ is modeled by $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
  o Generation: given $x_{1:T}$, $\hat{c}_{1:T}$ is generated from $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$, and $\hat{o}_{1:T,d} = A^{(d)-1} \hat{c}_{1:T,d}$
• In signal-processing terms:
  o Training: $o_{1:T}$ passes through the filters $A_1(z), \cdots, A_D(z)$ to give $c_{1:T}$
  o Generation: $\hat{c}_{1:T}$ passes through the filters $1/A_1(z), \cdots, 1/A_D(z)$ to give $\hat{o}_{1:T}$


• Only due to $1/A_d(z)$?
• Due to $\{A_d(z), 1/A_d(z)\}$, less mismatch between $c_{1:T}$ and RMDN

[Diagram: $o_{1:T}$ → filters $A_1(z), \cdots, A_D(z)$ → $c_{1:T}$; $\hat{c}_{1:T}$ → filters $1/A_1(z), \cdots, 1/A_D(z)$ → $\hat{o}_{1:T}$]

[Figure: magnitude responses (dB) of $H_1(z)$ and $A_1(z)$ against frequency bin]

SAR extension: normalizing flow

Basic idea [13, 14]

$$c_{1:T} = f(o_{1:T}), \qquad \hat{o}_{1:T} = f^{-1}(\hat{c}_{1:T})$$

$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$

with $c_{1:T}$ modeled by $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$.

• The Jacobian matrix must be simple
• $f(\cdot)$ must be invertible

[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.

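SAR extension: normalizing flow

• Simple for SAR:

$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}), \qquad \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right| = 1$$

SAR
• Transform: $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$
• De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$

AR flow
• Transform: $c_t = o_t - \sum_{k=1}^{K} f^{(k)}(o_{1:t-k}) \odot o_{t-k}$
• De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} f^{(k)}(\hat{o}_{1:t-k}) \odot \hat{o}_{t-k}$
• $\mu_t = \mathrm{RNN}(o_{1:t-1})$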
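The SAR transform and de-transform form an exactly invertible pair, and the Jacobian of the transform is lower triangular with a unit diagonal, so its determinant is 1. The sketch below checks both properties numerically; the coefficient values, sequence length, and function names are assumptions for illustration, and frames before the first one are taken to be zero.

```python
import numpy as np

def transform(o, a):
    """c_t = o_t - sum_k a_k * o_{t-k}; o: (T, D), a: (K, D)."""
    T = o.shape[0]
    c = o.copy()
    for k in range(1, a.shape[0] + 1):
        c[k:] -= a[k - 1] * o[:T - k]
    return c

def detransform(c, a):
    """o_t = c_t + sum_k a_k * o_{t-k}, computed frame by frame."""
    T = c.shape[0]
    o = np.zeros_like(c)
    for t in range(T):
        o[t] = c[t]
        for k in range(1, min(t, a.shape[0]) + 1):
            o[t] += a[k - 1] * o[t - k]
    return o

rng = np.random.default_rng(0)
a = np.array([[0.5, -0.3],
              [0.2, 0.1]])                  # K = 2, D = 2 (arbitrary values)
o = rng.normal(size=(20, 2))
o_back = detransform(transform(o, a), a)    # round trip

# Jacobian of the transform for one dimension: apply it to unit impulses.
T = 6
a_1d = a[:, :1]                             # coefficients of dimension 1
jacobian = np.stack(
    [transform(e[:, None], a_1d)[:, 0] for e in np.eye(T)], axis=1
)
jacobian_det = np.linalg.det(jacobian)
```

The transform is a finite-impulse-response filter and the de-transform is the matching recursive filter, which is why generation has to proceed frame by frame while training can filter the whole sequence at once.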

SAR extension
• SAR can be extended
• More details: http://tonywangx.github.io/pdfs/talk.pdf

DAR
• Same autoregressive principle
• But non-invertible and nonlinear

DAR: MOS score and F0 GV [7]

p-values

        NAT      DAR       SAR       RMDN      RNN
NAT     -        <1e-30    <1e-30    <1e-30    <1e-30
DAR     <1e-30   -         1.6e-28   6.3e-19   2.4e-30
SAR     <1e-30   1.6e-28   -         0.015     0.949
RMDN    <1e-30   6.3e-19   0.015     -         0.014
RNN     <1e-30   2.4e-30   0.949     0.014     -

[Figure: MOS score for NAT, RNN, RMDN, SAR, and DAR]

[Figure: utterance-level GV of F0 (Hz) for NAT, DAR, SAR, RMDN, and RNN]

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)
