Page 1
Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI
National Institute of Informatics, Japan & Aalto University, Finland
2018-04-17
Contact: [email protected] (suggestions and discussion welcome)
ICASSP 2018, Calgary, Canada
Page 2
OVERVIEW
❑ Motivation
  • Better modules for the statistical parametric speech synthesis (SPSS) framework?
❑ Method
  • Plug in and test new acoustic models and waveform generators
❑ Results
  • Best combination: autoregressive (AR) acoustic models + WaveNet-based vocoder
  • Quality: as good as vocoded speech (at 16 kHz)
Page 3
CONTENTS
❑ Introduction
❑ Models and modules
❑ Experiments
❑ Summary
Page 4
INTRODUCTION
Background
❑ Conventional TTS pipeline [1]
    Text → Front-end → Linguistic features → Back-end → Speech
❑ SPSS back-end [2, 3]
    Linguistic features → Acoustic models → Spectral features (MGC & BAP), F0 → Waveform generators → Speech
  • MGC: Mel-generalized cepstral coefficients [4]
  • BAP: band aperiodicity

[1] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda, et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
[3] H. Zen, et al. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
Page 5
INTRODUCTION
Topic of this work
❑ Better modules for the SPSS back-end?
    Linguistic features → Acoustic models → MGC & BAP, F0 → Waveform generators → Speech
  • Acoustic models: recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial network (GAN), …
  • Waveform generators: WORLD vocoder, WORLD + phase recovery, WaveNet-based vocoder, …
  • MGC: Mel-generalized cepstral coefficients
  • BAP: band aperiodicity
Page 7
MODELS & METHODS
(Diagram of the baseline pipeline: Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms)
Page 8
MODELS & METHODS
Acoustic models
❑ Baseline RNN
  • Sequence of linguistic features: $\{x_1, \cdots, x_t, \cdots\}$
  • Sequence of generated acoustic features: $\{\hat{o}_1, \cdots, \hat{o}_t, \cdots\}$
  (Diagram: the network maps inputs $x_1, \dots, x_5$ to outputs $\hat{o}_1, \dots, \hat{o}_5$.)
Page 9
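MODELS & METHODS
Acoustic models
❑ Baseline RNN
  • Neural network: $\mathcal{M}_t = \{\mu_t\}$, where $\mu_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$
  • Probabilistic model:
    $$p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$
  • Generation: $\hat{o}_t = \mu_t$
  (Diagram: inputs $x_1, \dots, x_5$ → $\mathcal{M}_1, \dots, \mathcal{M}_5$ → outputs $o_1, \dots, o_5$.)

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.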
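A minimal sketch of what the probabilistic view of the baseline RNN implies: with identity covariance, the negative log-likelihood is half the summed squared error plus a constant, so it is smallest when the generated output equals $\mu_t$. The function name and the random stand-in data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_nll_identity(o, mu):
    """Negative log of prod_t N(o_t; mu_t, I) for arrays of shape (T, D)."""
    T, D = o.shape
    squared_error = np.sum((o - mu) ** 2)
    return 0.5 * squared_error + 0.5 * T * D * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
mu = rng.normal(size=(5, 3))        # stand-in for the RNN outputs mu_1..mu_5
o = mu + rng.normal(size=(5, 3))    # stand-in for natural acoustic features

nll_at_mean = gaussian_nll_identity(mu, mu)  # generation with o_hat_t = mu_t
nll_natural = gaussian_nll_identity(o, mu)   # what training would minimize
```

The gap between the two values is exactly half the squared error between `o` and `mu`, which is why training this model by maximum likelihood is the same as minimizing mean squared error.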
Page 10
MODELS & METHODS
Acoustic models
❑ Baseline RNN
    $$p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$
  • Limitations
    1. Conditional independence (→ AR models)
    2. Maximum-likelihood training (→ GAN)
Page 11
MODELS & METHODS
Acoustic models
❑ Shallow AR (SAR)
    $$p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f(o_{t-K:t-1}), I)$$
  • Alternative interpretation: trainable filter + RNN
  (Diagram: as the baseline RNN, with additional links from $o_{t-K}, \dots, o_{t-1}$ to $o_t$; shown for K = 2.)

[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.
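The SAR recursion can be sketched as follows, assuming that $f$ is the linear form $\sum_{k=1}^{K} a_k \odot o_{t-k}$ used in the appendix of this deck and that generation takes the mean at every step. The function name `sar_generate` and the array layout are hypothetical.

```python
import numpy as np

def sar_generate(mu, a):
    """Mean-based SAR generation: o_hat[t] = mu[t] + sum_k a[k-1] * o_hat[t-k].

    mu: array (T, D), per-frame means from the RNN (assumed given).
    a:  array (K, D), per-dimension AR coefficients (assumed already trained).
    """
    T, _ = mu.shape
    K = a.shape[0]
    o_hat = np.zeros_like(mu, dtype=float)
    for t in range(T):
        o_hat[t] = mu[t]
        for k in range(1, K + 1):
            if t - k >= 0:          # frames before the start are treated as zero
                o_hat[t] += a[k - 1] * o_hat[t - k]
    return o_hat
```

Setting all coefficients in `a` to zero recovers the baseline RNN output, which is one way to read the "trainable filter + RNN" interpretation: the RNN supplies `mu` and the AR part filters it along time.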
Page 12
MODELS & METHODS
Acoustic models
❑ Deep AR (DAR)
    $$p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | o_{1:t-1}, x_{1:T}; \Theta)$$
  • Only for (quantized) F0 modeling
  (Diagram: as the baseline RNN, with each $o_t$ depending on all previous outputs $o_{1:t-1}$.)

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)
Page 13
MODELS & METHODS
Acoustic models
❑ GAN
  • GAN-based post-filter [8]
  (Diagram: the acoustic model maps linguistic features to generated acoustic features; a residual generator, driven by noise, produces a residual that is added (+) to the generated features; a discriminator outputs 0/1 to separate the result from natural acoustic features.)

[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, pages 4910–4914, 2017.
Page 14
MODELS & METHODS
Acoustic models
(Diagram: Linguistic features → acoustic models → waveform generator (WORLD) → speech waveforms. F0 is modeled by DAR; MGC is modeled by SAR or RNN, optionally followed by GAN.)
  • BAP is not shown
Page 15
MODELS & METHODS
Acoustic models
(Diagram: Linguistic features → acoustic models (DAR for F0; SAR or RNN, optionally with GAN, for MGC) → WORLD.)
  • Resulting systems with WORLD: SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo
  • BAP is not shown
Page 16
MODELS & METHODS
Waveform generators
❑ Deterministic approaches
  • WORLD [9]
    ◦ Binary voicing decision
    ◦ Minimum phase
  • A log domain pulse model (PML) [10]
    ◦ Source-filter model, additive in the log domain
    ◦ Binary noise mask
  • WORLD + phase recovery
    ◦ Phase recovery [11]: generated waveform → STFT amplitude → inverse STFT → "phase-recovered" waveform

[9] M. Morise, et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[10] G. Degottex, et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2):236–243, 1984.
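The recovery loop named on this slide (STFT amplitude → inverse STFT, after Griffin and Lim [11]) can be sketched as below. This is an illustration under stated assumptions, not the configuration used in the talk: the Hann window, FFT size, hop size, iteration count, and random initial phase are all choices made here, and the helper names `stft`, `istft`, and `griffin_lim` are hypothetical.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Least-squares overlap-add inverse of stft()."""
    win = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Keep the given STFT amplitude fixed and iteratively re-estimate the phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # assumed random start
    for _ in range(n_iter):
        x = istft(magnitude * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)
```

In the setting of the slide, `magnitude` would be the STFT amplitude of the vocoder-generated waveform, and the phase of that waveform would be a more natural starting point than random phase.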
Page 17
MODELS & METHODS
Waveform generators
❑ WaveNet-based vocoder [12, 13]
  • How to generate (search for) a waveform:
    1. Exploration: sampling
    2. Exploitation: picking the one-best

[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, pages 1118–1122, 2017.
Page 18
MODELS & METHODS
Waveform generators
❑ WaveNet-based vocoder
  (Figure: output probability over waveform levels at successive sampling points, shown for an unvoiced segment and for a voiced region.)
  • Sampling in unvoiced regions
  • Picking the one-best in voiced regions
  • Less distortion of harmonics ☛ appendix & paper
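The mixed strategy (one-best in voiced regions, sampling in unvoiced regions) can be sketched as a per-sample choice over the softmax output. This is a simplification: in a real WaveNet each distribution depends on the previously generated samples, whereas the hypothetical `generate_levels` below takes the distributions as given.

```python
import numpy as np

def pick_level(probs, voiced, rng):
    """Choose one waveform level from a categorical distribution."""
    if voiced:
        return int(np.argmax(probs))                # exploitation: one-best
    return int(rng.choice(len(probs), p=probs))     # exploration: random sampling

def generate_levels(prob_seq, voiced_flags, seed=0):
    """Apply pick_level to each time step (distributions assumed precomputed)."""
    rng = np.random.default_rng(seed)
    return [pick_level(p, v, rng) for p, v in zip(prob_seq, voiced_flags)]
```

Setting every flag to `False` gives pure random sampling, and setting every flag to `True` gives pure one-best generation.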
Page 19
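MODELS & METHODS
Acoustic models and waveform generators
(Diagram: Linguistic features → acoustic models (DAR for F0; SAR or RNN, optionally with GAN, for MGC) → waveform generators.)
  • WORLD (minimum phase): SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo
  • WORLD + phase recovery: SAR-Pr
  • PML: SAR-Pm
  • WaveNet: SAR-Wa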
Page 21
EXPERIMENTS
Configuration
❑ Data
    Corpus                       Size                             Note
    ATR Ximera F009 voice [14]   ~30,000 utterances, 48 hours     Sampling rate: 48 kHz; Japanese, neutral style, reading
  • Recording period: over 1 year
❑ Front-end: Open JTalk [15]
❑ Features
    Feature                                           Dimension
    Linguistic (phone identity, prosodic tags, ...)   ~390
    Acoustic: MGC                                     60
    Acoustic: BAP                                     25
    Acoustic: F0                                      1

[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184, 2004.
[15] HTS Working Group. The Japanese TTS System 'Open JTalk', 2015.
Page 22
EXPERIMENTS
Configuration
❑ Listening test
  • Quality: MOS (1-5)
  • Similarity: rated 1-5, with a natural reference at 48 kHz
  • Participants: 235 native Japanese listeners, 1500 sets of results
❑ Systems
  • Common network configuration (cf. the paper)
  • Without Δ, Δ² features or formant enhancement
  • Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz only (10 bits, μ-law)

    System    Acoustic model (MGC)   Waveform generator
    SAR-Wo    SAR                    WORLD
    SGA-Wo    SAR + GAN              WORLD
    RGA-Wo    RNN + GAN              WORLD
    RNN-Wo    RNN                    WORLD
    SAR-Pm    SAR                    PML
    SAR-Pr    SAR                    WORLD + phase recovery
    SAR-Wa    SAR                    WaveNet
    Abs-Wo    natural features       WORLD
    Abs-Pm    natural features       PML
    Natural   (natural speech)
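For the "10 bits, μ-law" waveform representation, a minimal sketch of the standard μ-law companding formula with 2^10 = 1024 levels is given below. The exact quantization details used for SAR-Wa are not stated on the slide, so the rounding rule and the function names are assumptions.

```python
import numpy as np

def mulaw_encode(x, bits=10):
    """Map waveform values in [-1, 1] to integer levels 0 .. 2**bits - 1."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress to [-1, 1]
    return np.round((y + 1) / 2 * mu).astype(np.int64)          # assumed rounding rule

def mulaw_decode(q, bits=10):
    """Inverse mapping from integer levels back to waveform values."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

The companding spends more of the 1024 levels on small amplitudes, so the round-trip error is smallest near zero and largest near full scale.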
Page 23
EXPERIMENTS
(Figure: the systems SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa, Abs-Wo, Abs-Pm, and Natural, with their acoustic models and waveform generators, marked by sampling rate: 16 kHz and 48 kHz.)
Page 24
EXPERIMENTS
Quality scores
(Figure: quality scores for SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa, Abs-Wo, Abs-Pm, and Natural.)
Page 25
EXPERIMENTS
Similarity scores
(Figure: similarity scores for SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo, SAR-Pm, SAR-Pr, SAR-Wa, Abs-Wo, Abs-Pm, and Natural.)
Page 27
SUMMARY
Plug and test
❑ SAR
  • Avoids the conditional independence assumption
  • Alleviates the over-smoothing effect ☛ appendix & paper
❑ WaveNet
  • Generation method: one-best generation + random sampling
  • Less distortion of harmonics ☛ appendix & paper
❑ WaveNet vocoder + SAR & DAR
  • Better than other combinations
  • Worse than natural speech
  • Close to vocoded speech
Page 28
FURTHER IMPROVEMENT?
Recent work ☛ appendix
❑ SAR
  • Special case of volume-preserving normalizing flow [16]
  • Extended SAR = time-variant filter + RNN
❑ WaveNet vocoders
  • Training based on generated conditional features
❑ Annotated linguistic features
  • Reduce the gap between natural & synthetic speech
Future work?
❑ Reduce the variability of recordings
❑ Use complex-valued neural models

[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.
Page 29
MESSAGE
Code, recipes, slides
❑ Acoustic models & WaveNet (CUDA/C++)
    https://github.com/TonyWangX/CURRENNT_MODIFIED
    https://github.com/TonyWangX/CURRENNT_Recipes
❑ Simple explanation of WaveNet and acoustic models
    http://tonywangx.github.io/slides.html
Contact: [email protected]
Page 30
Thank you for your attention
Q&A
Contact: [email protected]
Page 31
APPENDIX - WAVENET
  • Conditional network
  • WaveNet-backend
  • Learning curve
  • Generation method
  • Generation with other acoustic models
  • Training based on generated features
Tutorial slides: http://tonywangx.github.io/pdfs/wavenet.pdf
  • No cherry picking
  • All samples are based on generated acoustic features or automatically inferred linguistic features
Page 32
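APPENDIX - WAVENET
Structure
(Diagram of the WaveNet-based vocoder.)
  • Conditional feature network (clock rate: 200 Hz, frame shift = 5 ms): input $c_{1:N}$; Bi-LSTM, CNN, and linear layers; concatenation with F0; up-sampling to $l_t$
  • WaveNet blocks (clock rate: 16 kHz): $o_{t-1}$ → Linear → $e_{t-1}$ → WaveNet block 1, WaveNet block 2, …, WaveNet block M, conditioned on $l_t$, with block outputs $s^{(1)}_t, s^{(2)}_t, \dots, s^{(M)}_t$ (intermediate signals labeled $r^{(1)}_t$ and $e_t$)
  • Post-processing network: sum (+) of the block outputs → Linear + tanh → Linear + tanh → Softmax
  • Output distribution: $P(o_t | o_{t-R:t-1}, c_{1:N})$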
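The two clock rates imply an up-sampling factor of 16000 / 200 = 80 between frame-level conditional features and waveform samples. The sketch below assumes the simplest scheme, repeating each frame 80 times; the slide does not say which up-sampling method is used, and `upsample_features` is a hypothetical name.

```python
import numpy as np

def upsample_features(frames, frame_rate=200, sample_rate=16000):
    """Repeat each frame-level feature vector so there is one per waveform sample."""
    if sample_rate % frame_rate != 0:
        raise ValueError("sample_rate must be an integer multiple of frame_rate")
    factor = sample_rate // frame_rate          # 80 for 200 Hz -> 16 kHz
    return np.repeat(frames, factor, axis=0)    # shape (N * factor, D)
```

With a 5 ms frame shift, N frames therefore condition 80·N waveform samples.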
Page 33
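APPENDIX - WAVENET
Conditional network
❑ Architecture of the conditional network (input $c_{1:N}$: F0 and MGC; output $l_{1:N}$)
  • Trial 1: $c_{1:N}$ → Linear → $l_{1:N}$
  • Trial 2: $c_{1:N}$ → Bi-LSTM, CNN, Linear → $l_{1:N}$
  • Trial 3: $c_{1:N}$ → Bi-LSTM, CNN, Linear, concatenated with F0 → $l_{1:N}$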
Page 34
APPENDIX - WAVENET
Conditional network
❑ Experiments on WaveNet vocoder
  • Given generated MGC/F0
  (Audio samples 1-3: Natural; Trial 1: none; Trial 2: LSTM + CNN; Trial 3: LSTM + CNN + skip-F0)
Page 35
APPENDIX - WAVENET
WaveNet backend
❑ Architecture
  (Diagram: Text → Text analyzer → Textual features → WaveNet backend → Waveform, with an F0 model (DAR) between the textual features and the WaveNet backend.)
❑ WaveNet-backend only uses random sampling
  (Audio samples 1-5: Natural, WaveNet-vocoder, WaveNet-backend)
Page 36
APPENDIX - WAVENET
Learning curve
(Figure: negative log-likelihood vs. epoch on the training set and the validation set; one panel for the WaveNet vocoder, one for the WaveNet backend.)
  • We trained the WaveNet backend for more than 100 epochs
Page 37
APPENDIX - WAVENET
Generation method
(Figure: top panel, waveform (μ-law) over sampling points 8900-9900; bottom panel, probability over waveform levels (0-1024) at the same sampling points.)
Page 38
APPENDIX - WAVENET
Generation method
❑ Experiments on WaveNet vocoder
  (Figure: probability over waveform levels at sampling points 9300-9320 and at sampling points 8900-8920.)
Page 39
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
  • Given generated MGC/F0
  • Mix: one-best in voiced regions, random sampling in unvoiced regions
  • Random: random sampling all the time
  (Audio samples 1-3: Natural, Mix, Random)
Page 40
(Figure panels: Natural; voiced: one-best, unvoiced: sampling; voiced: sampling, unvoiced: sampling)
Page 41
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
  • Mix: one-best in voiced regions, random sampling in unvoiced regions
  • Mix2: one-best for 75% of voiced frames, random sampling elsewhere
  • …
  • Mix4: one-best for 25% of voiced frames, random sampling elsewhere
Page 42
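INVESTIGATION
Generation method
❑ One-best + random sampling
  (Audio samples 1-3: Natural, Mix, Mix2, Mix3, Mix4, Random)
  • Share of voiced frames generated by one-best, as labeled on the slide: Mix 100%, Mix2 75%, Mix3 50%, Mix4 20%, Random 0%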
Page 43
APPENDIX - WAVENET
Generation method
❑ One-best + random sampling
  (Audio samples 1-3: Natural; WaveNet backend: Mix, Random; WaveNet vocoder: Mix)
Page 44
(Figure panels: Natural; Mix; Random; WaveNet-backend Mix; WaveNet-backend Random)
Page 45
APPENDIX - WAVENET
WaveNet-vocoder + other acoustic models
(Audio samples 1-3: Natural, SAR+DAR, SGA+DAR, RGA+DAR, RNN+DAR, extended SAR+DAR)
Page 46
APPENDIX - WAVENET
WaveNet-vocoder: training using generated MGC
(Audio samples 1-3: Natural; WaveNet-Backend; WaveNet-Vocoder trained on natural MGC; WaveNet-Vocoder trained on generated MGC at epoch 35, epoch 45, and epoch 55)
Page 47
APPENDIX – ACOUSTIC MODELS
  • General comparison
  • SAR & DAR
  • SAR extension
More details: http://tonywangx.github.io/pdfs/talk.pdf
Page 48
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
  (Figure: MGC 1st and 5th dimensions vs. frame index, utterance ATR Ximera F009 AOZORAR 03372 T01; curves: Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA).)
Page 49
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
  (Figure: MGC 15th and 31st dimensions vs. frame index, utterance ATR Ximera F009 AOZORAR 03372 T01; curves: Natural, RNN, SAR.)
Page 50
APPENDIX – ACOUSTIC MODELS
General comparison
❑ Generated trajectories
  (Figure: MGC 15th and 31st dimensions vs. frame index, utterance ATR Ximera F009 AOZORAR 03372 T01; curves: Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA).)
Page 51
APPENDIX – ACOUSTIC MODELS
General comparison
❑ GV
  (Figure: utterance-level global variance (GV) of MGC vs. dimension index of MGC; curves: Natural, RNN, RNN+GAN (RGA), SAR, SAR+GAN (SGA). A second panel shows the modulation spectrum of the 31st MGC dimension.)
Page 52
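APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • SAR versus RMDN with a recurrent output layer [11]
  • Assume $o_t \in \mathbb{R}$ and $\Sigma_t = 1$, with a linear activation function
  (Diagram, two time steps: inputs $x_1, x_2$ → hidden layer $h_1, h_2$ → output layer $\mu_1, \mu_2$ → $o_1, o_2$. RMDN: recurrent weight $w_\mu$ from $\mu_1$ to $\mu_2$. SAR: AR weight $a$ from $o_1$ to $o_2$.)

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.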
Page 53
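APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • RMDN (recurrent weight $w_\mu$ from $\mu_1$ to $\mu_2$):
    $$\mu_1 = w^\top h_1 + b$$
    $$\mu_2 = w^\top h_2 + b + w_\mu \mu_1 = \tilde{\mu}_2 + w_\mu \mu_1$$
    $$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1)$$
  • SAR (AR weight $a$ from $o_1$ to $o_2$):
    $$\mu_1 = w^\top h_1 + b$$
    $$\mu_2 = w^\top h_2 + b$$
    $$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \mu_2 + a o_1, 1)$$

[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.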
Page 54
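APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • RMDN:
    $$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \tilde{\mu}_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2} (o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$
    with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \tilde{\mu}_2 + w_\mu \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  • SAR:
    $$p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\, \mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi} \exp\left(-\frac{1}{2} (o - \mu)^\top \Sigma^{-1} (o - \mu)\right)$$
    with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \mu_2 + a \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$
  • Dependency between $\mu_t$ or $o_t$?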
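The SAR identity in this toy example can be checked numerically: the product of the two univariate Gaussians should equal the bivariate Gaussian with mean $[\mu_1, \mu_2 + a\mu_1]^\top$ and covariance $[[1, a], [a, 1 + a^2]]$. The function names below are illustrative, and the input values are arbitrary.

```python
import numpy as np

def normal_pdf(x, mean):
    """Univariate Gaussian density with unit variance."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def sar_factorized(o1, o2, mu1, mu2, a):
    """N(o1; mu1, 1) * N(o2; mu2 + a * o1, 1)."""
    return normal_pdf(o1, mu1) * normal_pdf(o2, mu2 + a * o1)

def sar_joint(o1, o2, mu1, mu2, a):
    """Bivariate Gaussian with the mean and covariance given on the slide."""
    o = np.array([o1, o2])
    mean = np.array([mu1, mu2 + a * mu1])
    cov = np.array([[1.0, a], [a, 1.0 + a * a]])
    diff = o - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

p_factorized = sar_factorized(0.3, -1.2, 0.5, 0.1, 0.7)
p_joint = sar_joint(0.3, -1.2, 0.5, 0.1, 0.7)
```

The determinant of the covariance is $1 + a^2 - a^2 = 1$, which is why both forms share the normalizer $1/(2\pi)$; the off-diagonal entry $a$ is the dependency between $o_1$ and $o_2$ that the RMDN form lacks.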
Page 55
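APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ Toy SAR example
  • SAR on $o = [o_1, o_2]^\top$:
    $$p(o) = \mathcal{N}(o; \mu_o, \Sigma_o), \quad \mu_o = [\mu_1, \mu_2 + a \mu_1]^\top, \quad \Sigma_o = \begin{bmatrix} 1 & a \\ a & 1 + a^2 \end{bmatrix}$$
  • Linear transformation of the features:
    $$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix} \begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = A o$$
  • Equivalent RMDN-style model on $c$ (no recurrent link):
    $$p(o) = p(c) = \mathcal{N}(c; \mu_c, \Sigma_c), \quad \mu_c = [\mu_1, \mu_2]^\top, \quad \Sigma_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$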
Page 56
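APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • For $o_{1:T} \in \mathbb{R}^{D \times T}$, written row by row as
    $$o_{1:T} = \begin{bmatrix} o_{1:T,1} \\ \cdots \\ o_{1:T,D} \end{bmatrix}, \quad c_{1:T} = \begin{bmatrix} c_{1:T,1} \\ \cdots \\ c_{1:T,D} \end{bmatrix}$$
  • SAR is equivalent to:
    ◦ Training: $o_{1:T}$ → $A^{(1)}, \dots, A^{(D)}$ → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
    ◦ Generation: $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → $A^{(1)-1}, \dots, A^{(D)-1}$ → $\hat{o}_{1:T}$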
Page 57
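APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • For $o_{1:T} \in \mathbb{R}^{D \times T}$, written row by row as
    $$o_{1:T} = \begin{bmatrix} o_{1:T,1} \\ \cdots \\ o_{1:T,D} \end{bmatrix}, \quad c_{1:T} = \begin{bmatrix} c_{1:T,1} \\ \cdots \\ c_{1:T,D} \end{bmatrix}$$
  • SAR is equivalent to (filter 1, …, filter D, one per feature dimension):
    ◦ Training: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
    ◦ Generation: $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$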
Page 58
APPENDIX – ACOUSTIC MODELS
SAR & DAR
❑ SAR: invertible linear feature/model transformation
  • Only due to $1/A_d(z)$?
  • Due to $\{A_d(z), 1/A_d(z)\}$, less mismatch between $c_{1:T}$ and RMDN
  (Diagram: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $c_{1:T}$; $\hat{c}_{1:T}$ → filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$.)
  (Figure: magnitude (dB) vs. frequency bin (π/1024) for $H_1(z)$, $A_1(z)$, and $1/A_1(z)$.)
Page 59
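APPENDIX – ACOUSTIC MODELS
SAR extension: normalizing flow
❑ Basic idea [13, 14]
    $$c_{1:T} = f(o_{1:T}), \quad \hat{o}_{1:T} = f^{-1}(\hat{c}_{1:T})$$
    $$p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$
  • The Jacobian matrix must be simple
  • $f(\cdot)$ must be invertible
  (Diagram: training, $o_{1:T}$ → $f$ → $c_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$; generation, $x_{1:T}$ → $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ → $\hat{c}_{1:T}$ → $f^{-1}$ → $\hat{o}_{1:T}$.)

[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.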
Page 60
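APPENDIX – ACOUSTIC MODELS
SAR extension: normalizing flow
❑ Basic idea
    $$p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$
  • Simple for SAR:
    $$\left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right| = 1$$
  • SAR
    ◦ Transform: $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$
    ◦ De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$
  • AR flow, with $\mu_t = \mathrm{RNN}(o_{1:t-1})$
    ◦ Transform: $c_t = o_t - \sum_{k=1}^{K} f^{(k)}(o_{1:t-k}) \odot o_{t-k}$
    ◦ De-transform: $\hat{o}_t = c_t + \sum_{k=1}^{K} f^{(k)}(\hat{o}_{1:t-k}) \odot \hat{o}_{t-k}$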
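Why the Jacobian determinant equals 1 for SAR can be seen by writing the transform for one feature dimension as a matrix product $c = A o$: $A$ is lower triangular with ones on the diagonal. The sketch below builds that matrix for illustrative coefficients (the values 0.8 and -0.2 and the helper name `sar_matrix` are assumptions) and inverts it with a linear solve.

```python
import numpy as np

def sar_matrix(T, coeffs):
    """Matrix A with (A @ o)[t] = o[t] - sum_k coeffs[k-1] * o[t-k]."""
    A = np.eye(T)
    for k, a_k in enumerate(coeffs, start=1):
        A -= a_k * np.eye(T, k=-k)      # k-th sub-diagonal holds -a_k
    return A

rng = np.random.default_rng(0)
A = sar_matrix(6, [0.8, -0.2])          # K = 2, hypothetical coefficients
o = rng.normal(size=6)                  # one feature trajectory o_{1:T,d}
c = A @ o                               # transform
o_back = np.linalg.solve(A, c)          # de-transform
log_abs_det = np.linalg.slogdet(A)[1]   # log |det A|, zero for a unit-triangular A
```

Because the log-determinant is zero, the change of variables adds no term to the likelihood, which is the "volume-preserving" property; the AR flow keeps the same triangular structure but lets the coefficients depend on past outputs.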
Page 61
APPENDIX – ACOUSTIC MODELS
SAR extension
❑ SAR can be extended
More details: http://tonywangx.github.io/pdfs/talk.pdf
Page 62
APPENDIX – ACOUSTIC MODELS
DAR
❑ Same autoregressive principle
❑ But non-invertible and nonlinear
  (Figure: MOS score and utterance-level GV of F0 (Hz) for NAT, RNN, RMDN, SAR, and DAR.)

p-values
            NAT       DAR       SAR       RMDN      RNN
    NAT     -         <1e-30    <1e-30    <1e-30    <1e-30
    DAR     <1e-30    -         1.6e-28   6.3e-19   2.4e-30
    SAR     <1e-30    1.6e-28   -         0.015     0.949
    RMDN    <1e-30    6.3e-19   0.015     -         0.014
    RNN     <1e-30    2.4e-30   0.949     0.014     -

[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing. (Accepted)