Statistical Parametric Speech Synthesis
Heiga Zen, Google
June 9th, 2014
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR): Speech (continuous time series) → Text (discrete symbol sequence)
• Machine translation (MT): Text (discrete symbol sequence) → Text (discrete symbol sequence)
• Text-to-speech synthesis (TTS): Text (discrete symbol sequence) → Speech (continuous time series)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79
Speech production process
[Figure: speech production as modulation of a carrier wave by speech information — text (concept) determines frequency transfer characteristics (vocal tract), magnitude (start–end), and fundamental frequency; sound source: voiced pulse / unvoiced noise; air flow → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
Sentence segmentation / Word segmentation / Text normalization
Part-of-speech tagging / Pronunciation
Prosody prediction
Waveform generation
TEXT → Text analysis (NLP; frontend; discrete ⇒ discrete) → Speech synthesis (speech; backend; discrete ⇒ continuous) → SYNTHESIZED SPEECH
This talk focuses on backend
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
[Figure: unit selection — all candidate segments from the database, scored by target cost & concatenation cost]
• Concatenate actual instances of speech from a database
• Large data + automatic learning → High-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Diagram — Training: Speech → speech analysis → y; Text → text analysis → x; model training → λ. Synthesis: Text → text analysis → x; parameter generation → y; speech synthesis → Speech]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y)

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from text to be synthesized
  − Generate most probable y from λ̂

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Diagram: the SPSS training/synthesis pipeline from the previous slide]
• Large data + automatic training → Automatic voice building
• Parametric representation of speech → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
HMM-based speech synthesis [4]
[Diagram — Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → synthesis filter (driven by spectral parameters) → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process

[Figure repeated from slide 2: modulation of a carrier wave by speech information]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Diagram: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

    x(n) = h(n) ∗ e(n)
      ↓ Fourier transform
    X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source (excitation) part; H(e^jω): vocal tract resonance part
→ should be defined by HMM state-output vectors, e.g. mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
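The source-filter operation above can be sketched directly in code. A minimal, hypothetical NumPy illustration (toy filter taps; in a real system h(n) would be derived from mel-cepstrum, e.g. via an MLSA filter):

```python
import numpy as np

def excitation(n_samples, f0, fs, voiced):
    """Pulse train at f0 for voiced segments, white noise for unvoiced."""
    if voiced:
        e = np.zeros(n_samples)
        period = int(fs / f0)          # pitch period in samples
        e[::period] = 1.0              # impulse train
    else:
        e = np.random.randn(n_samples) # white noise
    return e

# Toy impulse response h(n); a stand-in for the vocal-tract filter
h = np.array([1.0, 0.8, 0.5, 0.2])
e = excitation(200, f0=100.0, fs=16000, voiced=True)
x = np.convolve(e, h)[:200]            # x(n) = h(n) * e(n)
```

Swapping the voiced pulse train for noise in unvoiced regions reproduces the simple excitation switching shown in the diagram.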
Parametric models of speech signal
Autoregressive (AR) model:

    H(z) = K / ( 1 − Σ_{m=1}^{M} c(m) z^{−m} )

Exponential (EX) model:

    H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

    ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
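Linear predictive analysis of the AR model can be sketched with the standard autocorrelation method. A minimal NumPy illustration (Levinson-Durbin recursion on a toy AR(1) signal; this is an assumption-laden sketch, not the slides' actual analysis code):

```python
import numpy as np

def lpc(x, order):
    """Linear predictive analysis via the autocorrelation method
    (Levinson-Durbin recursion). Returns AR coefficients a and
    residual energy err, with a[0] = 1."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Toy AR(1) signal x[t] = 0.9 x[t-1] + w[t]; LPC should recover a[1] close to -0.9
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
a, err = lpc(x, 1)
```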
Examples of speech spectra
[Figure: log magnitude (dB) vs frequency (0–5 kHz) spectra. (a) ML-based cepstral analysis  (b) Linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]

[Diagram repeated: training and synthesis parts of the HMM-based pipeline]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
State-output vector at frame t:

    o_t = [ c_t⊤, ∆c_t⊤, ∆²c_t⊤, p_t, δp_t, δ²p_t ]⊤

Spectrum part: mel-cepstral coefficients c_t and their ∆ and ∆∆
Excitation part: log F0 p_t and its ∆ and ∆∆
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Diagram: 3-state left-to-right HMM — initial probability π_1; transitions a_11, a_12, a_22, a_23, a_33; state-output distributions b_1(o_t), b_2(o_t), b_3(o_t). Observation sequence O = o_1, o_2, …, o_T aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
[Diagram: state-output vector o_t split into S = 4 streams — stream 1 (spectrum): o_t¹ = [c_t⊤, ∆c_t⊤, ∆²c_t⊤]⊤; streams 2–4 (excitation): o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t]

    b_j(o_t) = ∏_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
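The stream-weighted output probability above is straightforward to compute. A toy Python sketch (scalar Gaussian streams assumed for clarity; real streams are multivariate):

```python
import math

def gaussian_pdf(o, mu, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (o - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def multistream_output_prob(obs, dists, weights):
    """b_j(o_t) = prod over streams s of b_sj(o_t^s) ** w_s."""
    b = 1.0
    for o, (mu, var), w in zip(obs, dists, weights):
        b *= gaussian_pdf(o, mu, var) ** w
    return b

# 4 toy streams: one spectrum value and three excitation values, all weight 1
b = multistream_output_prob(
    obs=[0.1, 5.0, 0.0, 0.0],
    dists=[(0.0, 1.0), (5.0, 0.25), (0.0, 1.0), (0.0, 1.0)],
    weights=[1.0, 1.0, 1.0, 1.0],
)
```

Setting a stream weight w_s to 0 makes that stream contribute a factor of 1, i.e. it is ignored.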
Training process
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest), from data & labels
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated HMMs & duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• Preceding & succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes at preceding, current, succeeding syllable
• Accent & stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding & succeeding stressed/accented syllables in phrase
• # of syllables from previous / to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding & succeeding content words in current phrase
• # of words from previous / to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: binary decision tree over context questions (L=voice?, L="w"?, R=silence?, L="gy"?, yes/no branches) routing fullcontext states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) to leaf nodes of synthesized states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Diagram: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: trellis over t = 1 … T = 8 — occupying state i from t_0 to t_1]

Probability to enter state i at t_0, then leave at t_1 + 1:

    χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} ∏_{t=t_0}^{t_1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Diagram: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]

[Diagram repeated: training and synthesis parts of the HMM-based pipeline]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM λ and words w:

    ô = arg max_o p(o | w, λ)
      = arg max_o Σ_{∀q} p(o, q | w, λ)
      ≈ arg max_o max_q p(o, q | w, λ)
      = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

    q̂ = arg max_q P(q | w, λ)
    ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Diagram: the 3-state left-to-right HMM from slide 16; the best state sequence q̂ is fixed by predicted state durations, e.g. D = 4, 10, 5 frames]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state mean and variance trajectories]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

    o_t = [ c_t⊤, ∆c_t⊤ ]⊤,   ∆c_t = c_t − c_{t−1}

where c_t is M-dimensional and o_t is 2M-dimensional.

The relationship between static and dynamic features can be arranged in matrix form, o = Wc: the sparse band matrix W stacks, for each frame t, an identity row block selecting c_t and a difference row block (−I, I) computing ∆c_t = c_t − c_{t−1}:

    [ …, c_{t−1}⊤, ∆c_{t−1}⊤, c_t⊤, ∆c_t⊤, c_{t+1}⊤, ∆c_{t+1}⊤, … ]⊤ = W [ …, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, … ]⊤
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
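The window matrix W above can be constructed mechanically. A minimal NumPy sketch for 1-dimensional static features and a single backward-difference delta window (an assumption made for clarity; actual systems use wider windows and ∆² rows as well):

```python
import numpy as np

def delta_window_matrix(T):
    """W stacking, per frame t, a static row (selects c_t) and a delta
    row (computes Delta c_t = c_t - c_{t-1}); 1-dim static features."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0        # static row: c_t
        W[2 * t + 1, t] = 1.0    # delta row: +c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # ... - c_{t-1}
    return W

W = delta_window_matrix(3)
c = np.array([1.0, 2.0, 4.0])
o = W @ c   # [c_1, dc_1, c_2, dc_2, c_3, dc_3]
```

At the boundary t = 1 there is no c_0, so ∆c_1 degenerates to c_1 in this sketch.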
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

    ô = arg max_o p(o | q, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

    p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0,

    W⊤ Σ_q^{−1} W c = W⊤ Σ_q^{−1} μ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
[Figure: banded structure of the linear system W⊤Σ_q^{−1}W c = W⊤Σ_q^{−1}μ_q — the left-hand side is a banded matrix acting on the static feature sequence c_1, …, c_T; the right-hand side stacks the precision-weighted state means μ_{q_1}, …, μ_{q_T}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
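The generation step then reduces to solving this set of linear equations for the static trajectory c. A toy NumPy sketch (diagonal precisions and a simple backward-difference delta window assumed; real implementations exploit the banded structure, e.g. via a Cholesky solve, rather than a dense one):

```python
import numpy as np

T = 5
# Window matrix W: static + delta rows, 1-dim static features
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Step-wise state statistics for [c_t, delta c_t] (toy values)
mu = np.tile([1.0, 0.0], T)                # means
prec = np.diag(np.tile([1.0, 4.0], T))     # Sigma_q^{-1}, diagonal

A = W.T @ prec @ W                         # W' Sigma^{-1} W (banded)
b = W.T @ prec @ mu                        # W' Sigma^{-1} mu
c = np.linalg.solve(A, b)                  # smooth static trajectory
```

Because the delta rows penalize frame-to-frame jumps, the solved c is a smoothed version of the step-wise mean sequence, which is exactly the effect shown on the next slide.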
Generated speech parameter trajectory
[Figure: static & dynamic feature trajectories — state means and variances, with the generated static trajectory c passing smoothly through them]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Diagram repeated: training and synthesis parts of the HMM-based pipeline]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Diagram: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]

Generated excitation parameters (log F0 with V/UV) drive the excitation; generated spectral parameters (cepstrum, LSP) define the filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Adaptation (mimicking voice) [13]
[Diagram: adaptive training of an average-voice model over training speakers; adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style
→ Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Diagram: new HMM set λ′ interpolated among representative HMM sets λ1, …, λ4 with ratios I(λ′, λ1), …, I(λ′, λ4)]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Vocoding issues
• Simple pulse/noise excitation:
  Difficult to model mixes of V/UV sounds (e.g. voiced fricatives)

  [Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction:
  Harmonic effects often cause problems

  [Figure: power (dB) vs frequency (0–8 kHz) spectrum]

• Phase:
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision trees (yes/no context questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Diagram: feed-forward network mapping linguistic features x through hidden layers h1, h2, h3 to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Diagram: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature, for frames 1 … T; duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs HMM (α=1) vs DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶  | −9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶  | −5.1
12.7 (1)  | 36.6 (4 × 1024)      | 50.7    | < 10⁻⁶  | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNN-based acoustic modeling
[Figure: multimodal data samples in (y1, y2) vs the unimodal NN prediction]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Diagram: 1-dim, 2-mix MDN — output-layer activations z1 … z6 feed mixture weights w1(x), w2(x) (softmax activation), means μ1(x), μ2(x) (linear activation), and variances σ1(x), σ2(x) (exponential activation)]

Inputs of the activation functions:

    z_j = Σ_i h_i w_{ij}

    w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
    μ1(x) = z3                                μ2(x) = z4
    σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
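The three activation functions above can be sketched as a small post-processing step on the output layer. A hypothetical NumPy illustration for a 1-dimensional, n_mix-component MDN (layout of z is this sketch's convention):

```python
import numpy as np

def mdn_params(z, n_mix):
    """Map raw output-layer activations z (length 3*n_mix) to GMM
    parameters: softmax -> weights, linear -> means, exp -> std devs."""
    zw, zm, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())
    w /= w.sum()                 # softmax: weights sum to 1
    mu = zm                      # means: linear activation
    sigma = np.exp(zs)           # exponential keeps std devs positive
    return w, mu, sigma

# 1-dim, 2-mix example with all-zero activations
w, mu, sigma = mdn_params(np.zeros(6), n_mix=2)
```

The exponential on the variance outputs is what guarantees positivity without constrained optimization.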
DMDN-based SPSS [27]
[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs GMM parameters w_k(x_t), μ_k(x_t), σ_k(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM          1 mix: 3.537 ± 0.113   2 mix: 3.397 ± 0.115
DNN          4×1024: 3.635 ± 0.127   5×1024: 3.681 ± 0.109   6×1024: 3.652 ± 0.108   7×1024: 3.637 ± 0.129
DMDN (4×1024) 1 mix: 3.654 ± 0.117   2 mix: 3.796 ± 0.107   4 mix: 3.766 ± 0.113   8 mix: 3.805 ± 0.113   16 mix: 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: a feed-forward network (input x → output y) vs. an RNN whose hidden layer has recurrent connections, unrolled over inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → decays quickly over time
  − Prone to being overwritten by new information arriving from the inputs
→ Long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell c_t with an input gate (write), an output gate (read), and a forget gate (reset); the sigmoid gates are fed by x_t and h_{t−1} (with biases b_i, b_o, b_f), the cell input (bias b_c) and cell output use tanh, and the block emits h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
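The cell update sketched on this slide (write/read/reset gating around a linear memory cell) fits in a few lines of NumPy. This is a minimal illustration with random placeholder weights, not a trained model:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: gates i, f, o and candidate g computed from [x; h_prev]."""
    z = W @ np.concatenate([x, h_prev]) + b      # all four gate pre-activations
    n = len(c_prev)
    i = 1.0 / (1.0 + np.exp(-z[:n]))             # input gate (write)
    f = 1.0 / (1.0 + np.exp(-z[n:2 * n]))        # forget gate (reset)
    o = 1.0 / (1.0 + np.exp(-z[2 * n:3 * n]))    # output gate (read)
    g = np.tanh(z[3 * n:])                       # candidate cell update
    c = f * c_prev + i * g                       # linear memory cell
    h = o * np.tanh(c)                           # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_cell = 4, 3
W = rng.standard_normal((4 * n_cell, n_in + n_cell)) * 0.1
b = np.zeros(4 * n_cell)
h, c = np.zeros(n_cell), np.zeros(n_cell)
for t in range(5):                               # run a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape)  # (3,)
```

Because the cell state is updated additively (f·c + i·g) rather than squashed through a nonlinearity each step, gradients decay far more slowly than in the basic RNN above.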
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS — TEXT goes through text analysis/input feature extraction and duration prediction to give x₁, x₂, …, x_T; a recurrent network maps them to acoustic features y₁, y₂, …, y_T; parameter generation and waveform synthesis produce SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train/dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced
      DNN                LSTM               Stats
w/ ∆     w/o ∆      w/ ∆     w/o ∆      Neutral    z       p
50.0     14.2       –        –          35.8       12.0    < 10⁻¹⁰
–        –          30.2     15.6       54.2       5.1     < 10⁻⁶
15.8     –          34.0     –          50.2       −6.2    < 10⁻⁹
28.4     –          –        33.6       38.0       −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR):
  Speech (continuous time series) → Text (discrete symbol sequence)
• Machine translation (MT):
  Text (discrete symbol sequence) → Text (discrete symbol sequence)
• Text-to-speech synthesis (TTS):
  Text (discrete symbol sequence) → Speech (continuous time series)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79
Speech production process
[Figure: human speech production — a message (text/concept) modulates a carrier wave with speech information: the sound source (voiced: pulse train; unvoiced: noise) driven by air flow is shaped by the frequency transfer characteristics of the vocal tract; fundamental frequency, magnitude, and start–end timing carry the speech information in the resulting speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
TEXT
→ Text analysis (frontend, NLP; discrete ⇒ discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction
→ Speech synthesis (backend, Speech; discrete ⇒ continuous): waveform generation
→ SYNTHESIZED SPEECH

This talk focuses on the backend.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
[Figure: unit selection — candidate segments from the database, scored by target cost and concatenation cost]

• Concatenate actual instances of speech from a database
• Large data + automatic learning → high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS pipeline — training: text analysis and speech analysis extract x and y from text/speech pairs for model training of λ; synthesis: text analysis extracts x, parameter generation produces y, speech synthesis reconstructs the waveform]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):
      λ̂ = arg max_λ p(y | x, λ)
• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:
      ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: the same SPSS pipeline as above]

• Large data + automatic training → automatic voice building
• Parametric representation of speech → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and used with labels to train context-dependent HMMs & state duration models; synthesis part: text analysis converts TEXT to labels, spectral and excitation parameters are generated from the HMMs, and excitation generation plus the synthesis filter produce the SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training and synthesis parts, as shown earlier]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79
Speech production process
[Figure: human speech production, repeated from earlier — sound source (voiced: pulse; unvoiced: noise) shaped by the vocal tract's frequency transfer characteristics]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Figure: source-filter model — excitation e(n) (pulse train for voiced, white noise for unvoiced) drives a linear time-invariant system h(n) to produce speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part.
H(e^{jω}) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
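The time-/frequency-domain duality above is easy to verify numerically. A toy sketch (the F0, frame length, and decaying impulse response are arbitrary illustrative values, not a real vocal-tract filter):

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
f0 = 100.0                       # assumed fundamental frequency (Hz)
n = int(0.05 * fs)               # 50 ms of excitation (800 samples)

# Source: pulse train for voiced speech (white noise would model unvoiced)
e = np.zeros(n)
e[::int(fs / f0)] = 1.0

# Filter: toy exponentially decaying impulse response standing in for h(n)
h = 0.9 ** np.arange(200)

# x(n) = h(n) * e(n): time-domain (linear) convolution, length 800+200-1 = 999
x = np.convolve(e, h)

# Frequency domain: X = H . E (FFT length >= len(x), so circular == linear)
nfft = 1024
X = np.fft.rfft(x, nfft)
H = np.fft.rfft(h, nfft)
E = np.fft.rfft(e, nfft)
print(np.allclose(X, H * E))     # → True
```

The same identity is what lets the vocoder work per-frame on spectral parameters instead of raw waveforms.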
Parametric models of speech signal
Autoregressive (AR) model:

  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:

  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
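A quick numerical check on the EX model above: its log-magnitude spectrum is linear in the cepstral coefficients (a cosine series). The coefficients here are toy values, not ML estimates:

```python
import numpy as np

M = 10
c = 0.5 ** np.arange(M + 1)                  # toy cepstral coefficients c(0..M)
omega = np.linspace(0, np.pi, 256)

# EX model: H(e^{jw}) = exp( sum_m c(m) e^{-jwm} )
H = np.exp(sum(c[m] * np.exp(-1j * omega * m) for m in range(M + 1)))

# log|H| equals the cosine series of the cepstrum
log_mag = np.log(np.abs(H))
ref = sum(c[m] * np.cos(omega * m) for m in range(M + 1))
print(np.allclose(log_mag, ref))             # → True
```

This linearity is why the cepstrum is a convenient HMM state-output representation of the spectral envelope.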
Examples of speech spectra
[Figure: two log-magnitude spectra (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training and synthesis parts, as shown earlier]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
o_t = [ c_t⊤  ∆c_t⊤  ∆²c_t⊤  p_t  δp_t  δ²p_t ]⊤

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: a 3-state left-to-right HMM — initial probability π₁, transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t); observation sequence O = o₁, o₂, …, o_T aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
The state-output vector is split into streams:

o_t = [ o_t^{1⊤}, o_t^{2⊤}, o_t^{3⊤}, o_t^{4⊤} ]⊤

Stream 1 (spectrum): c_t, ∆c_t, ∆²c_t
Streams 2–4 (excitation): p_t, δp_t, δ²p_t

Each stream s has its own output distribution b_sj, combined as

b_j(o_t) = ∏_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}

where w_s is the stream weight.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
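In the log domain the stream-weighted product above is just a weighted sum. A tiny sketch (the per-stream log-likelihoods and weights are made-up numbers, purely for illustration):

```python
import numpy as np

def multistream_logprob(stream_logprobs, stream_weights):
    """log b_j(o_t) = sum_s w_s * log b_sj(o_t^s)."""
    return float(np.dot(stream_weights, stream_logprobs))

# 4 streams: spectrum, then the three F0 streams
logps = np.array([-12.3, -1.1, -0.9, -1.4])  # hypothetical per-stream log-likelihoods
ws = np.ones(4)                               # equal stream weights
total = multistream_logprob(logps, ws)
print(total)
```

Raising a stream's weight above 1 sharpens its influence on the state-output score; setting it to 0 removes the stream entirely.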
Training process
data & labels
1. Compute variance floor (HCompV)
2. Initialize monophone (context-independent, CI) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to fullcontext (context-dependent, CD) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
→ Estimated HMMs
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes in {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

→ Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree-based state clustering — context questions such as L=voice?, R=silence?, L="w"?, L="gy"? split fullcontext states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into leaf nodes; synthesized states are taken from the leaves]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: occupancy of state i from t₀ to t₁ on a trellis over t = 1 … T = 8]

Probability to enter state i at t₀ and leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} · ∏_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: the HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training and synthesis parts, as shown earlier]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: the 3-state left-to-right HMM aligned with observation sequence O = o₁ … o_T; the best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3 determines the state durations D]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state Gaussians (mean and variance) and the resulting generated trajectory]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [ c_t⊤, ∆c_t⊤ ]⊤,  ∆c_t = c_t − c_{t−1}

(static c_t: M dimensions; o_t: 2M dimensions)

The relationship between static and dynamic features can be arranged in matrix form, o = Wc, where each frame contributes two block rows to W:

  ⋮             ⋮   ⋮   ⋮   ⋮          ⋮
  c_{t−1}      ⋯  0   I   0   0  ⋯    c_{t−2}
  ∆c_{t−1}     ⋯ −I   I   0   0  ⋯    c_{t−1}
  c_t       =  ⋯  0   0   I   0  ⋯    c_t
  ∆c_t         ⋯  0  −I   I   0  ⋯    c_{t+1}
  c_{t+1}      ⋯  0   0   0   I  ⋯      ⋮
  ∆c_{t+1}     ⋯  0   0  −I   I  ⋯
  ⋮             ⋮   ⋮   ⋮   ⋮

so W is a sparse band matrix of I/−I blocks.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
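The band matrix W above, for a 1-dimensional static feature with the slide's delta definition ∆c_t = c_t − c_{t−1} (taking the frame before t = 0 as zero at the boundary), can be built as:

```python
import numpy as np

def build_W(T):
    """W maps static features c (length T) to o = [c_t, delta c_t] per frame."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row:  c_t
        W[2 * t + 1, t] = 1.0             # delta row:   c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

W = build_W(4)
c = np.array([1.0, 2.0, 4.0, 7.0])
o = W @ c
print(o)  # statics 1, 2, 4, 7 interleaved with deltas 1, 1, 2, 3
```

With vector-valued c the scalar 1/−1 entries become the I/−I blocks shown above; higher-order windows (∆²) simply add more block rows per frame.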
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W⊤ Σ_q̂^{−1} W c = W⊤ Σ_q̂^{−1} μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
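Plugging a W of the static+delta form into this equation gives a small banded linear solve. A toy 1-D example with unit variances and step-wise static means, showing that the generated trajectory comes out smooth rather than step-wise:

```python
import numpy as np

T = 5
# W for static + delta features, delta c_t = c_t - c_{t-1} (c before t=0 is 0)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.zeros(2 * T)
mu[0::2] = [0.0, 0.0, 1.0, 1.0, 1.0]   # step-wise static means
mu[1::2] = 0.0                          # delta means pull differences toward 0
prec = np.eye(2 * T)                    # Sigma_q^{-1} = I for simplicity

# W' Sigma^{-1} W c = W' Sigma^{-1} mu
c = np.linalg.solve(W.T @ prec @ W, W.T @ prec @ mu)
print(c)  # smooth ramp ~[0.09, 0.27, 0.72, 0.89, 0.94], not a 0/1 step
```

In practice T is thousands of frames and Σ_q is block-diagonal, so the system is solved with a banded Cholesky factorization rather than a dense solve.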
Speech parameter generation algorithm [9]
W⊤ Σ_q^{−1} W c = W⊤ Σ_q^{−1} μ_q

[Figure: the banded structure of the system — W⊤Σ_q^{−1}W is a band matrix of 1/−1 blocks acting on c₁ … c_T, and the right-hand side stacks μ_{q₁} … μ_{qT} weighted by W⊤Σ_q^{−1}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated speech parameter trajectory ĉ — a smooth static trajectory consistent with the static and dynamic means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training and synthesis parts, as shown earlier]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: the source-filter model at synthesis time — generated excitation parameters (log F0 with V/UV) produce the excitation e(n) (pulse train / white noise), generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n), and the output is synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new voice λ′ interpolated among HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
  difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]
• Phase:
  important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

→ Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: a decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: a DNN with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: DNN-based SPSS framework — text analysis extracts frame-level input features (binary & numeric linguistic features, plus duration and frame-position features) for frames 1 … T; the DNN (input layer, hidden layers, output layer) maps them to statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction, parameter generation, and waveform synthesis complete the TEXT → SPEECH pipeline]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
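At synthesis time, the per-frame mapping in this framework is a plain feed-forward pass. A minimal NumPy sketch using the sigmoid hidden layers and linear output layer from the experimental setup below; the layer widths (449 inputs as in that setup, a hypothetical 127-dimensional output) and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [449, 1024, 1024, 1024, 1024, 127]  # illustrative in/hidden/out widths
Ws = [rng.standard_normal((m, n)) * 0.05 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Map one frame's linguistic features to acoustic feature statistics."""
    h = x
    for Wl, bl in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(Wl @ h + bl)))  # sigmoid hidden layers
    return Ws[-1] @ h + bs[-1]                    # linear output layer

y = forward(rng.standard_normal(sizes[0]))        # one frame's output
print(y.shape)  # (127,)
```

Running this for every frame yields the per-frame means fed to the parameter generation step, exactly where the HMM's clustered Gaussians used to sit.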
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? ... no:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training/test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5-th mel-cepstrum trajectories over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters:

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)    Neutral    p value     z value
15.8 (16)   38.5 (4 × 256)          45.7       < 10⁻⁶      −9.9
16.1 (4)    27.2 (4 × 512)          56.8       < 10⁻⁶      −5.1
12.7 (1)    36.6 (4 × 1024)         50.7       < 10⁻⁶      −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: one-to-many data samples in the (y₁, y₂) plane, with the NN prediction falling between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

⇒ Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
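The conditional-mean point is easy to reproduce numerically: fit a least-squares (MSE) predictor to a one-to-many target where the same input maps to +1 or −1 with equal probability, and the fit lands on 0, matching neither mode:

```python
import numpy as np

# Same input for every sample, two equally likely targets: +1 and -1
X = np.ones((200, 1))
y = np.array([1.0, -1.0] * 100)

# Closed-form least-squares fit: what an MSE-trained NN output approximates
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w[0])  # ~0.0: the conditional mean, far from either mode
```

A mixture density output layer avoids this collapse by modeling both modes explicitly instead of averaging them.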
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN — the output layer yields w1(x1), w2(x1), μ1(x1), μ2(x1), σ1(x1), σ2(x1), which define a mixture density over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions (1-dim, 2-mix MDN):

  z_j = Σ_{i=1..4} h_i w_ij

  w_1(x) = exp(z_1) / Σ_{m=1..2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1..2} exp(z_m)
  μ_1(x) = z_3                               μ_2(x) = z_4
  σ_1(x) = exp(z_5)                          σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
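The 1-dim, 2-mix output layer above can be sketched as a few lines of code (the raw outputs z₁…z₆ below are illustrative values, not a trained network):

```python
import math

def mdn_params(z):
    """Map 6 raw network outputs to the parameters of a 1-dim, 2-mix MDN
    output layer: weights via softmax, means via identity, std devs via exp."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)   # mixture weights
    mu = (z3, z4)                                      # means (linear)
    sigma = (math.exp(z5), math.exp(z6))               # std devs (positive)
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a 2-component Gaussian mixture."""
    return sum(
        wk * math.exp(-0.5 * ((y - mk) / sk) ** 2) / (sk * math.sqrt(2 * math.pi))
        for wk, mk, sk in zip(w, mu, sigma)
    )

# Equal weights, unit variances, means at -1 and +1: a bimodal density,
# exactly the case a single conditional mean cannot represent.
w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
```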
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS — text analysis / input feature extraction and duration prediction turn TEXT into frame-level inputs x_1, x_2, …, x_T; a deep MDN outputs per-frame mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t) over y; parameter generation and waveform synthesis produce SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

  DNN            4–7 hidden layers, 1024 units/hidden layer;
                 ReLU (hidden), linear (output)
  DMDN           4 hidden layers, 1024 units/hidden layer;
                 ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization   AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced
HMM          1 mix      3.537 ± 0.113
             2 mix      3.397 ± 0.115
DNN          4 × 1024   3.635 ± 0.127
             5 × 1024   3.681 ± 0.109
             6 × 1024   3.652 ± 0.108
             7 × 1024   3.637 ± 0.129
DMDN         1 mix      3.654 ± 0.117
(4 × 1024)   2 mix      3.796 ± 0.107
             4 mix      3.766 ± 0.113
             8 mix      3.805 ± 0.113
             16 mix     3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled in time — recurrent connections carry the hidden state across x_{t−1}, x_t, x_{t+1} to outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through the recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell c_t with multiplicative input ("write"), forget ("reset"), and output ("read") gates; sigm/tanh nonlinearities applied to the inputs x_t and h_{t−1}, producing the block output h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
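The gating scheme in the figure can be sketched for a single-unit cell as follows (the scalar weights in `p` are illustrative placeholders, not trained values):

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell following the slide:
    gates i/f/o use sigmoid; the cell input and output use tanh."""
    i = sigm(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])   # input gate  (write)
    f = sigm(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])   # forget gate (reset)
    o = sigm(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])   # output gate (read)
    c = f * c_prev + i * math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])
    h = o * math.tanh(c)                                   # block output
    return h, c

# Run a tiny input sequence through the cell with placeholder weights.
p = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, p)
```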
LSTM-based SPSS [33, 34]

[Figure: LSTM-based SPSS — text analysis / input feature extraction and duration prediction turn TEXT into frame-level inputs x_1, x_2, …, x_T; an LSTM maps them to outputs y_1, y_2, …, y_T; parameter generation and waveform synthesis produce SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database           US English female speaker
Train / dev data   34,632 & 100 sentences
Sampling rate      16 kHz
Analysis window    25-ms width, 5-ms shift
Linguistic         DNN: 449
features           LSTM: 289
Acoustic           0–39 mel-cepstrum, log F0,
features           5-band aperiodicity (Δ, Δ²)
DNN                4 hidden layers, 1024 units/hidden layer;
                   ReLU (hidden), linear (output);
                   AdaDec [29] on GPU
LSTM               1 forward LSTM layer;
                   256 units, 128 projection;
                   asynchronous SGD on CPUs [35]
Postprocessing     Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced
       DNN                LSTM
w/ Δ   w/o Δ       w/ Δ   w/o Δ      Neutral      z         p
50.0   14.2        –      –          35.8       12.0    < 10⁻¹⁰
–      –           30.2   15.6       54.2        5.1    < 10⁻⁶
15.8   –           34.0   –          50.2       −6.2    < 10⁻⁹
28.4   –           –      33.6       38.0       −1.5      0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR):
  speech (continuous time series) → text (discrete symbol sequence)
• Machine translation (MT):
  text (discrete symbol sequence) → text (discrete symbol sequence)
• Text-to-speech synthesis (TTS):
  text (discrete symbol sequence) → speech (continuous time series)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79
Speech production process
[Figure: human speech production as modulation of a carrier wave by speech information — text (concept) controls a sound source (voiced: pulse train, unvoiced: noise) driven by air flow, with fundamental frequency, magnitude, and start–end timing; the frequency transfer characteristics of the vocal tract shape the result into speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
TEXT
  ↓ Text analysis (frontend, NLP; discrete ⇒ discrete):
    sentence segmentation, word segmentation, text normalization,
    part-of-speech tagging, pronunciation, prosody prediction
  ↓ Speech synthesis (backend, speech; discrete ⇒ continuous):
    waveform generation
SYNTHESIZED SPEECH

This talk focuses on the backend.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
[Figure: unit selection — candidate segments from the database evaluated by target and concatenation costs]
• Concatenate actual instances of speech from a database
• Large data + automatic learning
  → high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS overview — training: speech analysis → acoustic features y, text analysis → linguistic features x, model training → λ; synthesis: text analysis → x, parameter generation → y, speech synthesis → speech]
• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):
      λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:
      ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS overview — training and synthesis pipeline as in the previous slide]
• Large data + automatic training
  → automatic voice building
• Parametric representation of speech
  → flexible to change its voice characteristics

Hidden Markov model (HMM) as the acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis — training part: spectral & excitation parameters are extracted from the SPEECH DATABASE and, with labels, used to train context-dependent HMMs & state duration models; synthesis part: text analysis turns TEXT into labels, parameters are generated from the HMMs, and excitation generation plus a synthesis filter produce SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Source-filter model
[Figure: source-filter model — a pulse train / white noise excitation e(n) drives a linear time-invariant system h(n), giving speech x(n)]

  x(n) = h(n) ∗ e(n)
    ↓ Fourier transform
  X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source excitation part; H(e^jω): vocal tract resonance part, which should be defined by HMM state-output vectors (e.g., mel-cepstrum, line spectral pairs)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
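The relation x(n) = h(n) ∗ e(n) can be checked numerically with a toy excitation and impulse response (both illustrative):

```python
def convolve(h, e):
    """Discrete convolution x(n) = h(n) * e(n): the source-filter model
    passes an excitation e(n) through an LTI system h(n)."""
    x = [0.0] * (len(h) + len(e) - 1)
    for n, hn in enumerate(h):
        for m, em in enumerate(e):
            x[n + m] += hn * em
    return x

# Pulse-train excitation (period 4) through a toy decaying impulse response.
e = [1.0 if n % 4 == 0 else 0.0 for n in range(8)]
h = [1.0, 0.5, 0.25]
x = convolve(h, e)
```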
Parametric models of speech signal
Autoregressive (AR) model:
  H(z) = K / (1 − Σ_{m=0..M} c(m) z^{−m})

Exponential (EX) model:
  H(z) = exp Σ_{m=0..M} c(m) z^{−m}

Estimate model parameters based on ML:
  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
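For the EX model with a real cepstrum c(m), the log-magnitude spectrum follows directly: log|H(e^jω)| = Σ_m c(m) cos(ωm). A small sketch with illustrative coefficients:

```python
import math

def log_magnitude(c, omega):
    """EX model: H(e^jw) = exp sum_m c(m) e^{-jwm}; for a real cepstrum
    the log-magnitude spectrum is sum_m c(m) cos(wm)."""
    return sum(cm * math.cos(omega * m) for m, cm in enumerate(c))

c = [1.0, 0.5, 0.25]               # illustrative cepstral coefficients
lm_dc = log_magnitude(c, 0.0)      # log|H| at omega = 0
lm_ny = log_magnitude(c, math.pi)  # log|H| at the Nyquist frequency
```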
Examples of speech spectra
[Figure: log-magnitude spectra (dB) over 0–5 kHz of the same frame, obtained by (a) ML-based cepstral analysis and (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis pipeline — training and synthesis parts as in the earlier overview]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
[Figure: structure of the state-output (observation) vector o_t]

• Spectrum part: mel-cepstral coefficients c_t, Δ mel-cepstral coefficients Δc_t, ΔΔ mel-cepstral coefficients Δ²c_t
• Excitation part: log F0 p_t, Δ log F0 δp_t, ΔΔ log F0 δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM — initial probability π_1, transitions a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1 o_2 … o_T aligned to state sequence Q = 1 1 1 1 2 2 3 3 …]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
[Figure: multi-stream observation vector o_t — stream 1: c_t, Δc_t, Δ²c_t (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation)]

  b_j(o_t) = ∏_{s=1..S} ( b_sj(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
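In the log domain the weighted product over streams becomes a weighted sum. A sketch with illustrative single-Gaussian streams (values are placeholders, not a real model):

```python
import math

def gaussian_logpdf(x, mu, var):
    """Log density of a 1-dim Gaussian, standing in for a per-stream b_sj."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def multi_stream_logprob(obs, means, vars_, weights):
    """log b_j(o_t) = sum_s w_s * log b_sj(o_t^s): the state-output
    probability of a multi-stream HMM is a weighted product over streams
    (e.g., spectrum and excitation), i.e., a weighted sum of log densities."""
    return sum(w * gaussian_logpdf(o, m, v)
               for o, m, v, w in zip(obs, means, vars_, weights))

# Two streams with unit variance, equal stream weights.
lp = multi_stream_logprob([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```

Setting a stream weight to zero removes that stream's contribution entirely, which is how stream weights w_s trade off spectrum against excitation.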
Training process
data & labels
  ↓
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie the parameter-tying structure (HHEd UT)
   → estimated HMMs
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB)
   → estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• Preceding/succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding/current/succeeding syllable
• Accent/stress of preceding/current/succeeding syllable
• Position of current syllable in current word
• # of preceding/succeeding stressed/accented syllables in phrase
• # of syllables from previous/to next stressed/accented syllable
• Guess at part of speech of preceding/current/succeeding word
• # of syllables in preceding/current/succeeding word
• Position of current word in current phrase
• # of preceding/succeeding content words in current phrase
• # of words from previous/to next content word
• # of syllables in preceding/current/succeeding phrase

→ Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree-based state clustering — yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"? route fullcontext states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) to leaf nodes of synthesized states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: occupancy of state i between t0 and t1 on a time axis t = 1 … T = 8]

Probability to enter state i at t0 and leave at t1 + 1:

  χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_ii^{t1−t0} · ∏_{t=t0..t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models from these occupancy statistics
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: an HMM with stream-dependent decision trees — trees for mel-cepstrum, trees for F0, and a tree for the state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis pipeline — training and synthesis parts as in the earlier overview]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

  ô = arg max_o p(o | w, λ)
    = arg max_o Σ_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with observation sequence O = o_1 … o_T, best state sequence Q̂ = 1 1 1 1 2 2 3 3 …, and state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state means and variances over time]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

  o_t = [c_t⊤, Δc_t⊤]⊤,   Δc_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form, o = Wc:

  [   ⋮     ]     [ ⋯                 ]   [   ⋮     ]
  [ c_{t−1}  ]    [ ⋯  0   I  0  0  ⋯ ]   [ c_{t−2}  ]
  [ Δc_{t−1} ]    [ ⋯ −I   I  0  0  ⋯ ]   [ c_{t−1}  ]
  [ c_t      ]  = [ ⋯  0   0  I  0  ⋯ ]   [ c_t      ]
  [ Δc_t     ]    [ ⋯  0  −I  I  0  ⋯ ]   [ c_{t+1}  ]
  [ c_{t+1}  ]    [ ⋯  0   0  0  I  ⋯ ]   [   ⋮     ]
  [ Δc_{t+1} ]    [ ⋯  0   0 −I  I  ⋯ ]
  [   ⋮     ]     [ ⋯                 ]
       o                   W                   c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
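Under the stated assumptions (1-dim features, a single delta window Δc_t = c_t − c_{t−1}, and c_0 = 0 at the boundary), the window matrix W can be sketched and checked as:

```python
def build_w(T):
    """Window matrix W mapping static features c (length T) to
    o = [c_1, dc_1, ..., c_T, dc_T] with dc_t = c_t - c_{t-1}
    (c_0 taken as 0), for 1-dimensional features."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0            # static row: c_t
        W[2 * t + 1][t] = 1.0        # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1][t - 1] = -1.0
    return W

def matvec(W, c):
    return [sum(w * x for w, x in zip(row, c)) for row in W]

c = [1.0, 3.0, 2.0]
o = matvec(build_w(3), c)   # [c1, dc1, c2, dc2, c3, dc3]
```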
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
  ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

  p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

  W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
[Figure: the system W⊤Σ_q̂⁻¹W c = W⊤Σ_q̂⁻¹μ_q̂ written out element-wise — W⊤Σ⁻¹W is a banded positive-definite matrix acting on c_1 … c_T, with the right-hand side assembled from μ_q1 … μ_qT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
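A minimal sketch of solving this system, under the same assumptions as above (1-dim features, static + delta with Δc_t = c_t − c_{t−1} and c_0 = 0, diagonal Σ; the step-shaped means below are illustrative). numpy is used for the linear solve:

```python
import numpy as np

def mlpg(mu, var, T):
    """Solve W' inv(S) W c = W' inv(S) mu for the static trajectory c.
    mu/var: length-2T sequences ordered [static_1, delta_1, static_2, ...]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static window
        W[2 * t + 1, t] = 1.0        # delta window: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / np.asarray(var, dtype=float))   # inv(S), diagonal
    A = W.T @ P @ W                                   # banded, pos. definite
    b = W.T @ P @ np.asarray(mu, dtype=float)
    return np.linalg.solve(A, b)

# Step-wise static means [0, 0, 1, 1] with zero delta means and unit
# variances: the delta constraints smooth the step into a rising curve.
mu = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
c = mlpg(mu, [1.0] * 8, T=4)
```

Note how the discontinuity in the static means is turned into a monotone trajectory — the smoothing effect shown on the next slide.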
Generated speech parameter trajectory
[Figure: generated speech parameter trajectory — static and dynamic means/variances per state, with the smooth static trajectory c threading through them]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis pipeline — training and synthesis parts as in the earlier overview]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: waveform reconstruction — generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model on training speakers, followed by adaptation to target speakers]
• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only small data from the target speaker / speaking style
  → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ_k), yielding a new voice λ′]
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
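A hypothetical sketch of the interpolation idea, reduced to mean vectors (the ratios and means below are illustrative; real systems interpolate entire model sets):

```python
def interpolate_means(means, ratios):
    """New-voice mean = sum_k I(lambda', lambda_k) * mean_k, with the
    interpolation ratios summing to 1. Illustrative only: interpolating
    full HMM sets also involves variances and weights."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    dim = len(means[0])
    return [sum(r * m[i] for r, m in zip(ratios, means)) for i in range(dim)]

# Two representative "voices" mixed 25% / 75%.
mu = interpolate_means([[0.0, 2.0], [2.0, 4.0]], [0.25, 0.75])
```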
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB), 0–8 kHz, showing harmonic ripple]

• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
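The "decreases exponentially" point is easy to verify: with self-transition probability a_ii, the implicit state-duration distribution of an HMM is geometric, P(d) = a_ii^(d−1) · (1 − a_ii):

```python
def hmm_duration_pmf(a_ii, d):
    """Implicit HMM state-duration distribution: the probability of
    staying exactly d frames in a state decays geometrically with d,
    a poor fit for real phone durations."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

# Duration probabilities for d = 1..5 with a_ii = 0.8 (illustrative).
pmf = [hmm_duration_pmf(0.8, d) for d in range(1, 6)]
```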
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
[Figure: running spectra (0–8 kHz) of natural vs. generated speech — the generated spectra are visibly oversmoothed]
• Why?
  − Details of the spectral (formant) structure disappear
  − A better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, far more than in ASR (typically 3–5)

→ Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree-clustered HMMs over the acoustic space — yes/no context questions lead to GMM state-output distributions]
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
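A minimal sketch of such a mapping network — sigmoid hidden layers and a linear output layer, as in the experimental setup later in the talk (the weights below are illustrative placeholders, not trained values):

```python
import math

def mlp_forward(x, layers):
    """Forward pass of a DNN acoustic model: sigmoid hidden layers and a
    linear output layer, mapping linguistic features x to acoustic
    features y."""
    h = x
    for k, (W, b) in enumerate(layers):
        z = [sum(wij * hj for wij, hj in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        last = (k == len(layers) - 1)
        h = z if last else [1.0 / (1.0 + math.exp(-v)) for v in z]
    return h

# 2 linguistic inputs -> 2 hidden units -> 1 acoustic output.
layers = [([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),
          ([[1.0, -1.0]], [0.1])]
y = mlp_forward([1.0, 0.0], layers)
```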
Framework
[Figure: DNN-based SPSS framework — text analysis / input feature extraction turn TEXT into input features (binary & numeric linguistic features, duration and frame-position features) for frames 1 … T; the DNN (input, hidden, output layers) outputs statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction, parameter generation, and waveform synthesis produce SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? → No:
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database           US English female speaker
Training / test    33,000 & 173 sentences
Sampling rate      16 kHz
Analysis window    25-ms width, 5-ms shift
Linguistic         11 categorical features,
features           25 numeric features
Acoustic           0–39 mel-cepstrum, log F0,
features           5-band aperiodicity, Δ, Δ²
HMM topology       5-state left-to-right HSMM [21],
                   MSD F0 [22], MDL [23]
DNN architecture   1–5 layers, 256/512/1024/2048 units/layer;
                   sigmoid, continuous F0 [24]
Postprocessing     Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum trajectories over frames 0–500, comparing natural speech, HMM (α = 1), and DNN (4 × 512).]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (#layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)          45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)          56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1,024)        50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
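The slides report z values and p values for the paired preference tests but do not say which statistical test produced them. A two-sided sign test (binomial with a normal approximation, neutral responses discarded) is a common choice for such tests; the sketch below is that assumption, with illustrative counts, not a reconstruction of the authors' exact analysis.

```python
import math

def sign_test_z(n_a, n_b):
    """Two-sided sign-test z-score for a paired preference test.

    n_a, n_b: number of trials preferring system A / system B
    (neutral responses are discarded). Under H0 (no preference),
    preferences follow Binomial(n, 0.5); the normal approximation
    gives z = (n_a - n/2) / sqrt(n/4).
    """
    n = n_a + n_b
    z = (n_a - n / 2.0) / math.sqrt(n / 4.0)
    # two-sided p-value from the standard normal CDF
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# Hypothetical counts: 158 vs 385 non-neutral preferences.
z, p = sign_test_z(158, 385)
print(z, p < 1e-6)  # large |z| -> a strongly significant preference
```

A large negative z here corresponds to the second system being preferred, matching the sign convention apparent in the table above.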
Outline
Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: scatter of data samples in (y1, y2) with a unimodal NN prediction running through the conditional mean.]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), which parameterize a 2-component GMM over y.]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)      w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                  μ2(x) = z4
σ1(x) = exp(z5)                             σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
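The activation functions above can be written out directly: softmax for the weights, identity for the means, exponential for the standard deviations, plus the negative log-likelihood that an MDN is trained to minimize. A minimal NumPy sketch for a 1-dim, M-mix output layer (function names are mine, not from the slides):

```python
import numpy as np

def mdn_params(z):
    """Map raw output-layer activations z to GMM parameters for a
    1-dim, M-mix MDN. z has length 3*M, split as
    [weight logits | means | log-variance-like terms], mirroring the
    slide's w/mu/sigma layout (M is arbitrary here, not just 2)."""
    m = len(z) // 3
    logits, mu, s = z[:m], z[m:2 * m], z[2 * m:]
    w = np.exp(logits - logits.max())
    w /= w.sum()                  # softmax -> mixture weights sum to 1
    sigma = np.exp(s)             # exponential keeps std devs positive
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of target y under the predicted GMM
    (the MDN training criterion)."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())

# 2-mix example matching the slide: z1..z6 -> weights, means, variances.
w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
# equal logits -> equal weights; linear means; exp(0) -> unit std devs
```

Because the loss is a proper mixture likelihood, the network can keep several modes alive instead of averaging them, which is exactly the unimodality fix the previous slide motivates.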
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline. TEXT → text analysis → input feature extraction → duration prediction; at each frame t = 1, …, T the DMDN maps x_t to GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t); parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture:  4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization:      AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

                         MOS
HMM          1 mix       3.537 ± 0.113
             2 mix       3.397 ± 0.115
DNN          4 × 1,024   3.635 ± 0.127
             5 × 1,024   3.681 ± 0.109
             6 × 1,024   3.652 ± 0.108
             7 × 1,024   3.637 ± 0.129
DMDN         1 mix       3.654 ± 0.117
(4 × 1,024)  2 mix       3.796 ± 0.107
             4 mix       3.766 ± 0.113
             8 mix       3.805 ± 0.113
             16 mix      3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
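The ± values above are interval half-widths around each mean opinion score. The slides do not state the confidence level; assuming the usual normal-approximation 95% interval, the computation is just mean ± 1.96 · s/√n, sketched below with made-up ratings:

```python
import math

def mos_with_ci(scores, z=1.96):
    """Mean opinion score with a normal-approximation confidence
    interval half-width (z = 1.96 for 95%; the confidence level used
    on the slide is an assumption, not stated there)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)                         # interval half-width
    return mean, half

# Hypothetical listener ratings on the 1-5 naturalness scale.
scores = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4]
mean, half = mos_with_ci(scores)
print(f"{mean:.3f} +/- {half:.3f}")
```

With the thousands of ratings behind each table row, half-widths on the order of ±0.11 are what this formula would produce for rating standard deviations near 1.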
Outline
Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: basic RNN with recurrent connections, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}.]

• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in the hidden layers loops through the recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
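The decay problem the slide describes is easy to see in a basic (Elman) RNN, h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h): an input at t = 0 shrinks through the recurrent loop at every step. A toy sketch (weights chosen arbitrarily for illustration):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Basic (Elman) RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    h_t sees only past inputs, which is why the slide points to
    bidirectional RNNs when future context is also needed."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# An impulse at t = 0 followed by silence: its trace in the hidden
# state roughly halves per step with these contractive weights.
W_xh = 0.5 * np.eye(2)
W_hh = 0.5 * np.eye(2)
b_h = np.zeros(2)
x_seq = [np.array([1.0, 1.0])] + [np.zeros(2)] * 9
hs = rnn_forward(x_seq, W_xh, W_hh, b_h)
```

After ten steps the impulse's contribution is below 10⁻³, illustrating why plain recurrence struggles with long-range context and motivating the LSTM on the next slide.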
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A linear memory cell c_t is surrounded by an input gate (write), an output gate (read), and a forget gate (reset); sigmoid gate units and tanh input/output non-linearities operate on x_t and h_{t−1} to produce h_t.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
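One step of the gated update in the block diagram can be sketched as follows. This is a minimal, standard LSTM step; peephole connections are omitted, an assumption since the diagram's exact wiring is not fully legible here, and the parameter names are mine:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: a linear memory cell guarded by multiplicative
    gates, matching the slide's roles (write = input gate,
    read = output gate, reset = forget gate). p holds weight matrices
    W_* acting on [x; h_prev] and bias vectors b_*."""
    z = np.concatenate([x, h_prev])           # gates see x_t and h_{t-1}
    i = sigm(p["W_i"] @ z + p["b_i"])         # input gate (write)
    f = sigm(p["W_f"] @ z + p["b_f"])         # forget gate (reset)
    o = sigm(p["W_o"] @ z + p["b_o"])         # output gate (read)
    g = np.tanh(p["W_c"] @ z + p["b_c"])      # candidate cell input
    c = f * c_prev + i * g                    # linear cell update
    h = o * np.tanh(c)                        # gated read-out
    return h, c
```

With the forget gate saturated open and the input gate shut, c_t carries c_{t−1} through unchanged; that linear path is the "better memory" the slide refers to, in contrast to the decaying loop of the basic RNN.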
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS. TEXT → text analysis → input feature extraction → duration prediction; an LSTM maps input features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T; parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database:             US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate:        16 kHz
Analysis window:      25-ms width, 5-ms shift
Linguistic features:  DNN: 449; LSTM: 289
Acoustic features:    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                  4 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:       postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/∆   DNN w/o ∆   LSTM w/∆   LSTM w/o ∆   Neutral   z      p
50.0      14.2        –          –            35.8      12.0   < 10⁻¹⁰
–         –           30.2       15.6         54.2      5.1    < 10⁻⁶
15.8      –           34.0       –            50.2      −6.2   < 10⁻⁹
28.4      –           –          33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Text-to-speech as sequence-to-sequence mapping
bull Automatic speech recognition (ASR)Speech (continuous time series) rarr Text (discrete symbol sequence)
bull Machine translation (MT)Text (discrete symbol sequence) rarr Text (discrete symbol sequence)
bull Text-to-speech synthesis (TTS)Text (discrete symbol sequence) rarr Speech (continuous time series)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79
Text-to-speech as sequence-to-sequence mapping
bull Automatic speech recognition (ASR)Speech (continuous time series) rarr Text (discrete symbol sequence)
bull Machine translation (MT)Text (discrete symbol sequence) rarr Text (discrete symbol sequence)
bull Text-to-speech synthesis (TTS)Text (discrete symbol sequence) rarr Speech (continuous time series)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79
Speech production process
text(concept)
frequencytransfercharacteristics
magnitudestart--end
fundamentalfrequency
mod
ulat
ion
of c
arrie
r wav
eby
spe
ech
info
rmat
ion
fund
amen
tal f
req
voic
edu
nvoi
ced
freq
trans
fer c
har
air flow
Sound sourcevoiced pulseunvoiced noise
speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
Sentence segmentaitonWord segmentationText normalization
Part-of-speech taggingPronunciation
Prosody prediction
Waveform generation
TEXT
Text analysis
SYNTHESIZEDSPEECH
Speech synthesisdiscrete rArr discrete
discrete rArr continuous
NLP
Speech
Frontend
Backend
This talk focuses on backend
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
All segments
Target cost Concatenation cost
bull Concatenate actual instances of speech from database
bull Large data + automatic learningrarr High-quality synthetic voices can be built automatically
bull Single inventory per unit rarr diphone synthesis [1]
bull Multiple inventory per unit rarr unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
Speech Speech
Text Text
Parametergeneration
Speechsynthesis
Textanalysis
Speechanalysis
Textanalysis
Modeltraining
x
y
x
y
λ
bull Training
minus Extract linguistic features x amp acoustic features yminus Train acoustic model λ given (xy)
λ = arg max p(y | x λ)
bull Synthesis
minus Extract x from text to be synthesizedminus Generate most probable y from λ
y = arg max p(y | x λ)
minus Reconstruct speech from y
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
Speech Speech
Text Text
Parametergeneration
Speechsynthesis
Textanalysis
Speechanalysis
Textanalysis
Modeltraining
x
y
x
y
λ
bull Training
minus Extract linguistic features x amp acoustic features yminus Train acoustic model λ given (xy)
λ = arg max p(y | x λ)
bull Synthesis
minus Extract x from text to be synthesizedminus Generate most probable y from λ
y = arg max p(y | x λ)
minus Reconstruct speech from y
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
Speech Speech
Text Text
Parametergeneration
Speechsynthesis
Textanalysis
Speechanalysis
Textanalysis
Modeltraining
x
y
x
y
λ
bull Large data + automatic trainingrarr Automatic voice building
bull Parametric representation of speechrarr Flexible to change its voice characteristics
Hidden Markov model (HMM) as its acoustic modelrarr HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79
Speech production process
text(concept)
frequencytransfercharacteristics
magnitudestart--end
fundamentalfrequency
mod
ulat
ion
of c
arrie
r wav
eby
spe
ech
info
rmat
ion
fund
amen
tal f
req
voic
edu
nvoi
ced
freq
trans
fer c
har
air flow
Sound sourcevoiced pulseunvoiced noise
speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
pulse train
white noise
speech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
x(n) = h(n) lowast e(n)
darr Fourier transform
X(ejω) = H(ejω)E(ejω)
Source excitation part Vocal tract resonance part
H(ejω)
should be defined by HMM state-output vectorseg mel-cepstrum line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
Parametric models of speech signal
Autoregressive (AR) model Exponential (EX) model
H(z) =K
1minusMsum
m=0
c(m)zminusm
H(z) = expMsum
m=0
c(m)zminusm
Estimate model parameters based on ML
c = arg maxcp(x | c)
bull p(x | c) AR model rarr Linear predictive analysis [5]
bull p(x | c) EX model rarr (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
Examples of speech spectra
0 1 2 3 4 5Frequency (kHz)
-20
0
20
40
60
80
Log
mag
nitu
de (d
B)
0 1 2 3 4 5Frequency (kHz)
-20
0
20
40
60
80
Log
mag
nitu
de (d
B)
(a) ML-based cepstral analysis (b) Linear prediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
∆ Mel-cepstral coefficients
log F0
∆ log F0
∆∆ log F0
Spectrum part
Excitation part
∆ct
∆2ct
pt
δpt
δ2pt
ct
ot
∆∆ Mel-cepstral coefficients
Mel-cepstral coefficients
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
Spe
ctru
mE
xcita
tion
∆ct
∆2ct
pt
δ pt
δ 2pt
ct
ot
Stream
1
o1t
o2t
o3t
o4t
bj(ot)
b1j(o
1t )
b2j(o
2t )
b3j(o
3t )
b4j(o
4t )
bj(ot)Sprod
s=1
(bsj(o
st ))ws=
23
4
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
Compute variancefloor (HCompV)
Initialize CI-HMMs bysegmental k-means (HInit)
Reestimate CI-HMMs byEM algorithm
(HRest amp HERest)
Copy CI-HMMs to CD-HMMs (HHEd CL)
Reestimate CD-HMMs byEM algorithm (HERest)
Decision tree-basedclustering (HHEd TB)
Reestimate CD-HMMsby EM algorithm (HERest)
Untie parameter tyingstructure (HHEd UT)
monophone(context-independent CI)
fullcontext(context-dependent CD)
EstimatedHMMs
data amp labels
Estimate CD-dur modelsfrom FB stats (HERest)
Decision tree-basedclustering (HHEd TB)
Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
bull preceding succeeding two phonemes
bull Position of current phoneme in current syllable
bull of phonemes at preceding current succeeding syllable
bull accent stress of preceding current succeeding syllable
bull Position of current syllable in current word
bull of preceding succeeding stressed accented syllables in phrase
bull of syllables from previous to next stressed accented syllable
bull Guess at part of speech of preceding current succeeding word
bull of syllables in preceding current succeeding word
bull Position of current word in current phrase
bull of preceding succeeding content words in current phrase
bull of words from previous to next content word
bull of syllables in preceding current succeeding phrase
Impossible to have all possible modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
L=voice
L=w R=silence yes
yes yes
no
no no
w-a+sil w-a+sh gy-a+pau
g-a+silgy-a+silw-a+t
k-a+b
t-a+n
leaf nodes
yes no yes no
synthesized states
R=silence L=gy
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
Decision treesfor
mel-cepstrum
Decision treesfor F0
Spectrum amp excitation can have different context dependencyrarr Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
i
1 2 3 4 5 6 7 T=8t
t0 t1
Probability to enter state i at t0 then leave at t1 + 1
χt0t1(i) propsum
j 6=i
αt0minus1(j)ajiat1minust0ii
t1prod
t=t0
bi(ot)sum
k 6=i
aikbk(ot1+1)βt1+1(k)
rarr estimate state duration modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
State durationmodel
Decision treesfor
mel-cepstrum
Decision treesfor F0
Decision treefor state durmodels
HMM
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
Without grouping questions, with numeric contexts; silence frames removed
(Figure: trajectories of the 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512))
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar numbers of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
(Figure: data samples scattered in (y1, y2) space vs. the NN prediction at the conditional mean)
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
(Figure: network whose output layer yields w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x), defining a mixture density over y)
Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function
Inputs of the activation functions (1-dim, 2-mix MDN), with z_j = Σ_{i=1}^{4} h_i w_{ij}:
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
µ1(x) = z3,  µ2(x) = z4
σ1(x) = exp(z5),  σ2(x) = exp(z6)
NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
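The output-layer mapping above is easy to sketch directly. The following is an illustrative 1-dim, 2-mixture example matching the slide's activation choices (softmax weights, linear means, exponential variances); the activation values are arbitrary.

```python
import math

def mdn_params(z):
    """Map the 6 output-layer activations z1..z6 of a 1-dim, 2-mix MDN to
    mixture weights (softmax), means (linear) and std devs (exponential)."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)  # weights sum to 1
    mu = (z3, z4)                                     # means: linear
    sigma = (math.exp(z5), math.exp(z6))              # std devs: positive
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as the 2-component GMM defined by the network outputs."""
    return sum(wi / (math.sqrt(2 * math.pi) * si)
               * math.exp(-0.5 * ((y - mi) / si) ** 2)
               for wi, mi, si in zip(w, mu, sigma))

w, mu, sigma = mdn_params([0.2, -0.4, 1.0, -1.0, 0.0, 0.5])
print(round(w[0] + w[1], 6))  # 1.0
```

The softmax guarantees valid mixture weights and the exponential guarantees positive variances, so any real-valued network output defines a proper density.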
DMDN-based SPSS [27]
(Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN mapping x_1 … x_T to mixture parameters w_m(x_t), µ_m(x_t), σ_m(x_t) at each frame → parameter generation → waveform synthesis → SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113, 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127, 5×1024 3.681 ± 0.109, 6×1024 3.652 ± 0.108, 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117, 2 mix 3.796 ± 0.107, 4 mix 3.766 ± 0.113, 8 mix 3.805 ± 0.113, 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
(Figure: RNN unrolled in time, with recurrent connections; inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1})
• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
(Figure: LSTM block with memory cell c_t; input gate (write), forget gate (reset), and output gate (read) computed from x_t and h_{t−1} with sigmoid activations; cell input and output squashed with tanh; output h_t)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
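One step of the cell in the figure can be sketched as follows. This is a standard LSTM update with scalar input and hidden state for readability (a real layer is vector-valued); the parameter values are arbitrary.

```python
import math

def sigm(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM cell: sigmoid gates, tanh squashing."""
    i = sigm(p["wi"] * x + p["ui"] * h_prev + p["bi"])       # input gate (write)
    f = sigm(p["wf"] * x + p["uf"] * h_prev + p["bf"])       # forget gate (reset)
    o = sigm(p["wo"] * x + p["uo"] * h_prev + p["bo"])       # output gate (read)
    g = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # cell candidate
    c = f * c_prev + i * g    # linear memory cell: gated accumulation
    h = o * math.tanh(c)      # exposed hidden state
    return h, c

# Arbitrary illustrative parameters; a trained network would learn these.
p = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:   # run a short input sequence
    h, c = lstm_step(x, h, c, p)
print(-1.0 < h < 1.0)  # True: h is bounded by the tanh output squashing
```

The key point is the additive cell update c = f·c + i·g: because the cell is linear, gradients and stored information survive across many time steps instead of decaying through repeated squashing, which is exactly the long-range-context problem of the basic RNN.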
LSTM-based SPSS [33 34]
(Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1, x_2, …, x_T to y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech production process
(Figure: speech production as modulation of a carrier wave by speech information — from text (concept) via fundamental frequency (voiced/unvoiced) and frequency transfer characteristics; sound source: voiced pulse / unvoiced noise; air flow → speech)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
TEXT
→ Text analysis (frontend, NLP; discrete ⇒ discrete):
sentence segmentation, word segmentation, text normalization,
part-of-speech tagging, pronunciation, prosody prediction
→ Speech synthesis (backend, Speech; discrete ⇒ continuous):
waveform generation
→ SYNTHESIZED SPEECH
This talk focuses on the backend
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
(Figure: unit selection over all segments in the database, with target costs and concatenation costs)
• Concatenate actual instances of speech from a database
• Large data + automatic learning
→ high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
(Figure: training — speech analysis and text analysis extract acoustic features y and linguistic features x for model training of λ; synthesis — text analysis, parameter generation, and speech synthesis turn text into speech)
• Training
− Extract linguistic features x & acoustic features y
− Train acoustic model λ given (x, y):
λ̂ = arg max_λ p(y | x, λ)
• Synthesis
− Extract x from the text to be synthesized
− Generate the most probable y from λ̂:
ŷ = arg max_y p(y | x, λ̂)
− Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
(Figure: the same training/synthesis pipeline)
• Large data + automatic training
→ automatic voice building
• Parametric representation of speech
→ flexible to change its voice characteristics
Hidden Markov model (HMM) as the acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]
(Figure: training part — speech database → spectral & excitation parameter extraction + labels → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → synthesis filter (with spectral parameters) → SYNTHESIZED SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process
(Figure: speech production as modulation of a carrier wave by speech information — from text (concept) via fundamental frequency (voiced/unvoiced) and frequency transfer characteristics; sound source: voiced pulse / unvoiced noise; air flow → speech)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
(Figure: pulse train / white noise excitation e(n) passed through a linear time-invariant system h(n) to produce speech x(n) = h(n) ∗ e(n))
x(n) = h(n) ∗ e(n)
↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})
E(e^{jω}): source (excitation) part; H(e^{jω}): vocal tract resonance part
H(e^{jω}) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
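The source-filter relation x(n) = h(n) ∗ e(n) is just a discrete convolution, sketched below. The pulse period and impulse response here are toy values for illustration, not a real vocal-tract filter.

```python
def convolve(h, e):
    """x(n) = h(n) * e(n): discrete linear convolution."""
    x = [0.0] * (len(h) + len(e) - 1)
    for n, hn in enumerate(h):
        for m, em in enumerate(e):
            x[n + m] += hn * em
    return x

# Excitation: a pulse train with period 4 (a "voiced" source);
# filter: a short decaying impulse response standing in for h(n).
e = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
h = [1.0, 0.6, 0.36, 0.216]
x = convolve(h, e)
print(x[:4])  # [1.0, 0.6, 0.36, 0.216]: h(n) is reproduced at each pulse
```

With a pulse-train excitation the output repeats the impulse response at the pitch period, which is why the excitation controls F0 while h(n) carries the spectral envelope.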
Parametric models of speech signal
Autoregressive (AR) model: H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})
Exponential (EX) model: H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}
Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)
• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
Examples of speech spectra
(Figure: log-magnitude spectra, 0–5 kHz: (a) ML-based cepstral analysis, (b) linear prediction)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
(Figure: training part — speech database → spectral & excitation parameter extraction + labels → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → synthesis filter (with spectral parameters) → SYNTHESIZED SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
(Figure: state-output vector o_t = [c_t⊤, ∆c_t⊤, ∆²c_t⊤, p_t, δp_t, δ²p_t]⊤; spectrum part: mel-cepstral coefficients plus ∆ and ∆∆; excitation part: log F0 plus ∆ and ∆∆)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
(Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1 … o_T, state sequence Q = 1 1 1 1 2 2 3 3 …)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
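The figure's joint probability p(O, Q | λ) = π_{q₁} b_{q₁}(o₁) Π_t a_{q_{t−1} q_t} b_{q_t}(o_t) can be computed directly. The sketch below uses single-Gaussian state outputs and made-up parameters purely to illustrate the bookkeeping.

```python
import math

def gaussian(o, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (o - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def joint_prob(O, Q, pi, A, means, variances):
    """p(O, Q | lambda) for an HMM with single-Gaussian state outputs:
    pi_{q1} b_{q1}(o_1) * prod_t a_{q_{t-1} q_t} b_{q_t}(o_t)."""
    p = pi[Q[0]] * gaussian(O[0], means[Q[0]], variances[Q[0]])
    for t in range(1, len(O)):
        p *= A[Q[t - 1]][Q[t]] * gaussian(O[t], means[Q[t]], variances[Q[t]])
    return p

# 3-state left-to-right HMM (states 0, 1, 2), as in the figure.
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
means, variances = [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]
O = [0.1, -0.2, 0.0, 0.1, 0.9, 1.1, 2.0, 1.8]
Q = [0, 0, 0, 0, 1, 1, 2, 2]       # the state sequence from the figure
print(joint_prob(O, Q, pi, A, means, variances) > 0.0)  # True
```

Summing this quantity over all state sequences Q gives p(O | λ), which is what the forward algorithm computes efficiently.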
Multi-stream HMM structure
(Figure: observation vector o_t split into streams: o_t^1 = [c_t⊤, ∆c_t⊤, ∆²c_t⊤]⊤ (spectrum, stream 1) and o_t^2 = p_t, o_t^3 = δp_t, o_t^4 = δ²p_t (excitation, streams 2–4))
b_j(o_t) = Π_{s=1}^{S} ( b_j^{(s)}(o_t^{(s)}) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB)
→ estimated HMMs & duration models (from data & labels)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• preceding, succeeding two phonemes
• position of current phoneme in current syllable
• # of phonemes in preceding, current, succeeding syllable
• accent, stress of preceding, current, succeeding syllable
• position of current syllable in current word
• # of preceding, succeeding stressed/accented syllables in phrase
• # of syllables from previous/to next stressed/accented syllable
• guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous/to next content word
• # of syllables in preceding, current, succeeding phrase
Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
(Figure: decision tree with questions such as L=voice?, L="w"?, R=silence?, L="gy"? routing context-dependent models (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) to leaf nodes of synthesized states)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
(Figure: separate decision trees for mel-cepstrum and for F0)
Spectrum & excitation can have different context dependency
→ build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
(Figure: trellis over t = 1 … T = 8 showing occupancy of state i from t0 to t1)
Probability to enter state i at t0 and then leave at t1 + 1:
χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)
→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
(Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
(Figure: training part — speech database → spectral & excitation parameter extraction + labels → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → synthesis filter (with spectral parameters) → SYNTHESIZED SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM λ and words w:
ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)
Determine the best state sequence and outputs sequentially:
q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
(Figure: the same 3-state left-to-right HMM; observation sequence O, best state sequence Q̂ = 1 1 1 1 2 2 3 3 …, state durations D = 4, 10, 5)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features
(Figure: per-state means and variances)
ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:
o_t = [c_t⊤, ∆c_t⊤]⊤,  ∆c_t = c_t − c_{t−1}
The relationship between the static features c and the state outputs o can be arranged as a linear mapping o = Wc:
[… c_{t−1}⊤ ∆c_{t−1}⊤ c_t⊤ ∆c_t⊤ c_{t+1}⊤ ∆c_{t+1}⊤ …]⊤
= W [… c_{t−2}⊤ c_{t−1}⊤ c_t⊤ c_{t+1}⊤ …]⊤
where each static row of W places an M×M identity block I at frame t, and each delta row places −I at frame t−1 and I at frame t.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
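The band structure of W is easy to see in code. This sketch builds W for the 1-dimensional case with the slide's delta definition ∆c_t = c_t − c_{t−1} (taking c_0 = 0 at the boundary, a simplifying assumption):

```python
def build_window_matrix(T):
    """Build the (2T x T) matrix W stacking static and delta rows so that
    W c = [c_1, dc_1, c_2, dc_2, ...] with dc_t = c_t - c_{t-1} (c_0 = 0)."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0              # static row: picks out c_t
        W[2 * t + 1][t] = 1.0          # delta row:  c_t ...
        if t > 0:
            W[2 * t + 1][t - 1] = -1.0  # ... minus c_{t-1}
    return W

W = build_window_matrix(3)
c = [2.0, 5.0, 4.0]
o = [sum(wij * cj for wij, cj in zip(row, c)) for row in W]
print(o)  # [2.0, 2.0, 5.0, 3.0, 4.0, -1.0]
```

Each pair of output entries is (c_t, ∆c_t), so the whole static-plus-dynamic observation sequence is a linear function of the static trajectory alone.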
Speech parameter generation algorithm [9]
Introduce the dynamic feature constraints:
ô = arg max_o p(o | q, λ)  subject to  o = Wc
If the state-output distribution is a single Gaussian,
p(o | q, λ) = N(o; µ_q, Σ_q)
Setting ∂ log N(Wc; µ_q, Σ_q) / ∂c = 0 gives
W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ µ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
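Solving W⊤Σ_q⁻¹W c = W⊤Σ_q⁻¹µ_q can be sketched directly with a dense solver (a real implementation exploits the band structure; the means and variances below are illustrative, not from the talk):

```python
import numpy as np

def build_W(T):
    """Static + delta window matrix, dc_t = c_t - c_{t-1} (with c_0 = 0)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var, T):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for the trajectory c.
    mu/var: length-2T vectors of state-output means and diagonal variances
    for the (static, delta) features at each frame."""
    W = build_W(T)
    P = np.diag(1.0 / np.asarray(var))       # Sigma_q^-1 (diagonal)
    A = W.T @ P @ W
    b = W.T @ P @ np.asarray(mu)
    return np.linalg.solve(A, b)

# Step-wise static means (0 for 3 frames, then 1) with zero delta means:
# the solution is a smooth transition, not a staircase.
T = 6
mu, var = [], []
for t in range(T):
    mu += [0.0 if t < 3 else 1.0, 0.0]       # (static mean, delta mean)
    var += [1.0, 0.1]                        # tight delta variance -> smooth
c = mlpg(mu, var, T)
print(c[0] < c[-1])  # True: the trajectory rises from near 0 toward 1
```

This is why the generated trajectory in the next figure is smooth: the tight delta variances penalize frame-to-frame jumps, pulling the step-wise means into a continuous curve.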
Speech parameter generation algorithm [9]
(Figure: the band-structured linear system W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ µ_q, with mean vectors µ_{q,1} … µ_{q,T} on the right-hand side and the solution trajectory c_1 … c_T)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
(Figure: generated static and dynamic speech parameter trajectories c with the per-state means and variances)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
(Figure: training part — speech database → spectral & excitation parameter extraction + labels → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → synthesis filter (with spectral parameters) → SYNTHESIZED SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
(Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n))
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
(Figure: average-voice model built from training speakers by adaptive training, then adapted to target speakers)
• Train an average voice model (AVM) from training speakers using speaker-adaptive training (SAT)
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
(Figure: pulse train and white noise excitation e(n) switching between unvoiced and voiced segments)
• Spectral envelope extraction:
harmonic effects often cause problems
(Figure: power spectrum in dB, 0–8 kHz)
• Phase:
important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time
None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled
(Figure: natural vs. generated spectra, 0–8 kHz)
• Why?
− Details of the spectral (formant) structure disappear
− Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
− Mel-cepstrum
− LSP
• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis
• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics (adaptation; interpolation, eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
HMM-based statistical parametric speech synthesis (SPSS)
Flexibility
Improvements

Statistical parametric speech synthesis with neural networks
Deep neural network (DNN)-based SPSS
Deep mixture density network (DMDN)-based SPSS
Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than in ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
(Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y.)

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
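The frame-wise mapping this slide describes is a plain feedforward pass. A minimal numpy sketch, assuming illustrative layer sizes and sigmoid hidden units (the experimental setup later in the deck uses sigmoid hidden layers); `dnn_forward`, the sizes, and the random weights are all hypothetical:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic feature statistics.

    Hidden layers use a sigmoid nonlinearity; the output layer is linear.
    All shapes and values here are illustrative, not the deck's actual model.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [25, 512, 512, 40]                       # linguistic dims -> acoustic dims
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
y = dnn_forward(rng.standard_normal(25), weights, biases)
assert y.shape == (40,)
```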
Framework
(Figure: TEXT → text analysis → input feature extraction produces input features —
including binary & numeric features — at frames 1 … T; duration prediction supplies
duration and frame-position features; the network (input layer, hidden layers, output
layer) maps each frame to statistics (mean & var) of the speech parameter vector
sequence — spectral features, excitation features, V/UV feature — followed by parameter
generation and waveform synthesis to produce SPEECH.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer,
                      sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

(Figure: 5th mel-cepstrum trajectories over frames 0–500 — natural speech, HMM (α=1),
DNN (4×512).)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10^−6    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10^−6    −5.1
12.7 (1)     36.6 (4 × 1,024)       50.7      < 10^−6    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
(Figure: data samples in (y1, y2) space vs. the NN prediction at the conditional mean.)

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
(Figure: 1-dim, 2-mix MDN — the output layer produces w_1(x), w_2(x), µ_1(x), µ_2(x),
σ_1(x), σ_2(x).)

Inputs of activation function:  z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),   w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  µ_1(x) = z_3,   µ_2(x) = z_4

Variances → exponential activation function:
  σ_1(x) = exp(z_5),   σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
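The activation choices above (softmax for weights, linear for means, exponential for variances) can be written out directly. A minimal numpy sketch for the 1-dim, 2-mix case; `mdn_outputs` and the sample values of z_1…z_6 are hypothetical:

```python
import numpy as np

def mdn_outputs(z, num_mix):
    """Split a linear layer output z into GMM weights, means & std devs
    (1-dim targets, laid out as [z_w..., z_mu..., z_sigma...])."""
    zw, zm, zv = z[:num_mix], z[num_mix:2 * num_mix], z[2 * num_mix:3 * num_mix]
    w = np.exp(zw - zw.max())
    w /= w.sum()              # softmax -> mixture weights sum to 1
    mu = zm                   # linear  -> means
    sigma = np.exp(zv)        # exponential -> positive std devs
    return w, mu, sigma

z = np.array([0.2, -0.1, 1.5, -0.3, 0.0, 0.4])   # z_1..z_6 for a 2-mix, 1-dim MDN
w, mu, sigma = mdn_outputs(z, 2)
assert np.isclose(w.sum(), 1.0) and np.all(sigma > 0)
```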
DMDN-based SPSS [27]
(Figure: DMDN framework — TEXT → text analysis → input feature extraction → duration
prediction gives x_1 … x_T; at each frame t the network outputs w_1(x_t), w_2(x_t),
µ_1(x_t), µ_2(x_t), σ_1(x_t), σ_2(x_t); parameter generation and waveform synthesis
produce SPEECH.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture     4–7 hidden layers, 1024 units/hidden layer,
                     ReLU (hidden), linear (output)
DMDN architecture    4 hidden layers, 1024 units/hidden layer,
                     ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization         AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix      3.537 ± 0.113
           2 mix      3.397 ± 0.115
DNN        4×1024     3.635 ± 0.127
           5×1024     3.681 ± 0.109
           6×1024     3.652 ± 0.108
           7×1024     3.637 ± 0.129
DMDN       1 mix      3.654 ± 0.117
(4×1024)   2 mix      3.796 ± 0.107
           4 mix      3.766 ± 0.113
           8 mix      3.805 ± 0.113
           16 mix     3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
(Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1},
y_t, y_{t+1} through recurrent connections.)

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block — memory cell c_t; input gate (write), forget gate (reset) and
output gate (read) with sigmoid activations; tanh squashing on the cell input and
output; all gates see x_t and h_{t−1}.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
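The gate structure above can be sketched step by step. A minimal numpy LSTM step (no peephole connections or projection layer, unlike the LSTM used in the experiments later); the weight dict `p` and all shapes are hypothetical:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: sigmoid gates control writing to (input gate),
    resetting (forget gate) and reading from (output gate) the linear
    memory cell. Weight dict p is illustrative."""
    z = np.concatenate([x, h_prev])
    i = sigm(p["Wi"] @ z + p["bi"])                       # input gate: write
    f = sigm(p["Wf"] @ z + p["bf"])                       # forget gate: reset
    o = sigm(p["Wo"] @ z + p["bo"])                       # output gate: read
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])   # memory cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
nx, nh = 8, 4
p = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(nh) for k in ("bi", "bf", "bo", "bc")})
h, c = np.zeros(nh), np.zeros(nh)
for t in range(5):                                        # run a short sequence
    h, c = lstm_step(rng.standard_normal(nx), h, c, p)
assert h.shape == (4,)
```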
LSTM-based SPSS [33 34]
(Figure: TEXT → text analysis → input feature extraction → duration prediction gives
x_1 … x_T; the LSTM maps them to y_1 … y_T; parameter generation and waveform synthesis
produce SPEECH.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database              US English female speaker
Train / dev set data  34,632 & 100 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   DNN: 449, LSTM: 289
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                   4 hidden layers, 1024 units/hidden layer,
                      ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                  1 forward LSTM layer, 256 units, 128 projection,
                      asynchronous SGD on CPUs [35]
Postprocessing        Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN                   LSTM                  Stats
w/ ∆      w/o ∆       w/ ∆      w/o ∆       Neutral    z        p
50.0      14.2        –         –           35.8       12.0     < 10^−10
–         –           30.2      15.6        54.2       5.1      < 10^−6
15.8      –           34.0      –           50.2       −6.2     < 10^−9
28.4      –           –         33.6        38.0       −1.5     0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech production process
(Figure: speech production — the text (concept) modulates a carrier wave with speech
information: the fundamental frequency (voiced/unvoiced) controls the sound source
(voiced: pulse, unvoiced: noise), and frequency transfer characteristics, magnitude
and start–end timing shape the air flow to produce speech.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79
Typical flow of TTS system
(Figure: TEXT → text analysis (frontend, NLP; discrete ⇒ discrete): sentence
segmentation, word segmentation, text normalization, part-of-speech tagging,
pronunciation, prosody prediction → speech synthesis (backend, speech;
discrete ⇒ continuous): waveform generation → SYNTHESIZED SPEECH.)

This talk focuses on backend
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79
Concatenative speech synthesis
(Figure: all segments in the database; selection minimizes target cost & concatenation
cost.)

• Concatenate actual instances of speech from database
• Large data + automatic learning
  → High-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
(Figure: training — speech analysis & text analysis extract acoustic features y and
linguistic features x; model training yields λ. Synthesis — text analysis extracts x,
parameter generation produces y, speech synthesis reconstructs speech.)

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):
      λ̂ = arg max_λ p(y | x, λ)
• Synthesis
  − Extract x from text to be synthesized
  − Generate most probable y from λ̂:
      ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
(Figure: training/synthesis block diagram — speech & text analysis, model training λ,
parameter generation, speech synthesis.)

• Large data + automatic training
  → Automatic voice building
• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]
(Figure: training part — spectral parameter extraction and excitation parameter
extraction from the SPEECH DATABASE, plus labels, feed HMM training, yielding
context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis
→ labels → parameter generation from HMMs; generated excitation parameters drive
excitation generation, generated spectral parameters drive the synthesis filter →
SYNTHESIZED SPEECH.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process
(Figure: speech production — the text (concept) modulates a carrier wave with speech
information: the fundamental frequency (voiced/unvoiced) controls the sound source
(voiced: pulse, unvoiced: noise), and frequency transfer characteristics, magnitude
and start–end timing shape the air flow to produce speech.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
(Figure: pulse train / white noise → excitation e(n) → linear time-invariant system
h(n) → speech x(n) = h(n) ∗ e(n).)

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source (excitation) part;  H(e^{jω}): vocal tract resonance part —
should be defined by HMM state-output vectors, e.g. mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
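The relation x(n) = h(n) ∗ e(n) above can be reproduced in a few lines. A minimal sketch with a hypothetical impulse response and a voiced pulse-train excitation (white noise would replace the pulse train for unvoiced frames):

```python
import numpy as np

def excitation(T, f0_period):
    """Pulse-train excitation for voiced frames (period in samples)."""
    e = np.zeros(T)
    e[::f0_period] = 1.0
    return e

# Speech as excitation convolved with an (illustrative) impulse response h(n)
h = np.array([1.0, 0.5, 0.25, 0.125])
e = excitation(64, 16)
x = np.convolve(e, h)[:64]          # x(n) = h(n) * e(n)
assert x.shape == (64,)
```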
Parametric models of speech signal
Autoregressive (AR) model:                  Exponential (EX) model:

H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})    H(z) = exp( Σ_{m=0}^{M} c(m) z^{−m} )

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
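For the AR model, ML estimation reduces to linear predictive analysis. A minimal sketch of the autocorrelation method on a synthetic AR(2) signal; the coefficients 0.7 and −0.2 are hypothetical test values:

```python
import numpy as np

def lpc(x, order):
    """Linear predictive analysis (AR model fit) via the autocorrelation
    method: solve the normal equations R a = r for predictor coefficients."""
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    gain2 = r[0] - a @ r[1:]          # residual energy -> squared gain
    return a, gain2

# Synthetic AR(2) signal: x[n] = 0.7 x[n-1] - 0.2 x[n-2] + e[n]
rng = np.random.default_rng(2)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.7 * x[n - 1] - 0.2 * x[n - 2] + e[n]
a, g2 = lpc(x, 2)
assert np.allclose(a, [0.7, -0.2], atol=0.05)
```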
Examples of speech spectra
(Figure: log magnitude (dB) spectra over 0–5 kHz — (a) ML-based cepstral analysis,
(b) linear prediction.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
Structure of state-output (observation) vectors
o_t = [ c_t^T, ∆c_t^T, ∆²c_t^T, p_t, δp_t, δ²p_t ]^T

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t,
∆∆ mel-cepstral coefficients ∆²c_t
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
(Figure: 3-state left-to-right HMM — initial probability π_1, transitions a_11, a_12,
a_22, a_23, a_33, state-output distributions b_1(o_t), b_2(o_t), b_3(o_t);
observation sequence O = o_1 o_2 … o_T aligned with state sequence
Q = 1 1 1 1 2 2 3 3 ….)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
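The likelihood of an observation sequence O under such an HMM sums over all state sequences Q. A minimal forward-algorithm sketch; the left-to-right topology follows the figure, but all probabilities are hypothetical toy values:

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: total likelihood of an observation sequence under
    an HMM, summing over all state sequences. B[t, j] = b_j(o_t)."""
    alpha = pi * B[0]
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]      # predict across transitions, then observe
    return alpha.sum()

# Toy 3-state left-to-right HMM as in the figure: pi = (1, 0, 0)
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.1],          # b_j(o_t) for T = 4 frames (illustrative)
              [0.8, 0.2, 0.1],
              [0.2, 0.7, 0.3],
              [0.1, 0.3, 0.9]])
p = forward(pi, A, B)
assert 0.0 < p < 1.0
```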
Multi-stream HMM structure
(Figure: observation vector o_t split into four streams — stream 1: c_t, ∆c_t, ∆²c_t
(spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation) — with stream-wise densities
b_j^s(o_t^s).)

b_j(o_t) = ∏_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
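The weighted product b_j(o_t) = ∏_s (b_j^s(o_t^s))^{w_s} is simply a weighted sum in the log domain. A minimal sketch with diagonal-Gaussian stream densities; the stream dimensions, values and unit weights are hypothetical:

```python
import numpy as np

def gauss_logpdf(o, mu, var):
    """Log density of a diagonal Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var))

def multistream_logprob(streams, weights):
    """Multi-stream state-output probability:
    log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s)."""
    return sum(w * gauss_logpdf(o, mu, var)
               for w, (o, mu, var) in zip(weights, streams))

# Spectrum stream (3-dim here for brevity) and three 1-dim excitation streams
streams = [
    (np.zeros(3), np.zeros(3), np.ones(3)),   # spectrum stream
    (np.zeros(1), np.zeros(1), np.ones(1)),   # log F0
    (np.zeros(1), np.zeros(1), np.ones(1)),   # delta log F0
    (np.zeros(1), np.zeros(1), np.ones(1)),   # delta-delta log F0
]
lp = multistream_logprob(streams, [1.0, 1.0, 1.0, 1.0])
assert np.isclose(lp, -0.5 * 6 * np.log(2 * np.pi))
```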
Training process
data & labels
  → Compute variance floor (HCompV)
  → Initialize CI-HMMs by segmental k-means (HInit)        [monophone (context-independent, CI)]
  → Reestimate CI-HMMs by EM algorithm (HRest & HERest)
  → Copy CI-HMMs to CD-HMMs (HHEd CL)                      [fullcontext (context-dependent, CD)]
  → Reestimate CD-HMMs by EM algorithm (HERest)
  → Decision tree-based clustering (HHEd TB)
  → Reestimate CD-HMMs by EM algorithm (HERest)
  → Untie parameter tying structure (HHEd UT)
  → Estimated HMMs
  → Estimate CD-dur models from FB stats (HERest)
  → Decision tree-based clustering (HHEd TB)
  → Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• preceding, succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes at preceding, current, succeeding syllable
• accent, stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding, succeeding stressed, accented syllables in phrase
• # of syllables from previous, to next stressed, accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous, to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
(Figure: decision tree with questions such as R=silence?, L="w"?, L=voice?, L="gy"? —
context-dependent states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b,
t-a+n, …) are clustered into leaf nodes, which provide the synthesized states.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
(Figure: separate decision trees for mel-cepstrum and decision trees for F0.)

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
(Figure: trellis over states i and frames t = 1 … T = 8, with state i occupied from
t_0 to t_1.)

Probability to enter state i at t_0 and then leave at t_1 + 1:

χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} ∏_{t=t_0}^{t_1} b_i(o_t)
                 Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
(Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a
decision tree for state duration models.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
(Figure: the 3-state left-to-right HMM with observation sequence O = o_1 … o_T; best
state sequence Q = 1 1 1 1 2 2 3 3 …, giving state durations D = 4, 10, 5.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

(Figure: step-wise trajectory with state-wise mean and variance.)

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [ c_t^T, ∆c_t^T ]^T,   ∆c_t = c_t − c_{t−1}

(c_t: M-dim static features; o_t: 2M-dim)

Relationship between static and dynamic features can be arranged as o = W c:

  ⎡    ⋮     ⎤   ⎡ ⋯  0   I   0   0  ⋯ ⎤ ⎡    ⋮     ⎤
  ⎢ c_{t−1}  ⎥   ⎢ ⋯ −I   I   0   0  ⋯ ⎥ ⎢ c_{t−2}  ⎥
  ⎢ ∆c_{t−1} ⎥   ⎢ ⋯  0   0   I   0  ⋯ ⎥ ⎢ c_{t−1}  ⎥
  ⎢ c_t      ⎥ = ⎢ ⋯  0  −I   I   0  ⋯ ⎥ ⎢ c_t      ⎥
  ⎢ ∆c_t     ⎥   ⎢ ⋯  0   0   0   I  ⋯ ⎥ ⎢ c_{t+1}  ⎥
  ⎢ c_{t+1}  ⎥   ⎢ ⋯  0   0  −I   I  ⋯ ⎥ ⎣    ⋮     ⎦
  ⎢ ∆c_{t+1} ⎥   ⎣          ⋱          ⎦
  ⎣    ⋮     ⎦
       o       =           W                   c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
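The band matrix W shown in the slide can be built mechanically. A minimal sketch for 1-dim static features and the single delta window ∆c_t = c_t − c_{t−1} (ignoring ∆² for brevity); `delta_window_matrix` and the sample trajectory are hypothetical:

```python
import numpy as np

def delta_window_matrix(T):
    """Build W stacking static and delta rows so that o = W c,
    with Delta c_t = c_t - c_{t-1} (1-dim static features)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: c_t
        W[2 * t + 1, t] = 1.0             # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 2.0])
o = delta_window_matrix(4) @ c
# o interleaves c_t and Delta c_t
assert np.allclose(o, [0.0, 0.0, 1.0, 1.0, 3.0, 2.0, 2.0, -1.0])
```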
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = W c

If state-output distribution is single Gaussian:

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0:

W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} µ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
(Figure: the equations W^T Σ_q^{−1} W c = W^T Σ_q^{−1} µ_q written out — the 1/−1 band
structure of W makes W^T Σ_q^{−1} W a banded matrix over c_1 … c_T, with the right-hand
side built from the state means µ_{q,1} … µ_{q,T}.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
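Solving this linear system gives the smooth static trajectory. A minimal dense sketch for 1-dim features with a static+delta W and diagonal variances (a real implementation would exploit the band structure, e.g. a banded Cholesky solve); the step-wise means are hypothetical:

```python
import numpy as np

def mlpg(mu, var, W):
    """Speech parameter generation: solve W' Sigma^{-1} W c = W' Sigma^{-1} mu
    for the static trajectory c (diagonal Sigma)."""
    P = W.T * (1.0 / var)                 # W' Sigma^{-1}
    return np.linalg.solve(P @ W, P @ mu)

T = 4
W = np.zeros((2 * T, T))                  # static + delta windows
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Step-wise (static, delta) means per frame, as produced by the HMM states
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
var = np.ones(2 * T)
c = mlpg(mu, var, W)
assert c.shape == (T,)
```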
Generated speech parameter trajectory
(Figure: static and dynamic feature trajectories — state-wise means and variances with
the generated smooth trajectory c.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
Waveform reconstruction
(Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white
noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear
time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n).)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
(Figure: adaptive training over training speakers yields an average-voice model, which
is then adapted to target speakers.)

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires small data from target speaker/speaking style
  → Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
(Figure: target model λ′ interpolated among HMM sets λ1 … λ4 with ratios I(λ′, λ_k);
λ: HMM set, I(λ′, λ): interpolation ratio.)

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model mix of V/UV sounds (e.g. voiced fricatives)
  (Figure: excitation e(n) switching between white noise (unvoiced) and pulse train
  (voiced).)

• Spectral envelope extraction
  Harmonic effects often cause problems
  (Figure: power (dB) spectrum over 0–8 kHz.)

• Phase
  Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

(Figure: spectrograms (0–8 kHz) of natural vs. generated speech.)

• Why?
  − Details of spectral (formant) structure disappear
  − Use of better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
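Global-variance-style compensation works at the intra-utterance level. A minimal variance-scaling sketch (a simplification of the actual GV criterion, which maximizes likelihood under a GV model rather than matching variance exactly); `variance_scaling`, the trajectory and the target variance are hypothetical:

```python
import numpy as np

def variance_scaling(c_gen, gv_target):
    """Rescale a generated (oversmoothed) trajectory so its intra-utterance
    variance matches a target variance, keeping the mean fixed."""
    mean = c_gen.mean()
    scale = np.sqrt(gv_target / c_gen.var())
    return mean + scale * (c_gen - mean)

c = np.array([0.1, 0.2, 0.15, 0.25, 0.2])     # oversmoothed trajectory
c2 = variance_scaling(c, 0.01)
assert np.isclose(c2.var(), 0.01) and np.isclose(c2.mean(), c.mean())
```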
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree over context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs
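The mapping this slide describes can be sketched concretely; below is a minimal numpy toy (not the talk's actual system): a one-hidden-layer network trained by gradient descent with the MSE loss on invented "linguistic" inputs and "acoustic" targets. All sizes, data, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy sketch: one-hidden-layer network mapping linguistic feature
# vectors x to acoustic feature vectors y, trained with MSE loss.
rng = np.random.default_rng(0)

def init(n_in, n_hidden, n_out):
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])   # hidden activations
    return h, h @ p["W2"] + p["b2"]       # linear output layer

def train_step(p, x, y, lr=0.1):
    h, y_hat = forward(p, x)
    err = y_hat - y                       # gradient of 0.5*MSE w.r.t. y_hat
    p["W2"] -= lr * h.T @ err / len(x)
    p["b2"] -= lr * err.mean(axis=0)
    dh = (err @ p["W2"].T) * (1 - h**2)   # backprop through tanh
    p["W1"] -= lr * x.T @ dh / len(x)
    p["b1"] -= lr * dh.mean(axis=0)
    return float((err**2).mean())

# Invented data: 10-dim "linguistic" inputs, 3-dim "acoustic" targets.
X = rng.normal(size=(200, 10))
Y = X[:, :3] * 0.5                        # simple relation for the toy to learn
params = init(10, 16, 3)
losses = [train_step(params, X, Y) for _ in range(500)]
```

The point is only the shape of the computation: frame-level input features in, acoustic statistics out, one regression model in place of the tree-clustered GMMs.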
Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame positions]
Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → Integrated feature extraction to acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Framework

Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Experimental setup

Database            US English female speaker
Training/test data  33,000 & 173 sentences
Sampling rate       16 kHz
Analysis window     25-ms width, 5-ms shift
Linguistic          11 categorical features,
features            25 numeric features
Acoustic            0–39 mel-cepstrum,
features            log F0, 5-band aperiodicity, ∆, ∆²
HMM topology        5-state, left-to-right HSMM [21],
                    MSD F0 [22], MDL [23]
DNN architecture    1–5 layers, 256/512/1024/2048 units/layer,
                    sigmoid, continuous F0 [24]
Postprocessing      Postfiltering in cepstrum domain [25]
Example of speech parameter trajectories

[Figure: 5th mel-cepstrum vs. frame (0–500) for natural speech, HMM (α = 1), and DNN (4×512); w/o grouping questions & numeric contexts, silence frames removed]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)    DNN (layers × units)  Neutral  p value  z value
15.8 (16)  38.5 (4 × 256)        45.7     < 10⁻⁶   -9.9
16.1 (4)   27.2 (4 × 512)        56.8     < 10⁻⁶   -5.1
12.7 (1)   36.6 (4 × 1024)       50.7     < 10⁻⁶   -11.5
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples in the (y1, y2) plane vs. the NN prediction]

• Unimodality
− Human can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]

[Figure: 1-dim, 2-mixture MDN; the last hidden layer h feeds six outputs z1…z6, which become mixture weights w1(x), w2(x), means µ1(x), µ2(x), and standard deviations σ1(x), σ2(x) of p(y | x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function:

z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
µ1(x) = z3                                 µ2(x) = z4
σ1(x) = exp(z5)                            σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
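The activation functions on this slide are easy to sketch end to end. Below is a toy 1-dim, 2-mixture version in numpy; the six inputs z1…z6 are placeholders for the network outputs, and all numeric values are illustrative.

```python
import numpy as np

# Sketch of the mixture density output layer: six network outputs
# z1..z6 become the parameters of a 2-component Gaussian mixture.
def mdn_params(z):
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # weights: softmax
    mu = z[2:4]                               # means: linear activation
    sigma = np.exp(z[4:6])                    # std devs: exponential (> 0)
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    # p(y | x) as a 2-component Gaussian mixture
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float((w * comp).sum())

# Example: equal logits -> equal weights; means -1 and 1; unit variances.
w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
```

Because σ = exp(z) is always positive and the softmax weights sum to one, any raw network output yields a valid density, which is what makes the layer trainable by maximum likelihood.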
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN; each input x_t (t = 1…T) yields mixture weights w1(x_t), w2(x_t), means µ1(x_t), µ2(x_t), and variances σ1(x_t), σ2(x_t) → parameter generation → waveform synthesis → SPEECH]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture   4–7 hidden layers, 1024 units/hidden layer,
                   ReLU (hidden), linear (output)
DMDN architecture  4 hidden layers, 1024 units/hidden layer,
                   ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization       AdaDec [29] (variant of AdaGrad [30]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM       1 mix    3.537 ± 0.113
          2 mix    3.397 ± 0.115
DNN       4×1024   3.635 ± 0.127
          5×1024   3.681 ± 0.109
          6×1024   3.652 ± 0.108
          7×1024   3.637 ± 0.129
DMDN      1 mix    3.654 ± 0.117
(4×1024)  2 mix    3.796 ± 0.107
          4 mix    3.766 ± 0.113
          8 mix    3.805 ± 0.113
          16 mix   3.791 ± 0.102
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− Fixed number of preceding/succeeding contexts
  (e.g. ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long time span contextual effect

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Basic RNN

[Figure: network with recurrent connections unrolled over time; inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; inputs x_t and h_{t−1} feed the input gate (write), forget gate (reset), and output gate (read) through sigmoids, and the memory cell c_t through tanh; the block emits h_t]
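The gate arithmetic in the block diagram can be written out in a few lines. A minimal numpy sketch of one LSTM step (toy sizes, random weights; not the talk's implementation, and the exact gate wiring of the cited variant may differ in details such as peephole connections):

```python
import numpy as np

# One LSTM step: input/forget/output gates around a linear memory cell.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate: write
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate: reset
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate: read
    g = np.tanh(p["Wc"] @ z + p["bc"])   # candidate cell update
    c = f * c_prev + i * g               # linear memory cell
    h = o * np.tanh(c)                   # exposed hidden state
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {k: rng.normal(0, 0.5, (n_hid, n_in + n_hid)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(n_hid) for b in ("bi", "bf", "bo", "bc")})

h = c = np.zeros(n_hid)
for t in range(5):                        # run a short random sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, p)
```

The key line is `c = f * c_prev + i * g`: the cell update is additive, so information can survive many steps when the forget gate stays open, which is exactly the "better memory" the slide refers to.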
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1 … x_T to y_1 … y_T → parameter generation → waveform synthesis → SPEECH]
Experimental setup

Database         US English female speaker
Train/dev data   34,632 & 100 sentences
Sampling rate    16 kHz
Analysis window  25-ms width, 5-ms shift
Linguistic       DNN: 449
features         LSTM: 289
Acoustic         0–39 mel-cepstrum,
features         log F0, 5-band aperiodicity, (∆, ∆²)
DNN              4 hidden layers, 1024 units/hidden layer,
                 ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM             1 forward LSTM layer, 256 units, 128 projection,
                 asynchronous SGD on CPUs [35]
Postprocessing   Postfiltering in cepstrum domain [25]
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN              LSTM             Stats
w/ ∆    w/o ∆    w/ ∆    w/o ∆    Neutral  z      p
50.0    14.2     –       –        35.8     12.0   < 10⁻¹⁰
–       –        30.2    15.6     54.2     5.1    < 10⁻⁶
15.8    –        34.0    –        50.2     -6.2   < 10⁻⁹
–       28.4     –       33.6     38.0     -1.5   0.138
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Typical flow of TTS system

[Figure: TEXT → text analysis (frontend, NLP; discrete ⇒ discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction → speech synthesis (backend, speech; discrete ⇒ continuous): waveform generation → SYNTHESIZED SPEECH]

This talk focuses on backend
Concatenative speech synthesis

[Figure: target cost & concatenation cost over candidate segments ("All segments") from the database]

• Concatenate actual instances of speech from database
• Large data + automatic learning
  → High-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Statistical parametric speech synthesis (SPSS) [3]

[Figure: training (speech & text → speech analysis / text analysis → (x, y) → model training → λ) and synthesis (text → text analysis → x → parameter generation → y → speech synthesis)]

• Training
− Extract linguistic features x & acoustic features y
− Train acoustic model λ given (x, y):
  λ̂ = arg max_λ p(y | x, λ)

• Synthesis
− Extract x from text to be synthesized
− Generate most probable y from λ̂:
  ŷ = arg max_y p(y | x, λ̂)
− Reconstruct speech from ŷ
Statistical parametric speech synthesis (SPSS) [3]

[Figure: same training/synthesis pipeline as above]

• Large data + automatic training
  → Automatic voice building
• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]

[Figure: training part (SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs, with labels from text analysis → context-dependent HMMs & state duration models) and synthesis part (TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH)]
Speech production process

[Figure: text (concept) → modulation of carrier wave by speech information: fundamental frequency (voiced/unvoiced), frequency transfer characteristics, magnitude, start–end; air flow drives the sound source (voiced: pulse, unvoiced: noise), shaped by the vocal tract → speech]
Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) * e(n)]

x(n) = h(n) * e(n)
  ↓ Fourier transform
X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source excitation part; H(e^jω): vocal tract resonance part

H(e^jω) should be defined by HMM state-output vectors,
e.g. mel-cepstrum, line spectral pairs
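The relation x(n) = h(n) * e(n), and its frequency-domain form X = H·E, can be checked numerically. A sketch with a toy pulse-train excitation and a made-up decaying impulse response (both invented for illustration):

```python
import numpy as np

# Source-filter sketch: pulse-train excitation e(n) through an LTI
# system h(n); x(n) = h(n) * e(n) in time, X = H * E in frequency.
fs, f0 = 16000, 200                 # 16 kHz sampling, 200 Hz pitch (toy)
n = np.arange(400)
e = np.zeros(len(n))
e[:: fs // f0] = 1.0                # one pulse every fs/f0 = 80 samples

h = 0.9 ** np.arange(64)            # toy decaying impulse response
x = np.convolve(e, h)               # time domain: full linear convolution

# Frequency domain: with zero-padding past len(x), the convolution
# theorem holds exactly: X(e^jw) = H(e^jw) E(e^jw)
N = 1024
X = np.fft.rfft(x, N)
HE = np.fft.rfft(h, N) * np.fft.rfft(e, N)
```

Swapping the pulse train for white noise in `e` gives the unvoiced branch of the same model; only the excitation changes, not the filter.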
Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (0–5 kHz); (a) ML-based cepstral analysis, (b) linear prediction]
HMM-based speech synthesis [4]

[Figure: same training/synthesis pipeline as before]
Structure of state-output (observation) vectors

[Figure: o_t = spectrum part (mel-cepstral coefficients c_t, ∆c_t, ∆²c_t) + excitation part (log F0 p_t, ∆ log F0 δp_t, ∆² log F0 δ²p_t)]
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1 o2 … oT aligned to state sequence Q = 1 1 1 1 2 2 3 3 …]
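The quantities in the figure (π, a_ij, b_j(o_t)) are all that is needed to compute p(O | λ) with the standard forward algorithm. A toy numpy sketch with a 3-state left-to-right topology and 1-dim Gaussian state-output densities (all numbers invented):

```python
import numpy as np

# Forward algorithm for a toy 3-state left-to-right HMM with
# Gaussian state-output densities b_j(o_t).
def gauss(o, mu, var):
    return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

pi = np.array([1.0, 0.0, 0.0])        # left-to-right: start in state 1
A = np.array([[0.6, 0.4, 0.0],        # a11 a12
              [0.0, 0.7, 0.3],        #     a22 a23
              [0.0, 0.0, 1.0]])       #         a33
mu = np.array([-1.0, 0.0, 1.0])       # per-state output means
var = np.array([0.5, 0.5, 0.5])       # per-state output variances

def forward(obs):
    alpha = pi * gauss(obs[0], mu, var)
    for o in obs[1:]:
        alpha = (alpha @ A) * gauss(o, mu, var)
    return float(alpha.sum())          # p(O | lambda)

p = forward(np.array([-1.0, -0.9, 0.1, 0.2, 1.1]))
```

An observation sequence that sweeps through the state means in left-to-right order scores higher than the same sequence reversed, which is the topology doing its job.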
Multi-stream HMM structure

[Figure: observation vector o_t split into four streams: o_t¹ = (c_t, ∆c_t, ∆²c_t) (spectrum), o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t (excitation), each with its own output distribution b_sj]

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}
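The weighted-product output probability above is one line of code; a sketch (the stream probabilities and weights below are made-up numbers, not values from the talk):

```python
import numpy as np

# Multi-stream output probability: b_j(o_t) = prod_s b_sj(o_t^s) ** w_s
def multistream_prob(stream_probs, weights):
    stream_probs = np.asarray(stream_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.prod(stream_probs ** weights))

# With all stream weights set to 1 this reduces to a plain product
# over the four streams (spectrum + three F0 streams).
b = multistream_prob([0.5, 0.2, 0.1, 0.4], [1, 1, 1, 1])
```

The exponents w_s let one stream (e.g. spectrum) count more or less than another in the combined state-output score.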
Training process

[Flowchart: data & labels → compute variance floor (HCompV) → initialize monophone (context-independent, CI) HMMs by segmental k-means (HInit) → reestimate CI-HMMs by EM algorithm (HRest & HERest) → copy CI-HMMs to fullcontext (context-dependent, CD) HMMs (HHEd CL) → reestimate CD-HMMs by EM algorithm (HERest) → decision tree-based clustering (HHEd TB) → reestimate CD-HMMs by EM algorithm (HERest) → untie parameter tying structure (HHEd UT) → estimated HMMs; in parallel, estimate CD duration models from FB stats (HERest) → decision tree-based clustering (HHEd TB) → estimated duration models]
Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models
Decision tree-based state clustering [7]

[Figure: binary tree with yes/no questions such as L=voice, L="w", R=silence, L="gy"; leaf nodes cluster context-dependent states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into synthesized states]
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
State duration models [8]

[Figure: timeline t = 1 … T = 8 showing occupancy of state i from t0 to t1]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models
Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
HMM-based speech synthesis [4]

[Figure: same training/synthesis pipeline as before]
Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Best state sequence

[Figure: 3-state left-to-right HMM as before, with observation sequence O, state sequence Q = 1 1 1 1 2 2 3 3 …, and state durations D = 4, 10, 5]
Best state outputs w/o dynamic features

[Figure: per-state means and variances; ô becomes a step-wise mean vector sequence]
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,   ∆c_t = c_t − c_{t−1}

Relationship between static and dynamic features can be arranged as o = W c:

    ⎡    ⋮    ⎤   ⎡ ⋯  0   I   0   0  ⋯ ⎤ ⎡    ⋮    ⎤
    ⎢ c_{t−1}  ⎥   ⎢ ⋯ −I   I   0   0  ⋯ ⎥ ⎢ c_{t−2} ⎥
    ⎢ ∆c_{t−1} ⎥   ⎢ ⋯  0   0   I   0  ⋯ ⎥ ⎢ c_{t−1} ⎥
o = ⎢ c_t      ⎥ = ⎢ ⋯  0  −I   I   0  ⋯ ⎥ ⎢ c_t     ⎥ = W c
    ⎢ ∆c_t     ⎥   ⎢ ⋯  0   0   0   I  ⋯ ⎥ ⎢ c_{t+1} ⎥
    ⎢ c_{t+1}  ⎥   ⎢ ⋯  0   0  −I   I  ⋯ ⎥ ⎣    ⋮    ⎦
    ⎣ ∆c_{t+1} ⎦   ⎣          ⋮          ⎦
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = W c

If state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(W c; µ_q̂, Σ_q̂) / ∂c = 0:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ µ_q̂
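The linear system above can be solved directly for a toy utterance. A numpy sketch that builds W for static + ∆ features (using the slide's ∆c_t = c_t − c_{t−1}, with the boundary assumption c_{−1} = 0, which is this sketch's choice) and identity precisions, then recovers the smooth static trajectory:

```python
import numpy as np

# Toy MLPG-style generation: solve W' P W c = W' P mu for c,
# with W stacking a static row and a delta row per frame.
T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                 # static row: c_t
    W[2 * t + 1, t] = 1.0             # delta row: c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0    # (c_{-1} = 0 at the boundary)

# Step-wise state means: statics jump from 0 to 1, delta means say
# "no change" -- the constraint that forces smoothing.
mu = np.zeros(2 * T)
mu[0::2] = [0.0, 0.0, 1.0, 1.0, 1.0]  # static means
prec = np.eye(2 * T)                   # Sigma^{-1} (identity for the toy)

A = W.T @ prec @ W
b = W.T @ prec @ mu
c = np.linalg.solve(A, b)              # smooth static trajectory
```

Instead of reproducing the step-wise means exactly, `c` ramps smoothly from near 0 to near 1: the delta rows penalize frame-to-frame jumps, which is precisely the dynamic feature constraint in action.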
Speech parameter generation algorithm [9]

[Figure: band structure of W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ µ_q, with per-state means µ_{q1} … µ_{qT} on the right-hand side and static parameters c_1 … c_T as the unknowns]
Generated speech parameter trajectory

[Figure: static and dynamic means and variances with the smooth generated trajectory c]
HMM-based speech synthesis [4]

[Figure: same training/synthesis pipeline as before]
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n) → synthesized speech x(n) = h(n) * e(n)]
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from target speaker/speaking style
  → Small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new model λ′ interpolated among HMM sets λ1 … λ4 with ratios I(λ′, λ1) … I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
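The simplest version of the idea above is a convex combination of the models' mean vectors; a sketch (the voice_a/voice_b vectors are invented, and real interpolation also combines variances and weights):

```python
import numpy as np

# Sketch of voice interpolation: mix the mean vectors of K
# representative models with ratios that sum to one.
def interpolate_means(means, ratios):
    means = np.asarray(means, dtype=float)    # shape (K, D)
    ratios = np.asarray(ratios, dtype=float)  # shape (K,)
    assert np.isclose(ratios.sum(), 1.0)
    return ratios @ means                     # weighted average

voice_a = np.array([1.0, 2.0, 3.0])           # toy mean vector, voice A
voice_b = np.array([3.0, 0.0, 1.0])           # toy mean vector, voice B
mixed = interpolate_means([voice_a, voice_b], [0.5, 0.5])
```

Sliding the ratio pair from (1, 0) to (0, 1) morphs the synthetic voice continuously from A to B, which is what makes new voices possible without adaptation data.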
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model mix of V/UV sounds (e.g. voiced fricatives)

[Figure: switching excitation e(n): pulse train (voiced) / white noise (unvoiced)]

• Spectral envelope extraction:
  Harmonic effects often cause problems

[Figure: power (dB) vs. frequency (0–8 kHz) with harmonic ripple]

• Phase:
  Important, but usually ignored
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech
Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Linguistic → acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g., phone identity, POS, stress, # of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision trees over the acoustic space]
• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: DNN-based SPSS framework. TEXT → text analysis → input feature extraction (with duration prediction) → input features including binary & numeric features at frames 1 … T (duration and frame-position features) → DNN (input, hidden, output layers) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ Integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed
[Figure: 5-th mel-cepstrum trajectory over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶     −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Limitations of DNN-based acoustic modeling
[Figure: data samples in (y1, y2) space vs the NN prediction]
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− An NN trained with MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN whose output units give w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x) defining a GMM over y]
Inputs of the activation functions: zj = Σ_{i=1}^{4} hi wij
Weights → softmax activation function:
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)
Means → linear activation function:
µ1(x) = z3,  µ2(x) = z4
Variances → exponential activation function:
σ1(x) = exp(z5),  σ2(x) = exp(z6)
NN + mixture model (GMM) → the NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
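The activation-function recipe above can be written down directly. A minimal numpy sketch of a 1-dim, 2-mix mixture density output layer; the packed layout of the raw activations z (weights, then means, then standard deviations) is an assumption for illustration:

```python
import numpy as np

def mdn_output_layer(z, n_mix=2):
    """Map raw output activations z to GMM parameters:
    softmax -> weights, identity -> means, exp -> standard deviations."""
    z = np.asarray(z, dtype=float)
    zw, zmu, zsig = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:3 * n_mix]
    w = np.exp(zw - zw.max())            # softmax, shifted for stability
    w /= w.sum()
    mu = zmu                             # linear activation
    sigma = np.exp(zsig)                 # exponential activation keeps sigma > 0
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """GMM density p(y | x) defined by the MDN outputs."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.sum(w * comp))

# Equal weights, means -1 and 1, unit variances:
w, mu, sigma = mdn_output_layer([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
p = mdn_density(0.0, w, mu, sigma)
```

Unlike a linear output layer, the resulting density can be multimodal, which matches the one-to-many nature of the mapping.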
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, … xT; at each frame the DMDN outputs mixture parameters w1, w2, µ1, µ2, σ1, σ2; parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM       1 mix    3.537 ± 0.113
          2 mix    3.397 ± 0.115
DNN       4×1024   3.635 ± 0.127
          5×1024   3.681 ± 0.109
          6×1024   3.652 ± 0.108
          7×1024   3.637 ± 0.129
DMDN      1 mix    3.654 ± 0.117
(4×1024)  2 mix    3.796 ± 0.107
          4 mix    3.766 ± 0.113
          8 mix    3.805 ± 0.113
          16 mix   3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN with recurrent connections, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]
• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block. A linear memory cell c_t is surrounded by multiplicative gates, each fed by x_t and h_{t−1} with sigmoid activations and biases b_i, b_f, b_o; tanh nonlinearities on the cell input (bias b_c) and cell output; input gate = write, forget gate = reset, output gate = read; the block emits h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
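The gate structure in the figure can be sketched as a single numpy step. The packed weight layout [i, f, o, g] and all sizes below are assumptions for illustration, not the configuration used in the talk:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: gates read [x, h_prev]; the memory cell is linear,
    surrounded by multiplicative input (write), forget (reset), and
    output (read) gates. W maps the concatenated input to the four
    pre-activations [i, f, o, g]."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])        # input gate: write
    f = sigmoid(z[1 * n:2 * n])        # forget gate: reset
    o = sigmoid(z[2 * n:3 * n])        # output gate: read
    g = np.tanh(z[3 * n:4 * n])        # candidate cell input
    c = f * c_prev + i * g             # linear memory cell update
    h = o * np.tanh(c)                 # gated output
    return h, c

rng = np.random.default_rng(0)
x_dim, n = 3, 4
W = rng.standard_normal((4 * n, x_dim + n)) * 0.1
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for t in range(5):                     # run a short input sequence
    h, c = lstm_step(rng.standard_normal(x_dim), h, c, W, b)
```

Because the cell update is additive, gradients and information can persist over many steps, which is what gives the LSTM its longer memory compared with the basic RNN.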
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, … xT → LSTM → outputs y1, y2, … yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev-set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2    < 10⁻⁹
–          28.4        –           33.6         38.0      −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Concatenative speech synthesis
[Figure: unit selection over all segments, minimizing target and concatenation costs]
• Concatenate actual instances of speech from a database
• Large data + automatic learning → high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS pipeline. Training: speech analysis extracts acoustic features y and text analysis extracts linguistic features x from paired speech & text; model training yields λ. Synthesis: text analysis extracts x, parameter generation produces y, speech synthesis reconstructs the waveform]
• Training
− Extract linguistic features x & acoustic features y
− Train acoustic model λ given (x, y):
  λ̂ = arg max_λ p(y | x, λ)
• Synthesis
− Extract x from the text to be synthesized
− Generate the most probable y from λ̂:
  ŷ = arg max_y p(y | x, λ̂)
− Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: the same SPSS pipeline diagram]
• Large data + automatic training → automatic voice building
• Parametric representation of speech → flexible to change its voice characteristics
Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system. Training part: spectral & excitation parameter extraction from the SPEECH DATABASE, plus labels, are used for training context-dependent HMMs & state duration models. Synthesis part: text analysis turns TEXT into labels; parameter generation from HMMs produces spectral & excitation parameters; excitation generation and the synthesis filter output SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process
[Figure: speech production as modulation of a carrier wave by speech information. Text (concept) controls the fundamental frequency (voiced/unvoiced) of the sound source (voiced: pulse, unvoiced: noise) and the frequency transfer characteristics, magnitude, and start–end timing of the vocal tract, driven by air flow → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Figure: excitation e(n) (pulse train / white noise) drives a linear time-invariant system h(n), producing speech x(n)]
x(n) = h(n) ∗ e(n)
↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})
Source excitation part: E(e^{jω}); vocal tract resonance part: H(e^{jω})
H(e^{jω}) should be defined by the HMM state-output vectors, e.g., mel-cepstrum or line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
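The relation x(n) = h(n) ∗ e(n) and its frequency-domain counterpart X = H·E can be checked numerically; the toy pulse-train excitation and impulse response below are made up for illustration:

```python
import numpy as np

# Toy voiced excitation: a pulse train with a period of 5 samples.
e = np.zeros(20)
e[::5] = 1.0

# Toy impulse response of the vocal tract filter (values are illustrative).
h = np.array([1.0, 0.5, 0.25])

# Time domain: x(n) = h(n) * e(n)
x = np.convolve(e, h)

# Frequency domain: X = H . E (convolution theorem), with both
# transforms zero-padded to the full output length.
N = len(x)
X = np.fft.rfft(h, N) * np.fft.rfft(e, N)
x_freq = np.fft.irfft(X, N)
```

The two routes agree sample-for-sample, which is exactly the statement X(e^{jω}) = H(e^{jω})E(e^{jω}) above.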
Parametric models of speech signal
Autoregressive (AR) model:  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})
Exponential (EX) model:     H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}
Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)
• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
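For the exponential model, the log spectrum follows directly from the cepstral coefficients: on the unit circle, log H(e^{jω}) = Σ_m c(m) e^{−jωm}, which a single FFT evaluates at all frequencies. A minimal sketch; the coefficient values are made up:

```python
import numpy as np

# Hypothetical low-order cepstral coefficients c(0..M).
c = np.array([0.5, 0.3, -0.1, 0.05])

# EX model: H(z) = exp sum_m c(m) z^-m, so on the unit circle
# log H(e^{jw}) = sum_m c(m) e^{-jwm}; its real part is log |H|.
N = 256
log_H = np.fft.rfft(c, N)     # evaluates sum_m c(m) e^{-jwm} on a grid
log_mag = log_H.real          # log |H(e^{jw})| at N//2 + 1 frequencies
```

At ω = 0 the log magnitude is simply Σ_m c(m), a quick sanity check on the transform.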
Examples of speech spectra
[Figure: log magnitude spectra (dB) over 0–5 kHz: (a) ML-based cepstral analysis, (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
Structure of state-output (observation) vectors
[Figure: the state-output vector o_t]
Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1 o2 … oT, state sequence Q = 1 1 1 1 2 2 3 3 …]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
[Figure: the observation vector o_t is split into streams: stream 1, o_t¹ = (c_t, ∆c_t, ∆²c_t) for the spectrum; streams 2–4, o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t for the excitation]
The state-output probability is a weighted product over the S streams:
b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
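The weighted product over streams is usually computed in the log domain, where it becomes a weighted sum of per-stream log densities. A small sketch with made-up Gaussian stream distributions (the means, variances, and weights are illustrative only):

```python
import numpy as np

def gaussian_logpdf(o, mu, var):
    """Log density of a diagonal Gaussian stream distribution."""
    o, mu, var = map(np.atleast_1d, (o, mu, var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var))

def multi_stream_log_prob(stream_logps, weights):
    """log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s): the weighted product
    of per-stream densities, evaluated in the log domain."""
    return sum(w * lp for w, lp in zip(weights, stream_logps))

# Toy example: a 2-dim spectrum stream and a scalar excitation (log F0) stream.
lp_spec = gaussian_logpdf([0.2, -0.1], mu=[0.0, 0.0], var=[1.0, 1.0])
lp_f0 = gaussian_logpdf(4.7, mu=4.5, var=0.04)
log_b = multi_stream_log_prob([lp_spec, lp_f0], weights=[1.0, 1.0])
```

With all stream weights equal to 1, this reduces to the ordinary joint log likelihood of the independent streams.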
Training process
[Flowchart: data & labels → monophone (context-independent, CI) models → fullcontext (context-dependent, CD) models → estimated HMMs & duration models]
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie the parameter-tying structure (HHEd UT)
9. Estimate CD duration models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes in {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase
Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree with yes/no questions such as L=voice, L=w, R=silence, L=gy; leaf nodes hold synthesized (clustered) states shared by contexts such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]
Spectrum & excitation can have different context dependency → build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: occupancy of state i from t0 to t1 along t = 1 … T = 8]
Probability of entering state i at t0 and leaving at t1 + 1:
χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)
→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words w:
ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)
Determine the best state sequence and outputs sequentially:
q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: the same 3-state left-to-right HMM; observation sequence O = o1 … oT, best state sequence Q̂ = 1 1 1 1 2 2 3 3 …, state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features
[Figure: per-state mean and variance over time]
ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:
o_t = [c_tᵀ, ∆c_tᵀ]ᵀ,  ∆c_t = c_t − c_{t−1}
(each c_t has M static coefficients, so o_t has 2M components)
The relationship between static and dynamic features can be arranged in matrix form:
o = Wc
where each static row of W copies c_t (… 0 I 0 …) and each delta row forms c_t − c_{t−1} (… −I I 0 …), giving a sparse banded matrix over … c_{t−2}, c_{t−1}, c_t, c_{t+1} …
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce the dynamic feature constraints:
ô = arg max_o p(o | q̂, λ)  subject to  o = Wc
If the state-output distribution is a single Gaussian,
p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)
Setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0 gives
WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹µ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
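The closed-form solution above is easy to reproduce for a 1-dimensional static feature: build W from ∆c_t = c_t − c_{t−1} and solve the normal equations. A minimal numpy sketch; the boundary condition ∆c_1 = c_1 and all the numbers are simplifying assumptions for illustration:

```python
import numpy as np

def build_W(T):
    """Stack static and delta rows so that o = [c_1, dc_1, ..., c_T, dc_T] = W c,
    with dc_t = c_t - c_{t-1} (and dc_1 = c_1 here, for simplicity)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                # static row: c_t
        W[2 * t + 1, t] = 1.0            # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # ... minus c_{t-1}
    return W

def mlpg(mu, var):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c
    (single Gaussian per frame, diagonal covariance)."""
    T = len(mu) // 2
    W = build_W(T)
    P = np.diag(1.0 / np.asarray(var))   # Sigma^-1
    A = W.T @ P @ W
    b = W.T @ P @ np.asarray(mu)
    return np.linalg.solve(A, b)

# Step-wise static means (two 'states'), zero delta means, unit variances:
T = 6
mu = np.zeros(2 * T)
mu[0:6:2] = 0.0    # static means, frames 1-3
mu[6::2] = 1.0     # static means, frames 4-6
c = mlpg(mu, np.ones(2 * T))
```

The step-wise means turn into a smooth, monotonically increasing trajectory from near 0 to near 1, which is exactly the effect of the dynamic-feature constraint on the generated parameters.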
Speech parameter generation algorithm [9]
[Figure: the banded structure of the linear system WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹µ_q̂, with unknowns c1 … cT and state means µ_q1 … µ_qT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated static & dynamic speech parameter trajectory c against the per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers]
• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new HMM set λ′ interpolated among representative HMM sets λ1 … λ4 with interpolation ratios I(λ′, λk)]
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Vocoding issues
• Simple pulse/noise excitation: difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]
• Spectral envelope extraction: harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz]
• Phase: important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress, # of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
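To make the input side concrete, here is a minimal sketch of turning categorical and numeric linguistic features into a model input vector. The phone inventory and the particular features chosen are hypothetical, for illustration only, not the feature set used in these experiments.

```python
import numpy as np

# Toy phone inventory -- illustrative only, not the real set.
PHONES = ['sil', 'a', 'w', 't']

def encode(phone, syl_pos, n_syl_in_word, stressed):
    """Concatenate a one-hot (binary) phone identity with numeric features."""
    onehot = np.zeros(len(PHONES))
    onehot[PHONES.index(phone)] = 1.0
    numeric = np.array([float(syl_pos), float(n_syl_in_word), float(stressed)])
    return np.concatenate([onehot, numeric])

# Binary part encodes the category; numeric part carries positions/counts.
x = encode('a', syl_pos=1, n_syl_in_word=2, stressed=True)
```

In a real system, dozens of such categorical and numeric context features are concatenated per frame.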
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision tree (yes/no questions) partitioning the acoustic space]
• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: DNN-based TTS framework — TEXT → text analysis → input feature extraction (binary & numeric features, duration & frame-position features for frames 1…T) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies frame-level inputs]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)
[Figure: 5-th mel-cepstrum trajectories over frames 0–500 — natural speech, HMM (α = 1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8% (16) | 38.5% (4 × 256) | 45.7% | < 10^-6 | −9.9
16.1% (4) | 27.2% (4 × 512) | 56.8% | < 10^-6 | −5.1
12.7% (1) | 36.6% (4 × 1024) | 50.7% | < 10^-6 | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: scatter of data samples in (y1, y2) with the NN prediction at the conditional mean]
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances
Linear output layer → Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix mixture density network mapping input x to GMM parameters w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) over y]
• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function
Inputs of the activation functions: z_j = Σ_{i=1}^4 h_i w_{ij}
w1(x) = exp(z1) / Σ_{m=1}^2 exp(z_m),  μ1(x) = z3,  σ1(x) = exp(z5)
w2(x) = exp(z2) / Σ_{m=1}^2 exp(z_m),  μ2(x) = z4,  σ2(x) = exp(z6)
NN + mixture model (GMM) → NN outputs GMM weights, means, & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
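The three activation functions above can be sketched as a small numpy routine; the layout of z (weight inputs first, then means, then the log-variance inputs) is an assumption for illustration:

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split raw output-layer activations z into GMM parameters for a
    1-dim target: softmax -> weights, linear -> means, exp -> std devs."""
    zw, zmu, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())
    w /= w.sum()                 # softmax: weights sum to one
    mu = zmu                     # linear activation
    sigma = np.exp(zs)           # exponential activation: always positive
    return w, mu, sigma

# 1-dim, 2-mix example: z = [z1, z2, z3, z4, z5, z6] as on the slide
w, mu, sigma = mdn_params(np.array([0.1, -0.3, 1.5, -0.8, 0.0, 0.2]), 2)
```

The softmax and exponential keep the weights a valid distribution and the variances positive regardless of the raw activations.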
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS framework — TEXT → text analysis → input feature extraction → duration prediction → DMDN outputs per-frame GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for t = 1…T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} mapped to outputs y_{t−1}, y_t, y_{t+1} through recurrent connections]
• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from inputs → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block — input gate (write), output gate (read), and forget gate (reset) around a linear memory cell c_t; gates driven by x_t and h_{t−1} through sigmoid units, cell input/output through tanh]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
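A minimal numpy sketch of one LSTM step matching the gate roles above (peephole connections omitted; the weights here are random placeholders, not trained values):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM cell: the input gate writes, the forget
    gate resets, and the output gate reads the linear memory cell."""
    xh = np.concatenate([x, h_prev])
    i = sigm(p['Wi'] @ xh + p['bi'])                       # input gate (write)
    f = sigm(p['Wf'] @ xh + p['bf'])                       # forget gate (reset)
    o = sigm(p['Wo'] @ xh + p['bo'])                       # output gate (read)
    c = f * c_prev + i * np.tanh(p['Wc'] @ xh + p['bc'])   # memory cell update
    h = o * np.tanh(c)                                     # block output
    return h, c

rng = np.random.default_rng(0)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p.update({k: np.zeros(nh) for k in ('bi', 'bf', 'bo', 'bc')})
h, c = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
```

Because the cell update is linear in c_prev, gradients can flow across many time steps when the forget gate stays open.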
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS framework — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x_1…x_T to y_1…y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0% | 14.2% | – | – | 35.8% | 12.0 | < 10^-10
– | – | 30.2% | 15.6% | 54.2% | 5.1 | < 10^-6
15.8% | – | 34.0% | – | 50.2% | −6.2 | < 10^-9
28.4% | – | – | 33.6% | 38.0% | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS pipeline — speech analysis & text analysis extract acoustic features y and linguistic features x for model training (λ); at synthesis time, text analysis gives x, and parameter generation + speech synthesis produce speech]
• Training
− Extract linguistic features x & acoustic features y
− Train acoustic model λ given (x, y):
λ̂ = arg max_λ p(y | x, λ)
• Synthesis
− Extract x from the text to be synthesized
− Generate the most probable y from λ̂:
ŷ = arg max_y p(y | x, λ̂)
− Reconstruct speech from ŷ
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79
Statistical parametric speech synthesis (SPSS) [3]
[Figure: SPSS pipeline — training (speech & text analysis → model training, λ) and synthesis (text analysis → parameter generation → speech synthesis)]
• Large data + automatic training → Automatic voice building
• Parametric representation of speech → Flexible to change its voice characteristics
Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: spectral & excitation parameter extraction from the SPEECH DATABASE, labels → training HMMs → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process
[Figure: speech production as modulation of a carrier wave by speech information — text (concept) controls fundamental frequency, voiced/unvoiced decision, frequency transfer characteristics, magnitude, and start–end timing; a sound source (voiced: pulse, unvoiced: noise) driven by air flow produces speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Figure: source-filter model — pulse train / white noise excitation e(n) drives a linear time-invariant system h(n)]
x(n) = h(n) * e(n)
↓ Fourier transform
X(e^jω) = H(e^jω) E(e^jω)
Source (excitation) part: E(e^jω); vocal tract resonance part: H(e^jω)
H(e^jω) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
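A toy rendering of the source-filter relation x(n) = h(n) * e(n), with a pulse-train excitation and an assumed, purely illustrative impulse response:

```python
import numpy as np

fs = 16000                          # sampling rate (Hz)
f0 = 100.0                          # fundamental frequency (Hz)
n_samples = 1600                    # 100 ms of signal

e = np.zeros(n_samples)
e[::int(fs / f0)] = 1.0             # voiced excitation: pulse train at F0

h = 0.97 ** np.arange(64)           # toy impulse response of the LTI system
x = np.convolve(e, h)[:n_samples]   # x(n) = h(n) * e(n)
```

For unvoiced segments the pulse train would be replaced by white noise, e.g. `np.random.default_rng(0).standard_normal(n_samples)`.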
Parametric models of speech signal
Autoregressive (AR) model:
H(z) = K / (1 − Σ_{m=1}^M c(m) z^{−m})
Exponential (EX) model:
H(z) = exp Σ_{m=0}^M c(m) z^{−m}
Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)
• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
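Since |exp(z)| = exp(Re z), the log-magnitude spectrum of the EX model is simply the real part of the Fourier transform of the cepstral coefficients; a small sketch:

```python
import numpy as np

def ex_log_magnitude(c, n_fft=512):
    """log|H(e^jw)| for the exponential model H(z) = exp(sum_m c(m) z^-m).

    On the unit circle, log H(e^jw) = sum_m c(m) e^{-jwm}, and
    log|H| is its real part."""
    return np.real(np.fft.rfft(c, n_fft))

logmag = ex_log_magnitude(np.array([1.0, 0.5, 0.25]))
```

With only c(0) nonzero, the model is an all-pass gain, so the log magnitude is flat at c(0).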
Examples of speech spectra
[Figure: example speech spectra, log magnitude (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: spectral & excitation parameter extraction from the SPEECH DATABASE, labels → training HMMs → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
[Figure: structure of the state-output (observation) vector o_t]
Spectrum part: mel-cepstral coefficients c_t, ∆c_t, ∆²c_t
Excitation part: log F0 p_t, δp_t, δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1…o_T aligned to state sequence Q = 1,1,1,1,2,2,3,3]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
[Figure: multi-stream HMM — the observation o_t is split into streams: o_t^1 = [c_t, ∆c_t, ∆²c_t] (spectrum) and o_t^2, o_t^3, o_t^4 = p_t, δp_t, δ²p_t (excitation)]
b_j(o_t) = Π_{s=1}^S (b_j^s(o_t^s))^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
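In the log domain the stream-weighted product above becomes a weighted sum, which is how it is typically computed in practice:

```python
import numpy as np

def multi_stream_log_b(log_b_streams, weights):
    """log b_j(o_t) = sum_s w_s * log b_sj(o_t^s):
    the stream-weighted product of per-stream output probabilities,
    evaluated in the log domain for numerical stability."""
    return float(np.dot(weights, log_b_streams))

# Four streams (spectrum + three excitation streams), all with weight 1
log_b = multi_stream_log_b(np.array([-2.0, -3.0, -1.0, -0.5]), np.ones(4))
```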
Training process
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter-tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated HMMs & estimated duration models (from data & labels)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase
Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree-based state clustering — yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"? split fullcontext states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into leaf nodes of synthesized states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]
Spectrum & excitation can have different context dependency → build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: state i occupied from t0 to t1 on a timeline t = 1…T = 8]
Probability to enter state i at t0 and then leave at t1 + 1:
χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)
→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:
ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)
Determine the best state sequence and outputs sequentially:
q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with observation sequence O = o_1…o_T, best state sequence Q = 1,1,1,1,2,2,3,3, and state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features
[Figure: step-wise trajectory following the state means and variances]
ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:
o_t = [c_t^T, ∆c_t^T]^T,  ∆c_t = c_t − c_{t−1}
The relationship between static and dynamic features can be arranged in matrix form as o = W c, where o stacks [… c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1} …], c stacks the static features [… c_{t−2}, c_{t−1}, c_t, c_{t+1} …], and W is a sparse band matrix: each static row places I at c_t, and each delta row places −I at c_{t−1} and I at c_t.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
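The band structure of W can be sketched numerically. This toy version uses 1-dim static features per frame and the boundary convention ∆c_0 = c_0 (i.e., c_{−1} = 0), which are simplifying assumptions for illustration:

```python
import numpy as np

def build_W(T):
    """Band matrix W relating static features c (T frames, 1-dim) to the
    stacked static+delta observations o = W c, with Delta c_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: picks out c_t
        W[2 * t + 1, t] = 1.0           # delta row: +c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # ... - c_{t-1}
    return W

# For c = [1, 3, 2], o interleaves statics and deltas
o = build_W(3) @ np.array([1.0, 3.0, 2.0])
```

With higher-dimensional features and ∆² rows, the scalars ±1 become ±I blocks and the band widens, but the structure is the same.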
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:
ô = arg max_o p(o | q̂, λ) subject to o = W c
If the state-output distribution is a single Gaussian:
p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)
By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:
W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
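A toy numerical version of this normal-equation solve (identity precision and 1-dim features, chosen only to keep the sketch small; real implementations exploit the band structure rather than forming dense matrices):

```python
import numpy as np

# Solve W^T S^-1 W c = W^T S^-1 mu for the static trajectory c.
T = 3
W = np.zeros((2 * T, T))            # static+delta band matrix, as before
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])  # stacked [static, delta] means
S_inv = np.eye(2 * T)                           # toy precision Sigma^-1
A = W.T @ S_inv @ W
b = W.T @ S_inv @ mu
c = np.linalg.solve(A, b)                       # smooth static trajectory
```

Because the delta rows couple adjacent frames, the solution c is a compromise between the static means and the delta means, which is exactly the smoothing effect shown in the generated-trajectory figure.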
Speech parameter generation algorithm [9]
[Figure: band structure of the linear system W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂ — the left-hand side is a banded T×T system in the static features c_1…c_T; the right-hand side stacks the precision-weighted means μ_q̂1…μ_q̂T]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated speech parameter trajectory — static and dynamic means and variances, with the smooth static trajectory c passing through them]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: spectral & excitation parameter extraction from the SPEECH DATABASE, labels → training HMMs → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: waveform reconstruction — generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) * e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model from training speakers, followed by adaptation to target speakers]
• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1…λ4 with interpolation ratios I(λ′, λ) giving a new voice λ′]
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS), Flexibility, Improvements
Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS, Deep mixture density network (DMDN)-based SPSS, Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
[Figure: pulse train (voiced) and white noise (unvoiced) switching to form the excitation e(n)]
• Spectral envelope extraction: harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]
• Phase: important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation; eigenvoice; CAT; multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS & stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees (yes/no questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
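The frame-level mapping this slide describes — a feedforward network from linguistic features to acoustic-feature statistics in place of decision trees and GMMs — can be sketched as follows. Sigmoid hidden units and a linear output layer match the experimental setup described later; all names and shapes are illustrative:

```python
import numpy as np

def dnn_forward(x, layers):
    # Frame-level DNN sketch for SPSS: stacked affine + sigmoid hidden
    # layers followed by a linear output layer. `layers` is a list of
    # (W, b) pairs; the last pair is the output layer.
    h = np.asarray(x, dtype=float)
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden units
    W, b = layers[-1]
    return W @ h + b                             # linear output layer
```

At synthesis time the same forward pass is run for every frame's linguistic feature vector.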
Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature) at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction supplies frame positions; parameter generation and waveform synthesis produce SPEECH]
Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum vs. frame (0–500) for natural speech, HMM (α = 1), and DNN (4×512)]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples vs. NN prediction in (y1, y2) space]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; output-layer activations z1 … z6 map to w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function:  z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)     w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
µ1(x) = z3                                  µ2(x) = z4
σ1(x) = exp(z5)                             σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
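The three activation functions above can be written out directly. A sketch for the 1-dim, 2-mix case, where z is the vector of raw output-layer activations (the function name is illustrative):

```python
import numpy as np

def mdn_params(z):
    # Map raw output-layer activations z = [z1..z6] of a 1-dim, 2-mixture
    # MDN to valid GMM parameters, as on the slide.
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # mixture weights: softmax
    mu = z[2:4]                               # component means: linear
    sigma = np.exp(z[4:6])                    # std deviations: exponential
    return w, mu, sigma
```

The softmax guarantees the weights sum to one, and the exponential keeps the variances positive, so any real-valued network output yields a valid mixture density.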
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs mixture weights w, means µ, and variances σ at each frame → parameter generation → waveform synthesis → SPEECH]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Basic RNN

[Figure: RNN unrolled over time; recurrent connections carry the hidden state across inputs x_{t−1}, x_t, x_{t+1} to outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t; input gate (write), output gate (read), and forget gate (reset), each driven by x_t and h_{t−1} through sigmoid units; cell input and output pass through tanh; block output h_t]
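The gating in the figure can be sketched as one recurrence step. The parameter layout below (weights acting on the concatenation of x_t and h_{t−1}, plus biases) is an illustrative assumption, not the exact formulation of [32]:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h_prev, c_prev, p):
    # One LSTM step: input/forget/output gates around a linear memory cell.
    # p maps names to weight matrices acting on z = [x; h_prev] and biases.
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])                    # input gate (write)
    f = sigmoid(p["Wf"] @ z + p["bf"])                    # forget gate (reset)
    o = sigmoid(p["Wo"] @ z + p["bo"])                    # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])   # linear cell update
    h = o * np.tanh(c)                                    # gated cell output
    return h, c
```

Because the cell update is linear in c_prev, gradients can flow across many time steps when the forget gate stays open, which is what gives the LSTM its longer memory.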
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN                    LSTM                   Stats
w/ ∆      w/o ∆        w/ ∆      w/o ∆       Neutral   z       p
50.0      14.2         –         –           35.8      12.0    < 10⁻¹⁰
–         –            30.2      15.6        54.2      5.1     < 10⁻⁶
15.8      –            34.0      –           50.2      −6.2    < 10⁻⁹
28.4      –            –         33.6        38.0      −1.5    0.138
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Statistical parametric speech synthesis (SPSS) [3]

[Figure: speech analysis extracts acoustic features y and text analysis extracts linguistic features x from a speech database; model training yields λ; at synthesis time, text analysis gives x, parameter generation gives y, and speech synthesis reconstructs speech]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):  λ̂ = arg max_λ p(y | x, λ)
• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:  ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ
Statistical parametric speech synthesis (SPSS) [3]

[Figure: the SPSS training/synthesis pipeline with x, y, and λ]

• Large data + automatic training → automatic voice building
• Parametric representation of speech → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]

[Figure — training part: speech database → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation; spectral parameters → synthesis filter → SYNTHESIZED SPEECH]
Speech production process

[Figure: text (concept) determines the fundamental frequency (voiced/unvoiced), frequency transfer characteristics, and magnitude with start–end timing; a sound source (voiced: pulse, unvoiced: noise) driven by air flow modulates the carrier wave with speech information → speech]
Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source excitation part; H(e^jω): vocal tract resonance part

H(e^jω) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
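The decomposition x(n) = h(n) ∗ e(n) can be sketched directly: a toy numpy version with a switched pulse/noise excitation (all parameters and names are illustrative):

```python
import numpy as np

def source_filter(h, period, length, voiced=True, seed=0):
    # Source-filter sketch: x(n) = h(n) * e(n), with a pulse-train
    # excitation for voiced segments and white noise for unvoiced ones.
    rng = np.random.default_rng(seed)
    if voiced:
        e = np.zeros(length)
        e[::period] = 1.0                  # pulse train at the F0 period
    else:
        e = rng.standard_normal(length)    # white-noise excitation
    return np.convolve(h, e)[:length]      # LTI filtering, truncated
```

With an impulse-like h, the voiced output is just the impulse response repeated at every pitch period, which is the intuition behind the figure.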
Parametric models of speech signal

Autoregressive (AR) model:  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
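For the EX model, taking the log of H gives log|H(e^jω)| = Re{Σ_m c(m) e^{−jωm}}, i.e., the log-magnitude spectrum is the real part of the Fourier transform of the cepstrum. A sketch of that evaluation (the function name and FFT size are arbitrary choices):

```python
import numpy as np

def ex_model_log_spectrum(c, n_fft=64):
    # Exponential (EX) model: H(z) = exp(sum_m c(m) z^-m), so the
    # log-magnitude spectrum is Re{ sum_m c(m) e^{-j w m} }, evaluated
    # here on an FFT grid.
    return np.real(np.fft.rfft(c, n_fft))
```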
Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (0–5 kHz): (a) ML-based cepstral analysis, (b) linear prediction]
Structure of state-output (observation) vectors

o_t consists of:
  Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t
  Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, …, oT; state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]
Multi-stream HMM structure

[Figure: observation vector o_t split into streams: o_t¹ = (c_t, ∆c_t, ∆²c_t) for the spectrum and o_t², o_t³, o_t⁴ = p_t, δp_t, δ²p_t for the excitation, each with its own output distribution b_sj]

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}
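The weighted product b_j(o_t) = Π_s b_sj(o_t^s)^{w_s} is usually evaluated in the log domain, where it becomes a weighted sum of per-stream log-likelihoods. A diagonal-Gaussian sketch (names are illustrative):

```python
import numpy as np

def log_gauss(o, mu, var):
    # Diagonal-Gaussian log density.
    o, mu, var = map(np.asarray, (o, mu, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def multi_stream_logprob(obs, dists, weights):
    # log b_j(o_t) = sum_s w_s * log b_sj(o_t^s): per-stream Gaussian
    # log-likelihoods combined with stream weights w_s.
    return sum(w * log_gauss(o, mu, var)
               for o, (mu, var), w in zip(obs, dists, weights))
```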
Training process

[Flowchart, given data & labels]
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)  — monophone (context-independent, CI)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)  — fullcontext (context-dependent, CD)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT) → estimated HMMs
9. Estimate CD duration models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB) → estimated duration models
Context-dependent acoustic modeling

• Preceding & succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding, current, succeeding syllable
• Accent & stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding & succeeding stressed/accented syllables in phrase
• # of syllables from previous / to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding & succeeding content words in current phrase
• # of words from previous / to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models
Decision tree-based state clustering [7]

[Figure: binary decision tree with questions such as L=voice?, L="w"?, R=silence?, L="gy"?; contexts like w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n are routed to leaf nodes, and synthesized states are taken from the leaf distributions]
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency → build decision trees individually
State duration models [8]

[Figure: trellis over t = 1, …, T = 8; entering state i at t0 and leaving at t1 + 1]

Probability to enter state i at t0 then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models
Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Best state sequence

[Figure: 3-state left-to-right HMM as before; observation sequence O = o1, o2, …, oT; best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …; state durations D = 4, 10, 5]
Best state outputs w/o dynamic features

[Figure: per-state means and variances with the generated trajectory]

ô becomes a step-wise mean vector sequence
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W is a band matrix that stacks, for each frame t, a static row block [⋯ 0 I 0 0 ⋯] selecting c_t and a dynamic row block [⋯ −I I 0 0 ⋯] computing ∆c_t = c_t − c_{t−1}:

[⋯ c_{t−1}⊤ ∆c_{t−1}⊤ c_t⊤ ∆c_t⊤ c_{t+1}⊤ ∆c_{t+1}⊤ ⋯]⊤ = W [⋯ c_{t−2}⊤ c_{t−1}⊤ c_t⊤ c_{t+1}⊤ ⋯]⊤
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0,

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ µ_q̂
Speech parameter generation algorithm [9]

[Figure: the banded linear system W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ µ_q̂ written out, with the band structure coming from the 0/1/−1 entries of W, per-frame means µ_q̂1 … µ_q̂T on the right-hand side, and the static trajectory c1 … cT as the unknown]
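Putting the pieces together, the banded system above can be set up and solved directly for a 1-dim stream with a single delta. This is a small dense numpy sketch for illustration, not the efficient banded/Cholesky implementation used in practice:

```python
import numpy as np

def mlpg(mu, var):
    # ML parameter generation sketch for a 1-dim stream with static +
    # delta features, delta c_t = c_t - c_{t-1}.
    # mu, var: (T, 2) per-frame [static, delta] means and variances.
    # Solves W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c.
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: selects c_t
        W[2 * t + 1, t] = 1.0             # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    prec = 1.0 / var.reshape(-1)          # diagonal Sigma^-1
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu.reshape(-1))
    return np.linalg.solve(A, b)
```

When the delta variances are huge (i.e., the dynamic features carry no information), the solution collapses to the step-wise static means, matching the earlier "w/o dynamic features" slide.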
Generated speech parameter trajectory

[Figure: static and dynamic means/variances and the generated smooth static trajectory c]
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ1 … λ4 interpolated into λ′ with ratios I(λ′, λ1) … I(λ′, λ4)]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction: harmonic effects often cause problems

  [Figure: power (dB) vs. frequency, 0–8 kHz]

• Phase: important, but usually ignored
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree over the acoustic space, with yes/no context questions at each node and clustered states at the leaves]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: DNN-based framework — TEXT → text analysis → input feature extraction (binary & numeric features, duration & frame-position features at frames 1…T) → DNN (input, hidden, output layers) → statistics (mean & var) of speech parameter vector sequence (spectral features, excitation features, V/UV feature) → duration prediction, parameter generation, waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
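As a rough illustration of the figure's "binary & numeric features at frame t", a frame-level input vector can be assembled from one-hot categorical features, numeric contexts, and duration/frame-position features. The feature names below are hypothetical, not the actual feature set used in these experiments:

```python
import numpy as np

# Hypothetical frame-level input for DNN-based SPSS: one-hot (binary)
# linguistic categories plus numeric contexts, with per-frame duration and
# position features appended. Illustrative only.
PHONES = ["sil", "a", "t", "w"]

def frame_input(phone, stress, n_syls_in_word, frame_idx, n_frames):
    onehot = [1.0 if phone == p else 0.0 for p in PHONES]   # binary features
    numeric = [float(stress), float(n_syls_in_word)]        # numeric features
    position = [n_frames / 100.0,                           # duration feature
                frame_idx / max(n_frames - 1, 1)]           # frame position in 0..1
    return np.array(onehot + numeric + position)

x = frame_input("a", 1, 2, frame_idx=5, n_frames=11)
# len(x) == 8; the last entry sweeps 0..1 across the phone's frames
```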
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database:             US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate:        16 kHz
Analysis window:      25-ms width, 5-ms shift
Linguistic features:  11 categorical features, 25 numeric features
Acoustic features:    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology:         5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture:     1–5 layers, 256/512/1024/2048 units/layer,
                      sigmoid, continuous F0 [24]
Postprocessing:       Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions & numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum trajectories over ~500 frames — natural speech, HMM (α = 1), DNN (4 × 512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM (α)     DNN (# layers × units)   Neutral   p value   z value
  15.8 (16)   38.5 (4 × 256)           45.7      < 10⁻⁶    −9.9
  16.1 (4)    27.2 (4 × 512)           56.8      < 10⁻⁶    −5.1
  12.7 (1)    36.6 (4 × 1024)          50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
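A plausible reading of the z values above is a normal-approximation sign test over the non-neutral preferences; whether exactly this statistic was used is an assumption. A sketch under that assumption:

```python
import math

def sign_test_z(wins_a, wins_b):
    """Normal-approximation sign test on non-neutral preference counts.

    A common analysis for paired-comparison listening tests; that it is the
    exact statistic behind the table above is an assumption, and the counts
    below are hypothetical.
    """
    n = wins_a + wins_b
    return (wins_a - n / 2.0) / (math.sqrt(n) / 2.0)

z = sign_test_z(137, 333)  # hypothetical counts favouring system B
# large negative z -> system B preferred well beyond chance
```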
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: one-to-many data samples in the (y1, y2) plane vs. the single NN prediction]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN — network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) defining a mixture density over y]

Inputs of activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
Means → linear activation function:
  μ1(x) = z3,  μ2(x) = z4
Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
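The activations above can be sketched directly: for the 1-dim, 2-mix MDN, six network outputs map to mixture weights (softmax), means (linear), and standard deviations (exponential), which define p(y | x):

```python
import numpy as np

def mdn_params(z):
    """Map six network outputs z = [z1..z6] to a 1-dim, 2-mix GMM,
    as on the slide: softmax -> weights, linear -> means, exp -> std devs."""
    z = np.asarray(z, dtype=float)
    logits, means, log_sigmas = z[0:2], z[2:4], z[4:6]
    w = np.exp(logits) / np.exp(logits).sum()   # w1, w2 sum to 1
    sigma = np.exp(log_sigmas)                  # positive std devs
    return w, means, sigma

def mdn_density(y, w, means, sigma):
    """p(y | x) under the mixture, the distribution parameter generation uses."""
    comp = np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float((w * comp).sum())

w, mu, sd = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
# equal weights (0.5, 0.5), means -1 and 1, unit variances
```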
DMDN-based SPSS [27]
[Figure: DMDN-based framework — TEXT → text analysis, input feature extraction, duration prediction → DMDN outputs per-frame GMM parameters w1(x_t), w2(x_t), μ1(x_t), μ2(x_t), σ1(x_t), σ2(x_t) for t = 1…T → parameter generation, waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture:  4–7 hidden layers, 1024 units/hidden layer,
                   ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer,
                   ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization:      AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM            1 mix     3.537 ± 0.113
                 2 mix     3.397 ± 0.115
  DNN            4×1024    3.635 ± 0.127
                 5×1024    3.681 ± 0.109
                 6×1024    3.652 ± 0.108
                 7×1024    3.637 ± 0.129
  DMDN (4×1024)  1 mix     3.654 ± 0.117
                 2 mix     3.796 ± 0.107
                 4 mix     3.766 ± 0.113
                 8 mix     3.805 ± 0.113
                 16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
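The ± values in the table read as confidence intervals on the mean opinion score. A sketch under the assumption of the usual normal-approximation interval over raw ratings (the ratings below are made up):

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval.

    Whether the table's +/- values use exactly this formula is an assumption;
    it is the usual way MOS results are reported.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)   # sample variance
    half = z * math.sqrt(var / n)                           # half-width of CI
    return mean, half

m, h = mos_with_ci([4, 4, 3, 5, 4, 3, 4, 5, 3, 4])  # hypothetical ratings
```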
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes, syllable stress) used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
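The dynamic feature constraint mentioned above uses, in the simplest form from earlier in the talk, Δc_t = c_t − c_{t−1} appended to the static features. A minimal sketch of building the observation vectors o_t = [c_t, Δc_t]:

```python
import numpy as np

def append_delta(c):
    """Append the simple dynamic feature delta c_t = c_t - c_{t-1} to a
    static parameter trajectory (delta of the first frame set to 0)."""
    c = np.asarray(c, dtype=float)
    delta = np.empty_like(c)
    delta[0] = 0.0
    delta[1:] = c[1:] - c[:-1]
    return np.stack([c, delta], axis=1)  # rows are o_t = [c_t, delta c_t]

o = append_delta([0.0, 1.0, 3.0])
# o[:, 1] == [0, 1, 2]
```

At synthesis time the parameter generation algorithm inverts this relationship, solving for the static trajectory most consistent with both the static and delta statistics.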
Basic RNN
[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}; outputs y_{t−1}, y_t, y_{t+1}; recurrent connections between hidden states]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with tanh input/output nonlinearities; sigmoid input ("write"), forget ("reset"), and output ("read") gates, each fed by x_t and h_{t−1}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
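The gating scheme in the figure can be written out directly; a minimal numpy sketch of one LSTM step (illustrative sizes, no peephole connections):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step matching the figure: input, forget and output gates
    (sigmoid) around a tanh memory cell. W maps [x; h_prev] to the four
    gate pre-activations stacked together."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates in (0, 1)
    c = f * c_prev + i * np.tanh(g)                # forget ("reset") + input ("write")
    h = o * np.tanh(c)                             # output gate ("read")
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```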
LSTM-based SPSS [33, 34]

[Figure: LSTM-based framework — TEXT → text analysis, input feature extraction, duration prediction → inputs x_1 … x_T → LSTM → outputs y_1 … y_T → parameter generation, waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database:            US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                 4 hidden layers, 1024 units/hidden layer,
                     ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection,
                     asynchronous SGD on CPUs [35]
Postprocessing:      Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

  DNN               LSTM              Stats
  w/ ∆    w/o ∆     w/ ∆    w/o ∆     Neutral   z      p
  50.0    14.2      –       –         35.8      12.0   < 10⁻¹⁰
  –       –         30.2    15.6      54.2       5.1   < 10⁻⁶
  15.8    –         34.0    –         50.2      −6.2   < 10⁻⁹
  28.4    –         –       33.6      38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree of yes/no questions over linguistic contexts partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
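The replacement can be sketched in a few lines of NumPy: the decision-tree + GMM lookup becomes a single forward pass from linguistic features x to acoustic features y. Layer sizes, initialization, and the output dimensionality below are illustrative assumptions, not the experimental configuration:

```python
import numpy as np

def init_dnn(sizes, rng):
    """Random weights for a feed-forward net; sizes = [in, h1, ..., out]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_forward(layers, x):
    """Sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden units
    W, b = layers[-1]
    return h @ W + b                             # linear output: acoustic features

rng = np.random.default_rng(0)
layers = init_dnn([449, 512, 512, 512, 512, 127], rng)  # sizes are illustrative
x = rng.standard_normal(449)    # one frame of linguistic features
y = dnn_forward(layers, x)      # predicted acoustic feature statistics
print(y.shape)                  # (127,)
```

Training (backpropagation with an MSE loss) is omitted; the point is that one network evaluation replaces the tree traversal and GMM lookup.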
Framework

[Figure: DNN-based SPSS pipeline — TEXT → text analysis → input feature extraction (binary & numeric features, plus duration and frame-position features, for frames 1 … T) → input layer / hidden layers / output layer of the DNN → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction provides the frame alignment]
Advantages of NN-based acoustic modeling

• Integrated feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ Feature extraction integrated into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Framework

Is this new? → No

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectories over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)   | DNN (# layers × # units) | Neutral | p value  | z value
15.8 (16) | 38.5 (4 × 256)           | 45.7    | < 10^−6  | −9.9
16.1 (4)  | 27.2 (4 × 512)           | 56.8    | < 10^−6  | −5.1
12.7 (1)  | 36.6 (4 × 1024)          | 50.7    | < 10^−6  | −11.5
Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: multimodal data samples in (y1, y2) space vs. the NN prediction collapsing to the mean]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — output units z1 … z6 parameterize a 2-component GMM over y: weights w1(x), w2(x), means μ1(x), μ2(x), variances σ1(x), σ2(x)]

Inputs of the activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),   w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
μ1(x) = z3,   μ2(x) = z4

Variances → exponential activation function:
σ1(x) = exp(z5),   σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
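The three activation rules above can be sketched directly in NumPy. This is a minimal 1-dim, 2-mix illustration of the output-layer mapping and the resulting negative log-likelihood loss; the input values z are arbitrary placeholders:

```python
import numpy as np

def mdn_params(z, n_mix):
    """Map 3*n_mix network outputs z to GMM weights, means, std devs
    for a 1-dim target: softmax / linear / exp, as on the MDN slide."""
    w = np.exp(z[:n_mix] - z[:n_mix].max())
    w = w / w.sum()                 # softmax -> mixture weights
    mu = z[n_mix:2 * n_mix]         # linear -> means
    sigma = np.exp(z[2 * n_mix:])   # exp -> positive std devs
    return w, mu, sigma

def mdn_loss(z, y, n_mix):
    """Negative log-likelihood of a scalar target y under the predicted GMM."""
    w, mu, sigma = mdn_params(z, n_mix)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.2, -0.1, 1.0, -1.0, 0.0, 0.3])  # 2-mix: [z1 z2 | z3 z4 | z5 z6]
w, mu, sigma = mdn_params(z, 2)
```

Minimizing this loss instead of MSE lets the network keep multiple modes and per-frame variances rather than collapsing to the conditional mean.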
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → for each frame t = 1 … T the DMDN outputs GMM parameters w_i(x_t), μ_i(x_t), σ_i(x_t) over y → parameter generation → waveform synthesis → SPEECH]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS:

HMM            1 mix: 3.537 ± 0.113
               2 mix: 3.397 ± 0.115
DNN          4×1024: 3.635 ± 0.127
             5×1024: 3.681 ± 0.109
             6×1024: 3.652 ± 0.108
             7×1024: 3.637 ± 0.129
DMDN (4×1024)  1 mix: 3.654 ± 0.117
               2 mix: 3.796 ± 0.107
               4 mix: 3.766 ± 0.113
               8 mix: 3.805 ± 0.113
              16 mix: 3.791 ± 0.102
Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]
Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} mapped to outputs y_{t−1}, y_t, y_{t+1} through recurrent connections]

• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); each gate is fed by x_t and h_{t−1} through sigmoid activations, with tanh squashing on the cell input and output, producing h_t]
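A single LSTM step can be sketched as follows, matching the gating structure in the diagram. This is a simplified cell (no peephole connections or projection layer), and all weights are random placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input/forget/output gates around a linear memory cell."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p['Wi'] @ z + p['bi'])       # input gate: write
    f = sigmoid(p['Wf'] @ z + p['bf'])       # forget gate: reset
    o = sigmoid(p['Wo'] @ z + p['bo'])       # output gate: read
    g = np.tanh(p['Wc'] @ z + p['bc'])       # candidate cell input
    c = f * c_prev + i * g                   # linear memory cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
nx, nh = 8, 4                                # toy dimensions
p = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p.update({k: np.zeros(nh) for k in ('bi', 'bf', 'bo', 'bc')})
h, c = np.zeros(nh), np.zeros(nh)
for t in range(5):                           # run over a short input sequence
    h, c = lstm_step(rng.standard_normal(nx), h, c, p)
```

The cell state c is updated additively (f · c + i · g), which is what lets gradients and information survive over longer spans than in a basic RNN.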
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM maps input features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]
Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10^−10
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10^−6
15.8     | –         | 34.0      | –          | 50.2    | −6.2 | < 10^−9
28.4     | –         | –         | 33.6       | 38.0    | −1.5 | 0.138
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
HMM-based speech synthesis [4]

[Figure: system diagram — Training part: SPEECH DATABASE → spectral & excitation parameter extraction; labels + parameters train context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → SYNTHESIZED SPEECH]
Speech production process

[Figure: speech production — text (concept) modulates a carrier wave with speech information: fundamental frequency (voiced/unvoiced), frequency transfer characteristics, magnitude, start–end timing; air flow through the sound source (voiced: pulse, unvoiced: noise) and the vocal tract produces speech]
Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n)]

x(n) = h(n) ∗ e(n)
↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

Source excitation part: E(e^{jω}); vocal tract resonance part: H(e^{jω})

H(e^{jω}) should be defined by the HMM state-output vectors,
e.g. mel-cepstrum, line spectral pairs
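The two relations above can be checked numerically: convolving a pulse-train excitation with an impulse response in the time domain equals multiplying their transforms in the frequency domain. The pulse period and the toy impulse response below are illustrative, not a real vocal-tract filter:

```python
import numpy as np

fs = 16000                     # sampling rate (Hz); values are illustrative
f0 = 200.0                     # fundamental frequency of the pulse train
n = 400                        # excitation length in samples

# Source excitation e(n): pulse train, one unit pulse per pitch period
e = np.zeros(n)
e[::int(fs / f0)] = 1.0        # every 80 samples at 16 kHz / 200 Hz

# h(n): toy decaying impulse response standing in for the vocal tract filter
h = 0.9 ** np.arange(50)

# Time domain: x(n) = h(n) * e(n)  (linear convolution)
x = np.convolve(e, h)

# Frequency domain: X(e^jw) = H(e^jw) E(e^jw), with FFTs padded to len(x)
N = len(x)
X2 = np.fft.fft(e, N) * np.fft.fft(h, N)
```

The FFTs must be zero-padded to the full convolution length (n + len(h) − 1) for the identity to hold exactly.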
Parametric models of speech signal

Autoregressive (AR) model:
H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:
H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (0–5 kHz) — (a) ML-based cepstral analysis, (b) linear prediction]
Structure of state-output (observation) vectors

[Figure: observation vector o_t — spectrum part: mel-cepstral coefficients c_t with ∆c_t and ∆²c_t; excitation part: log F0 p_t with δp_t and δ²p_t]
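Building such a state-output vector from static features is straightforward. A sketch using simple first differences for the dynamic features (∆c_t = c_t − c_{t−1}, as defined on the dynamic-features slide; the first frame's deltas are set to zero as a boundary convention):

```python
import numpy as np

def add_deltas(c):
    """Append 1st/2nd-order dynamic features to static features.
    delta c_t = c_t - c_{t-1}; delta^2 c_t = delta c_t - delta c_{t-1}."""
    d1 = np.diff(c, axis=0, prepend=c[:1])     # delta; first frame -> 0
    d2 = np.diff(d1, axis=0, prepend=d1[:1])   # delta-delta; first frame -> 0
    return np.hstack([c, d1, d2])              # o_t = [c_t, dc_t, ddc_t]

c = np.array([[0.0], [1.0], [3.0], [6.0]])     # 4 frames, 1-dim static feature
o = add_deltas(c)                              # rows: [c_t, dc_t, ddc_t]
```

Real systems often use wider regression windows to compute deltas; the simple difference here matches the definition used later in this deck.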
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1 o_2 o_3 o_4 o_5 … o_T aligned to state sequence Q = 1 1 1 1 2 2 3 3 …]
Multi-stream HMM structure

[Figure: o_t split into streams — stream 1: spectrum (c_t, ∆c_t, ∆²c_t) = o_t^1; streams 2–4: excitation (p_t, δp_t, δ²p_t) = o_t^2, o_t^3, o_t^4]

The state-output probability is a weighted product of per-stream densities:

b_j(o_t) = ∏_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}
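The weighted product of stream densities is usually computed in the log domain, where it becomes a weighted sum. A minimal sketch with diagonal-covariance Gaussian streams; all dimensions and parameters are toy values:

```python
import numpy as np

def gauss_logpdf(o, mu, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def multistream_loglik(streams, models, weights):
    """log b_j(o_t) = sum_s w_s log b_sj(o_t^s): the weighted product of
    per-stream densities, evaluated in the log domain."""
    return sum(w * gauss_logpdf(o, mu, var)
               for (o, (mu, var), w) in zip(streams, models, weights))

# Toy example: a 3-dim spectrum stream and a 1-dim excitation stream
streams = [np.array([0.1, -0.2, 0.3]), np.array([5.0])]
models = [(np.zeros(3), np.ones(3)), (np.array([4.8]), np.array([0.25]))]
weights = [1.0, 1.0]
ll = multistream_loglik(streams, models, weights)
```

Setting a stream weight to zero removes that stream's contribution, which is how stream weights control the balance between spectrum and excitation.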
Training process
Compute variancefloor (HCompV)
Initialize CI-HMMs bysegmental k-means (HInit)
Reestimate CI-HMMs byEM algorithm
(HRest amp HERest)
Copy CI-HMMs to CD-HMMs (HHEd CL)
Reestimate CD-HMMs byEM algorithm (HERest)
Decision tree-basedclustering (HHEd TB)
Reestimate CD-HMMsby EM algorithm (HERest)
Untie parameter tyingstructure (HHEd UT)
monophone(context-independent CI)
fullcontext(context-dependent CD)
EstimatedHMMs
data amp labels
Estimate CD-dur modelsfrom FB stats (HERest)
Decision tree-basedclustering (HHEd TB)
Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models
Decision tree-based state clustering [7]

[Figure: binary decision tree with questions such as L=voice?, L="w"?, R=silence?, L="gy"? splitting context-dependent states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into leaf nodes; synthesized states are drawn from the leaves]
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
State duration models [8]

[Figure: occupancy of state i from t0 to t1 within t = 1 … T = 8]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · ∏_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models
Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for state duration models]
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Best state sequence

[Figure: the 3-state HMM from before, with observation sequence O = o_1 … o_T, best state sequence Q = 1 1 1 1 2 2 3 3 …, and state durations D = 4, 10, 5]
Best state outputs w/o dynamic features

[Figure: state-wise means and variances; ô becomes a step-wise mean vector sequence]
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = Wc:

[ ⋮, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, ⋮ ]^⊤ =

⎡ ⋯  0   I   0   0  ⋯ ⎤
⎢ ⋯ −I   I   0   0  ⋯ ⎥
⎢ ⋯  0   0   I   0  ⋯ ⎥  [ ⋮, c_{t−2}, c_{t−1}, c_t, c_{t+1}, ⋮ ]^⊤
⎢ ⋯  0  −I   I   0  ⋯ ⎥
⎢ ⋯  0   0   0   I  ⋯ ⎥
⎣ ⋯  0   0  −I   I  ⋯ ⎦
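For 1-dimensional static features the band matrix W can be built explicitly. A sketch using the ∆c_t = c_t − c_{t−1} definition above, with c_0 = 0 assumed at the left boundary:

```python
import numpy as np

def make_W(T):
    """Band matrix W mapping static features c (length T) to
    o = [c_1, dc_1, c_2, dc_2, ...] with dc_t = c_t - c_{t-1}
    (1-dim features; the first frame's delta is taken against c_0 = 0)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0           # static row: c_t
        W[2 * t + 1, t] = 1.0       # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 6.0])
o = make_W(4) @ c                   # -> [0, 0, 1, 1, 3, 2, 6, 3]
```

With D-dimensional features each 0/±1 entry becomes a D×D block, as in the slide's matrix.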
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

then setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0 gives

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂
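Solving this linear system yields the maximum-likelihood static trajectory. A toy 1-dim sketch: step-wise static means with confident zero-mean deltas produce a smoothed trajectory c. The band matrix and boundary convention follow the previous slide, and all numbers are illustrative:

```python
import numpy as np

def make_W(T):
    """Band matrix for 1-dim static + delta features, dc_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var, T):
    """Solve W' S^-1 W c = W' S^-1 mu for the ML static trajectory c.
    mu, var: interleaved [static, delta] means/variances per frame (length 2T)."""
    W = make_W(T)
    Sinv = np.diag(1.0 / var)               # diagonal covariance
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ mu
    return np.linalg.solve(A, b)            # in practice: a banded Cholesky solve

T = 6
mu = np.zeros(2 * T)
mu[0::2] = [0, 0, 0, 4, 4, 4]               # step-wise static means (two "states")
mu[1::2] = 0.0                              # delta means of zero
var = np.ones(2 * T)
var[1::2] = 0.1                             # small delta variance -> smooth c
c = mlpg(mu, var, T)
```

Because W^⊤Σ⁻¹W is banded, production systems solve this with an O(T) banded Cholesky factorization rather than a dense solve.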
Speech parameter generation algorithm [9]

[Figure: structure of the linear system W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂ — W is a sparse band matrix of 0 / 1 / −1 blocks, c = [c_1, …, c_T], μ_q̂ = [μ_{q̂1}, μ_{q̂2}, …, μ_{q̂T}]]
Generated speech parameter trajectory

[Figure: static and dynamic means and variances per state, and the smooth generated static trajectory c]
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) select pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); output: synthesized speech x(n) = h(n) ∗ e(n)]
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: average-voice model — adaptive training over training speakers, adaptation toward target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ_1 … λ_4 combined into λ′ with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation:
Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
Harmonic effects often cause problems

[Figure: power (dB) vs. frequency (0–8 kHz) showing harmonic ripple]

• Phase:
Important, but usually ignored
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
Statistics do not vary within an HMM state
• Conditional independence assumption:
State output probability depends only on the current state
• Weak duration modeling:
State duration probability decreases exponentially with time

None of these hold for real speech
Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: TEXT → text analysis → input feature extraction → frame-level input features (binary & numeric features, duration feature, frame-position feature) for frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the number of frames]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
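As an illustration of how such a frame-level input vector might be assembled — the feature names and the simple frame-position encoding below are assumptions for the sketch, not the exact recipe of this system:

```python
import numpy as np

def frame_input_features(binary, numeric, duration, frame_index):
    """Assemble one frame's network input: binary answers to context
    questions, numeric context features, the phoneme duration, and the
    position of this frame within the phoneme (all names illustrative)."""
    pos = frame_index / duration          # coarse within-phoneme position
    return np.concatenate([binary, numeric, [duration, pos]])

binary = np.array([1.0, 0.0, 0.0, 1.0])   # e.g. "is-vowel?", "is-stressed?", ...
numeric = np.array([2.0, 5.0])            # e.g. syllable position, # of phonemes
x = frame_input_features(binary, numeric, duration=7, frame_index=3)
```

The duration and frame-position entries are what make the mapping frame-dependent even though the phoneme-level context features repeat across the frames of one phoneme.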
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ Integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? ... no
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed
[Figure: 5-th mel-cepstrum over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)    DNN (layers × units)   Neutral   p value    z value
15.8 (16)  38.5 (4 × 256)         45.7      < 10⁻⁶     -9.9
16.1 (4)   27.2 (4 × 512)         56.8      < 10⁻⁶     -5.1
12.7 (1)   36.6 (4 × 1024)        50.7      < 10⁻⁶     -11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples in (y1, y2) vs. the NN prediction]
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances
Linear output layer → Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer produces w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]
Weights → Softmax activation function
Means → Linear activation function
Variances → Exponential activation function
Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm)    µ1(x) = z3    σ1(x) = exp(z5)
w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)    µ2(x) = z4    σ2(x) = exp(z6)
NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
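The three activation choices on this slide can be sketched directly. The splitting of the network's raw output vector z into weight, mean, and variance slots below is one possible layout, chosen for the sketch:

```python
import numpy as np

def mdn_params(z, n_mix, dim):
    """Turn a network's raw output z into GMM parameters, using the
    activations from the slide: softmax -> mixture weights,
    linear -> means, exponential -> standard deviations (kept positive)."""
    w = z[:n_mix]
    mu = z[n_mix:n_mix + n_mix * dim].reshape(n_mix, dim)
    log_sigma = z[n_mix + n_mix * dim:].reshape(n_mix, dim)
    w = np.exp(w - w.max())
    w = w / w.sum()              # softmax: weights are positive, sum to 1
    sigma = np.exp(log_sigma)    # exponential: strictly positive spreads
    return w, mu, sigma

# 1-dimensional, 2-mixture example as on the slide (z values illustrative)
z = np.array([0.2, -0.1, 1.5, -0.7, 0.0, 0.3])
w, mu, sigma = mdn_params(z, n_mix=2, dim=1)
```

Because the means pass through a linear activation they are unconstrained, while the softmax and exponential guarantee a valid mixture for any z the network emits.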
DMDN-based SPSS [27]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN producing w1, w2, µ1, µ2, σ1, σ2 for each input x1, x2, …, xT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time; recurrent connections carry the hidden state across inputs xt−1, xt, xt+1 and outputs yt−1, yt, yt+1]
• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ Quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block; input gate (write), output gate (read), and forget gate (reset) — each a sigmoid of xt and ht−1 — surround a tanh memory cell ct producing ht]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
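A minimal forward step of such an LSTM block, in the standard formulation; the weight shapes, gate ordering, and random initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step with the gates from the slide: input, forget and
    output gates (sigmoids of [x, h_prev]) around a linear memory cell.
    W maps the concatenated [x, h_prev] to the four pre-activations."""
    z = np.concatenate([x, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates in (0, 1)
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # forget gate resets, input gate writes
    h = o * np.tanh(c)                            # output gate reads the cell
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4
W = rng.normal(0.0, 0.1, (n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):                    # run over a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```

The cell state c is only rescaled by the forget gate rather than squashed each step, which is what lets information survive over long time spans.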
LSTM-based SPSS [33, 34]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x1, x2, …, xT to y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
HMM-based speech synthesis [4]
[Figure: system overview. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation & synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79
Speech production process
[Figure: modulation of a carrier wave by speech information. text (concept) → fundamental frequency (voiced/unvoiced), frequency transfer characteristics, magnitude, start–end; air flow through the vocal tract; sound source: voiced = pulse, unvoiced = noise → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]
x(n) = h(n) ∗ e(n)
↓ Fourier transform
X(e^jω) = H(e^jω) E(e^jω)
E(e^jω): source excitation part; H(e^jω): vocal tract resonance part
h(n) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
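A toy rendering of the source-filter model: the impulse response below is a made-up decaying resonance, not one derived from real spectral parameters, and the pulse period is fixed rather than taken from a generated F0 contour.

```python
import numpy as np

def excitation(n, period):
    """Excitation e(n): a pulse train (voiced, period in samples)
    or white noise (unvoiced, period = 0)."""
    if period > 0:
        e = np.zeros(n)
        e[::period] = 1.0
        return e
    return np.random.default_rng(0).normal(size=n)

# Toy vocal-tract impulse response h(n): a decaying 800 Hz resonance.
# In a real vocoder h(n) would be realized from the generated
# cepstrum/LSPs by the synthesis filter.
fs = 16000
n = np.arange(64)
h = 0.97 ** n * np.cos(2 * np.pi * 800 * n / fs)

e = excitation(400, period=fs // 200)   # 200 Hz voiced excitation
x = np.convolve(e, h)[:400]             # x(n) = h(n) * e(n)
```

Swapping the pulse train for noise in unvoiced regions, frame by frame, gives the simple pulse/noise excitation scheme whose limitations the "Vocoding issues" slide discusses.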
Parametric models of speech signal
Autoregressive (AR) model:
H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})
Exponential (EX) model:
H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}
Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)
• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
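For the AR model, the ML estimate reduces in practice to solving the linear-prediction normal equations from the signal's autocorrelations. A compact sketch of the standard Levinson-Durbin recursion (a textbook algorithm, not code from the talk):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the AR (linear prediction) normal equations from
    autocorrelations r[0..order] by the Levinson-Durbin recursion.
    Returns prediction coefficients a (with a[0] = 1) and the
    final prediction-error energy e."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for m in range(1, order + 1):
        acc = r[m]
        for i in range(1, m):
            acc += a[i] * r[m - i]
        k = -acc / e                    # reflection (PARCOR) coefficient
        new_a = a.copy()
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        e *= 1.0 - k * k                # updated prediction error
    return a, e

# Exact autocorrelation of an AR(1) process x(n) = 0.9 x(n-1) + w(n):
r = 0.9 ** np.arange(3) / (1.0 - 0.81)
a, e = levinson_durbin(r, order=2)      # recovers a = [1, -0.9, 0]
```

The recovered coefficients define the denominator of the slide's AR transfer function H(z); note the PARCOR coefficients produced along the way are one of the parameterizations listed later under "Synthesis filter".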
Examples of speech spectra
[Figure: log magnitude (dB, −20 to 80) vs. frequency (0–5 kHz) for (a) ML-based cepstral analysis and (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: system overview (training part & synthesis part), repeated from slide 8]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
[Figure: observation vector ot = spectrum part (mel-cepstral coefficients ct, ∆ct, ∆∆ct) + excitation part (log F0 pt, δpt, δ²pt)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
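A minimal sketch of building such an observation vector: append first-order deltas (∆c_t = c_t − c_{t−1}, as defined on the later "Using dynamic features" slide) and their deltas to a static parameter sequence. The zero-padded first frame is a boundary choice made for the sketch.

```python
import numpy as np

def add_deltas(c):
    """Append delta and delta-delta features to a static parameter
    sequence c (frames x dim), using simple first-order differences:
    delta(t) = c(t) - c(t-1), with delta(0) = 0."""
    d1 = np.diff(c, axis=0, prepend=c[:1])      # delta
    d2 = np.diff(d1, axis=0, prepend=d1[:1])    # delta-delta
    return np.hstack([c, d1, d2])               # o_t = [c_t, dc_t, ddc_t]

c = np.arange(10, dtype=float).reshape(5, 2)    # toy 2-dim statics, 5 frames
o = add_deltas(c)
```

The F0 stream gets the same treatment with its own deltas, which is why the multi-stream structure on the next slides carries four streams.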
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1, o2, …, oT aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
[Figure: observation vector ot split into streams s = 1, 2, 3, 4: o1t = ct, ∆ct, ∆²ct (spectrum) and o2t = pt, o3t = δpt, o4t = δ²pt (excitation), with per-stream output distributions b1j, b2j, b3j, b4j]
b_j(ot) = Π_{s=1}^{S} (b_sj(o_t^s))^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
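In the log domain the weighted stream product on this slide becomes a weighted sum, which is how it is usually computed. A one-line sketch; the example stream weights and log probabilities are illustrative:

```python
def multi_stream_logprob(stream_logprobs, stream_weights):
    """Combine per-stream log output probabilities with stream weights:
    b_j(o_t) = prod_s b_sj(o_t^s)^(w_s)  ->  sum_s w_s * log b_sj(o_t^s)."""
    return sum(w * lp for w, lp in zip(stream_weights, stream_logprobs))

# e.g. one spectrum stream and three F0 streams, as in the figure
lp = multi_stream_logprob([-12.0, -1.5, -1.5, -1.5],
                          [1.0, 1 / 3, 1 / 3, 1 / 3])
```

Setting a stream weight to zero simply removes that stream's contribution, which is what makes the stream weighting a convenient tuning knob.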
Training process
[Flowchart, starting from data & labels:]
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)   [monophone (context-independent, CI)]
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)   [fullcontext (context-dependent, CD)]
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD-dur models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated HMMs & estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• Preceding, succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes at preceding, current, succeeding syllable
• Accent, stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding, succeeding stressed, accented syllables in phrase
• # of syllables from previous, to next stressed, accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous, to next content word
• # of syllables in preceding, current, succeeding phrase
Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: binary decision tree with yes/no questions such as L=voice?, L=w?, R=silence?, L=gy?; leaf nodes tie states of contexts like w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n; synthesized states are drawn from the leaves]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]
Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: state i occupied from t0 to t1 on a trellis, t = 1, 2, …, T = 8]
Probability to enter state i at t0, then leave at t1 + 1:
χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)
→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: system overview (training part & synthesis part), repeated from slide 8]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words:
ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)
Determine the best state sequence and outputs sequentially:
q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM as before; observation sequence O = o1, …, oT aligned with best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3; state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features
[Figure: state-wise means and variances; ô becomes a step-wise mean vector sequence]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:
ot = [ct⊤, ∆ct⊤]⊤,  ∆ct = ct − ct−1
(per frame: ot is 2M-dimensional, ct is M-dimensional)
The relationship between static and dynamic features can be arranged as o = Wc, where W is a sparse band matrix built from the delta windows:

[ ⋮     ]   [ ⋯  0   I  0  0 ⋯ ]
[ ct−1  ]   [ ⋯ −I   I  0  0 ⋯ ] [ ⋮    ]
[ ∆ct−1 ]   [ ⋯  0   0  I  0 ⋯ ] [ ct−2 ]
[ ct    ] = [ ⋯  0  −I  I  0 ⋯ ] [ ct−1 ]
[ ∆ct   ]   [ ⋯  0   0  0  I ⋯ ] [ ct   ]
[ ct+1  ]   [ ⋯  0   0 −I  I ⋯ ] [ ct+1 ]
[ ∆ct+1 ]                        [ ⋮    ]
[ ⋮     ]
    o                 W              c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:
ô = arg max_o p(o | q̂, λ) subject to o = Wc
If the state-output distribution is a single Gaussian,
p(o | q̂, λ) = N(o; µq̂, Σq̂)
By setting ∂ log N(Wc; µq̂, Σq̂) / ∂c = 0,
W⊤ Σq̂⁻¹ W c = W⊤ Σq̂⁻¹ µq̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
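The closed-form solution above can be illustrated end to end on a toy 1-dimensional stream with static + delta features. The window (∆c_t = c_t − c_{t−1}), the boundary handling at t = 0, and the example means and variances below are choices made for the sketch:

```python
import numpy as np

def mlpg(mu, var):
    """Maximum-likelihood parameter generation for a toy 1-dim stream
    with static + delta features. mu, var: (T x 2) per-frame means and
    variances, columns ordered [static, delta]. Builds the band matrix W
    and solves  W' S^-1 W c = W' S^-1 mu  for the static trajectory c."""
    T = len(mu)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row selects c_t
        if t > 0:                       # delta row: c_t - c_{t-1}
            W[2 * t + 1, t] = 1.0       # (no delta constraint at t = 0)
            W[2 * t + 1, t - 1] = -1.0
    Sinv = np.diag(1.0 / var.reshape(-1))
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ mu.reshape(-1)
    return np.linalg.solve(A, b)

# Step-wise static means (two 'states') with tight delta variances
# around a zero delta mean: the solution smooths the step.
mu = np.array([[0.0, 0.0]] * 5 + [[1.0, 0.0]] * 5)
var = np.array([[1.0, 0.01]] * 10)
c = mlpg(mu, var)
```

With the tight delta variances, the step-wise static means are turned into a smooth transition between the two levels — the effect shown on the "Generated speech parameter trajectory" slide. Real implementations exploit the band structure of W⊤Σ⁻¹W (e.g. a Cholesky solver) instead of forming dense matrices.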
Speech parameter generation algorithm [9]
[Figure: the linear system W⊤Σq⁻¹W c = W⊤Σq⁻¹µq written out element-wise; the band rows of 1 and −1 entries from the static and delta windows relate c1, c2, …, cT to µq1, µq2, …, µqT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: static and dynamic means/variances and the generated smooth static trajectory c]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: system overview (training part & synthesis part), repeated from slide 8]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]
• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style
→ Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]
[Figure: target point λ′ interpolated among HMM sets λ1, λ2, λ3, λ4 with ratios I(λ′, λ1), …, I(λ′, λ4)]
λ: HMM set; I(λ′, λ): interpolation ratio
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
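One simple way to realize the mixing is to blend the Gaussian output distributions of the representative HMM sets linearly with the ratios I(λ′, λ). This is a sketch of one possible scheme, not the exact method of any of [14–17]:

```python
import numpy as np

def interpolate_gaussians(mus, sigmas, weights):
    """Blend state-output Gaussians of several representative models
    with interpolation ratios that sum to one: a simple linear blend
    of the means and variances (one of several published schemes)."""
    w = np.asarray(weights)[:, None]
    mu = (w * mus).sum(axis=0)
    sigma = (w * sigmas).sum(axis=0)
    return mu, sigma

mus = np.array([[0.0, 1.0], [2.0, 3.0]])      # two voices, 2-dim means
sigmas = np.array([[1.0, 1.0], [0.5, 2.0]])   # matching variances
mu, sigma = interpolate_gaussians(mus, sigmas, [0.25, 0.75])
```

Sweeping the ratio from [1, 0] to [0, 1] morphs the synthetic voice continuously from one speaker toward the other, without any adaptation data for the intermediate points.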
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
[Figure: unvoiced (white noise) / voiced (pulse train) excitation e(n)]
• Spectral envelope extraction:
The harmonic effect often causes problems
[Figure: power (dB) vs. frequency (0–8 kHz)]
• Phase:
Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics:
Statistics do not vary within an HMM state
• Conditional independence assumption:
State output probability depends only on the current state
• Weak duration modeling:
State duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → Dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → Graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → Explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled
[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]
• Why?
− Details of the spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
− Mel-cepstrum
− LSP
• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis
• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? No

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum trajectories over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512).]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶  | −9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶  | −5.1
12.7 (1)  | 36.6 (4 × 1,024)     | 50.7    | < 10⁻⁶  | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements

Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples scattered in the (y1, y2) plane with the NN prediction falling between the two modes.]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
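To make the conditional-mean point above concrete, here is a tiny numpy sketch (purely illustrative data, not from the talk's experiments): when one input corresponds to two output modes, the constant prediction minimizing MSE is their mean — a value the speaker never actually produces.

```python
import numpy as np

# One-to-many mapping: for a single linguistic input, two equally likely
# acoustic outputs, y = -1 and y = +1 (illustrative assumption).
y = np.array([-1.0, 1.0] * 500)

# Scan constant predictions and pick the MSE-minimizing one.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]
print(best)  # ~0.0: the conditional mean, far from both modes
```

This is exactly why a unimodal regression output is a poor fit for one-to-many mappings, motivating the mixture density output layer.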
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) over y.]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
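The 1-dim, 2-mix MDN output layer above can be sketched in numpy as follows (the six activations z1…z6 and the query point are illustrative numbers; the mapping to weights/means/variances follows the activation functions listed on the slide):

```python
import numpy as np

def mdn_output_layer(z):
    """Map six output activations z1..z6 to a 2-mix Gaussian mixture."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # w1(x), w2(x): softmax
    mu = z[2:4]                               # mu1(x), mu2(x): linear
    sigma = np.exp(z[4:6])                    # sigma1(x), sigma2(x): exponential
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    # p(y | x) as the weighted sum of Gaussian components
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float((w * comp).sum())

w, mu, sigma = mdn_output_layer([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
print(w, mu, sigma)  # weights sum to 1, sigmas strictly positive
```

Training then maximizes the log of this mixture density over the data, instead of minimizing MSE.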
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix    3.537 ± 0.113
           2 mix    3.397 ± 0.115
DNN        4×1024   3.635 ± 0.127
           5×1024   3.681 ± 0.109
           6×1024   3.652 ± 0.108
           7×1024   3.637 ± 0.129
DMDN       1 mix    3.654 ± 0.117
(4×1024)   2 mix    3.796 ± 0.107
           4 mix    3.766 ± 0.113
           8 mix    3.805 ± 0.113
           16 mix   3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements

Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes/syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: RNN unrolled over time; inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between successive hidden states.]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. Inputs x_t and h_{t−1} (with biases b_i, b_f, b_o, b_c) feed the input gate (write), forget gate (reset), and output gate (read) through sigmoid units, and the memory cell c_t through tanh; the block output h_t is the output-gated tanh of the cell.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
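A single-unit LSTM step following the gate structure in the figure might look like this (a minimal sketch; all weights, biases and the input sequence are made-up illustrative values, not trained parameters):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: sigmoid gates around a linear memory cell."""
    z = lambda g: W[g][0] * x_t + W[g][1] * h_prev + b[g]
    i = sigm(z("i"))                       # input gate: write
    f = sigm(z("f"))                       # forget gate: reset
    o = sigm(z("o"))                       # output gate: read
    c = f * c_prev + i * np.tanh(z("c"))   # linear memory cell update
    h = o * np.tanh(c)                     # block output
    return h, c

# Illustrative weights on [x_t, h_{t-1}] and zero biases for each gate.
W = {g: (0.5, 0.1) for g in ("i", "f", "o", "c")}
b = {g: 0.0 for g in ("i", "f", "o", "c")}
h, c = 0.0, 0.0
for x in [1.0, 1.0, -1.0]:                 # tiny input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```

The key point is the additive cell update `f * c_prev + i * tanh(...)`: information persists unless the forget gate resets it, which is what gives the LSTM its longer memory than the basic RNN.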
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10⁻¹⁰
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10⁻⁶
15.8     | –         | 34.0      | –          | 50.2    | −6.2 | < 10⁻⁹
28.4     | –         | –         | 33.6       | 38.0    | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
HMM-based speech synthesis [4]

[Figure: system overview. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79
Speech production process

[Figure: text (concept) → modulation of a carrier wave by speech information: fundamental frequency (voiced/unvoiced), frequency transfer characteristics, magnitude, start–end; air flow through the vocal tract; sound source (voiced: pulse, unvoiced: noise) → speech.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model

[Figure: pulse train & white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) * e(n).]

x(n) = h(n) * e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part

H(e^{jω}) should be defined by HMM state-output vectors,
e.g. mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
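A toy rendering of x(n) = h(n) * e(n) in numpy (illustrative values only: a 100 Hz pulse train stands in for voiced excitation, and a decaying exponential stands in for a real vocal-tract impulse response):

```python
import numpy as np

fs, f0 = 16000, 100                         # 16 kHz sampling, 100 Hz F0
n = np.arange(fs // 10)                     # 100 ms of samples

# Voiced source: pulse train e(n) with one pulse per pitch period.
e = (n % (fs // f0) == 0).astype(float)

# Stand-in vocal tract filter: decaying impulse response h(n).
h = 0.97 ** np.arange(50)

# Speech as the convolution x(n) = h(n) * e(n).
x = np.convolve(e, h)[: len(n)]
print(x[:5])
```

Each pitch pulse excites a copy of the impulse response, which is exactly how the synthesis filter later reconstructs a waveform from generated excitation and spectral parameters.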
Parametric models of speech signal

Autoregressive (AR) model:
H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:
H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:
ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
Examples of speech spectra

[Figure: log magnitude (dB, −20 to 80) vs frequency (0–5 kHz). (a) ML-based cepstral analysis; (b) linear prediction.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]

[Figure: the same training/synthesis system overview: speech database → spectral & excitation parameter extraction → HMM training; text → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors

o_t = [c_tᵀ, ∆c_tᵀ, ∆²c_tᵀ, p_t, δp_t, δ²p_t]ᵀ

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t).]

O: observation sequence o1, o2, o3, o4, o5, …, oT
Q: state sequence 1, 1, 1, 1, 2, 2, …, 3, 3
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure

[Figure: the observation vector o_t = [c_tᵀ, ∆c_tᵀ, ∆²c_tᵀ, p_t, δp_t, δ²p_t]ᵀ split into streams: spectrum (stream 1: o_t¹) and excitation (streams 2–4: o_t², o_t³, o_t⁴).]

b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_tˢ) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process

data & labels
↓
Compute variance floor (HCompV)
↓
Initialize CI-HMMs by segmental k-means (HInit)   [monophone (context-independent, CI)]
↓
Reestimate CI-HMMs by EM algorithm (HRest & HERest)
↓
Copy CI-HMMs to CD-HMMs (HHEd CL)   [fullcontext (context-dependent, CD)]
↓
Reestimate CD-HMMs by EM algorithm (HERest)
↓
Decision tree-based clustering (HHEd TB)
↓
Reestimate CD-HMMs by EM algorithm (HERest)
↓
Untie parameter tying structure (HHEd UT)
↓
Estimated HMMs → Estimate CD-dur models from FB stats (HERest) → Decision tree-based clustering (HHEd TB) → Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]

[Figure: binary decision tree over yes/no context questions (e.g. L=voice?, L="w"?, R=silence?, L="gy"?) mapping fullcontext states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n to leaf nodes; leaf distributions are shared among the synthesized states.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0.]

Spectrum & excitation can have different context dependency
→ build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]

[Figure: trellis over t = 1, …, T = 8 showing state i occupied from t0 to t1.]

Probability to enter state i at t0 then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering

[Figure: an HMM with separate decision trees for mel-cepstrum, for F0, and for the state duration model.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]

[Figure: the same training/synthesis system overview: speech database → spectral & excitation parameter extraction → HMM training; text → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence

[Figure: the same 3-state left-to-right HMM; observation sequence O = o1, …, oT aligned with the best state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3; state durations D = 4, 10, 5.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state output Gaussians (mean & variance) over time.]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_tᵀ, ∆c_tᵀ]ᵀ,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as

o = Wc

where o stacks […, c_{t−1}ᵀ, ∆c_{t−1}ᵀ, c_tᵀ, ∆c_tᵀ, c_{t+1}ᵀ, ∆c_{t+1}ᵀ, …]ᵀ, c stacks […, c_{t−2}ᵀ, c_{t−1}ᵀ, c_tᵀ, c_{t+1}ᵀ, …]ᵀ, and W is the band matrix with row blocks

[ 0  I  0  0], [−I  I  0  0], [ 0  0  I  0], [ 0 −I  I  0], [ 0  0  0  I], [ 0  0 −I  I], …

(each static row selects c_t with I; each delta row computes c_t − c_{t−1} with −I, I).
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
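For 1-dimensional static features, the W matrix above can be sketched as follows (T and the static sequence are illustrative; the first delta row assumes c_0 = 0 as a boundary condition):

```python
import numpy as np

def make_W(T):
    """Stack static/delta rows: static row selects c_t, delta row is c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static: c_t
        W[2 * t + 1, t] = 1.0        # delta:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0])        # illustrative static sequence
o = make_W(3) @ c                    # [c1, dc1, c2, dc2, c3, dc3]
print(o)                             # [0. 0. 1. 1. 3. 2.]
```

Multiplying any candidate static sequence by W yields the full static+delta observation vector, which is what ties the per-frame distributions together during generation.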
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹ μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]

[Figure: band structure of the linear system WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹μ_q̂; the left-hand side is a banded matrix acting on c1, …, cT, and the right-hand side stacks the precision-weighted means μ_q̂1, …, μ_q̂T.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
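A minimal numpy sketch of solving the generation equation WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹μ_q̂ (diagonal covariance and made-up means/variances; an illustration of the linear system, not the actual HTS implementation, which exploits the band structure):

```python
import numpy as np

T = 5
# Static + delta W for 1-dim features (delta c_t = c_t - c_{t-1}, c_0 = 0).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Made-up per-frame static & delta means (interleaved) and variances.
mu = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
var = np.full(2 * T, 0.1)

# Normal equation: (W^T Sigma^-1 W) c = W^T Sigma^-1 mu
P = W.T @ np.diag(1.0 / var) @ W
r = W.T @ (mu / var)
c = np.linalg.solve(P, r)            # maximum-likelihood static trajectory
print(c)
```

The delta constraints couple neighboring frames, so the solution is a smooth trajectory rather than the step-wise mean sequence obtained without dynamic features.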
Generated speech parameter trajectory

[Figure: static and dynamic feature distributions (mean & variance) over time, with the generated smooth trajectory ĉ.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Figure: the same training/synthesis system overview: speech database → spectral & excitation parameter extraction → HMM training; text → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) * e(n).]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline

Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements

Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: adaptive training over training speakers yields an average-voice model, which is then adapted to target speakers.]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new model λ′ obtained by interpolating between HMM sets λ1, λ2, λ3, λ4 with ratios I(λ′, λk).]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline

Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements

Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation:
difficult to model mixes of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced).]

• Spectral envelope extraction:
harmonic effects often cause problems

[Figure: power (dB) spectrum, 0–8 kHz, showing harmonic ripple.]

• Phase:
important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
statistics do not vary within an HMM state

• Conditional independence assumption:
state output probability depends only on the current state

• Weak duration modeling:
state duration probability decreases exponentially with time

None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs generated spectrograms, 0–8 kHz.]

• Why?

− Details of spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics (adaptation; interpolation: eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements

Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training:
learn the relationship between linguistic & acoustic features

• Synthesis:
map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, utterance-level features
− e.g. phone identity, POS, stress, # of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree of yes/no context questions partitioning the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y.]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples vs. NN prediction in (y1, y2) space]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN — the network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^4 h_i w_{ij}

1-dim, 2-mix MDN:
  w1(x) = exp(z1) / Σ_{m=1}^2 exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^2 exp(z_m)
  μ1(x) = z3                              μ2(x) = z4
  σ1(x) = exp(z5)                         σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
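The 1-dim, 2-mix case above fits in a few lines of plain Python. This is an illustrative reconstruction, not the talk's code; the function names are mine:

```python
import math

def mdn_outputs(z):
    """Map 6 raw network outputs z1..z6 to a 1-dim, 2-mixture GMM:
    softmax for weights, identity for means, exp for std deviations."""
    denom = math.exp(z[0]) + math.exp(z[1])
    w = [math.exp(z[0]) / denom, math.exp(z[1]) / denom]   # w1(x), w2(x)
    mu = [z[2], z[3]]                                      # mu1(x), mu2(x)
    sigma = [math.exp(z[4]), math.exp(z[5])]               # sigma1(x), sigma2(x)
    return w, mu, sigma

def gmm_density(y, w, mu, sigma):
    """p(y | x) under the predicted mixture; training minimizes the
    negative log of this likelihood."""
    return sum(
        wm / (s * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * ((y - m) / s) ** 2)
        for wm, m, s in zip(w, mu, sigma))
```

The exp activation guarantees positive variances and the softmax guarantees the weights form a distribution, whatever raw values the network produces.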
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → MDN outputs w, μ, σ for each frame input x1, x2, …, xT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}; outputs y_{t−1}, y_t, y_{t+1}; recurrent connections between hidden states]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with sigmoid input, forget, and output gates and tanh input/output nonlinearities; input gate: write, output gate: read, forget gate: reset]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
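The write/read/reset behavior of the block can be sketched for a single scalar unit; a minimal sketch with invented weight names, omitting the peephole connections some LSTM variants add:

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One step of an LSTM block for scalar states.
    W maps gate name -> (w_x, w_h, b). The input gate writes into the
    linear memory cell, the forget gate resets it, the output gate reads."""
    sigm = lambda a: 1.0 / (1.0 + math.exp(-a))
    gate = lambda g, f: f(W[g][0] * x + W[g][1] * h_prev + W[g][2])
    i = gate("i", sigm)            # input gate
    f = gate("f", sigm)            # forget gate
    o = gate("o", sigm)            # output gate
    g = gate("c", math.tanh)       # squashed cell input
    c = f * c_prev + i * g         # linear memory cell update
    h = o * math.tanh(c)           # exposed hidden state
    return h, c
```

Because the cell update is additive (c = f·c_prev + i·g) rather than squashed through a nonlinearity at every step, information can survive many steps when the forget gate stays near 1, which is what gives the LSTM its longer memory than the basic RNN.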
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x1, x2, …, xT to y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech production process
[Figure: speech production — text (concept) drives modulation of a carrier wave by speech information: the sound source (voiced: pulse, unvoiced: noise) sets the fundamental frequency and voiced/unvoiced switching, and the frequency transfer characteristics shape the air flow into speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79
Source-filter model
[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part, which should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
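The source-filter relation x(n) = h(n) ∗ e(n) is just a convolution; a toy sketch (function names are mine, and a real vocoder would use an MLSA-style filter rather than a raw FIR response):

```python
def pulse_train(n_samples, period):
    """Toy voiced excitation e(n): unit pulses every `period` samples
    (unvoiced frames would use white noise instead)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def apply_filter(e, h):
    """x(n) = h(n) * e(n): convolve the excitation with the impulse
    response of the linear time-invariant vocal-tract filter."""
    x = [0.0] * len(e)
    for n in range(len(e)):
        for m in range(min(len(h), n + 1)):
            x[n] += h[m] * e[n - m]
    return x
```

With a pulse train as input, the output is simply the impulse response repeated at the pitch period, which is why the filter's spectral envelope H shapes the formants while the source sets F0.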
Parametric models of speech signal
Autoregressive (AR) model:
  H(z) = K / (1 − Σ_{m=1}^M c(m) z^{−m})

Exponential (EX) model:
  H(z) = exp Σ_{m=0}^M c(m) z^{−m}

Estimate model parameters based on ML:
  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
Examples of speech spectra
[Figure: log magnitude spectra (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
o_t = [c_t^⊤, ∆c_t^⊤, ∆²c_t^⊤, p_t, δp_t, δ²p_t]^⊤

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients, ∆∆ mel-cepstral coefficients
Excitation part: log F0 p_t, ∆ log F0, ∆∆ log F0
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM — initial probability π1, transition probabilities a11, a12, a22, a23, a33, state-output distributions b1(o_t), b2(o_t), b3(o_t)]

Observation sequence O: o1 o2 o3 o4 o5 … oT
State sequence Q: 1 1 1 1 2 2 … 3 3
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
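Reading the figure concretely: the joint probability of an observation sequence and a state sequence factorizes as P(O, Q | λ) = π_{q1} b_{q1}(o1) Π_t a_{q_{t−1}q_t} b_{q_t}(o_t). A toy sketch (all numbers invented):

```python
def hmm_joint_prob(pi, a, b, states):
    """P(O, Q | lambda) for a toy HMM.
    pi: initial state probabilities; a: transition matrix;
    b[t][i] = b_i(o_t), the state-output probability of the
    observation at frame t, precomputed for brevity."""
    p = pi[states[0]] * b[0][states[0]]
    for t in range(1, len(states)):
        p *= a[states[t - 1]][states[t]] * b[t][states[t]]
    return p
```

Summing this quantity over all state sequences Q gives P(O | λ); taking the max instead gives the Viterbi approximation used later in parameter generation.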
Multi-stream HMM structure
[Figure: state-output vector o_t split into streams — stream 1: c_t, ∆c_t, ∆²c_t (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation)]

b_j(o_t) = Π_{s=1}^S (b_{sj}(o_t^s))^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
From data & labels:

monophone (context-independent, CI):
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)

fullcontext (context-dependent, CD):
4. Copy CI-HMMs to CD-HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
→ Estimated HMMs

9. Estimate CD duration models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• Preceding/succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding/current/succeeding syllable
• Accent/stress of preceding/current/succeeding syllable
• Position of current syllable in current word
• # of preceding/succeeding stressed/accented syllables in phrase
• # of syllables from previous/to next stressed/accented syllable
• Guess at part of speech of preceding/current/succeeding word
• # of syllables in preceding/current/succeeding word
• Position of current word in current phrase
• # of preceding/succeeding content words in current phrase
• # of words from previous/to next content word
• # of syllables in preceding/current/succeeding phrase

Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree — yes/no questions such as "L = voice?", "L = w?", "R = silence?", "L = gy?" split context-dependent states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) down to leaf nodes, from which the synthesized states are drawn]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: state occupancy from t0 to t1 within a T = 8 frame sequence]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with observation sequence O = o1 … oT, state sequence Q = 1 1 1 1 2 2 … 3 3, and state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: step-wise mean trajectory with per-state variance]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W stacks, for every frame t, a static row (… 0 I 0 …) selecting c_t and a delta row (… −I I 0 …) computing c_t − c_{t−1}:

[Figure: band matrix W mapping the static sequence (…, c_{t−1}, c_t, c_{t+1}, …) to (…, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, …)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
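A minimal sketch of solving this linear system for a 1-dimensional static + ∆ stream, in pure Python. The boundary convention ∆c_1 = c_1 (i.e., c_0 = 0) is my simplifying assumption, and real implementations exploit the banded structure of W^⊤Σ^{−1}W instead of dense Gauss-Jordan elimination:

```python
def build_w(T):
    """W stacks, per frame, a static row (selects c_t) and a delta row
    (computes c_t - c_{t-1}; c_0 is taken as 0 at the boundary)."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0          # static row
        W[2 * t + 1][t] = 1.0      # delta row: +c_t
        if t > 0:
            W[2 * t + 1][t - 1] = -1.0  # delta row: -c_{t-1}
    return W

def mlpg(mu, var):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for diagonal Sigma.
    mu/var interleave static and delta statistics per frame."""
    T = len(mu) // 2
    W = build_w(T)
    A = [[sum(W[k][i] * W[k][j] / var[k] for k in range(2 * T))
          for j in range(T)] for i in range(T)]
    rhs = [sum(W[k][i] * mu[k] / var[k] for k in range(2 * T)) for i in range(T)]
    for i in range(T):                       # Gauss-Jordan elimination
        piv = A[i][i]
        A[i] = [v / piv for v in A[i]]
        rhs[i] /= piv
        for r in range(T):
            if r != i and A[r][i] != 0.0:
                f = A[r][i]
                A[r] = [vr - f * vi for vr, vi in zip(A[r], A[i])]
                rhs[r] -= f * rhs[i]
    return rhs
```

Because every delta row couples c_t to c_{t−1}, the solution trades off matching the static means against matching the delta means, which is exactly the smoothing effect the slides describe.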
Speech parameter generation algorithm [9]
[Figure: banded structure of the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q — window coefficients (1, −1) fill the bands, with per-frame means μ_{q1} … μ_{qT} on the right-hand side and static features c1 … cT as unknowns]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated static and dynamic speech parameter trajectories c, with per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system — training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: average-voice model — adaptive training on training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1 … λ4 with ratios I(λ′, λ) — λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
  difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]
• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz]
• Phase:
  important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state-output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM, polynomial segment model, trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model, autoregressive HMM, trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectra over 0–8 kHz]
• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree partitioning the acoustic space with yes/no questions]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with linguistic features x as input, hidden layers h1, h2, h3, and acoustic features y as output]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
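The mapping itself is an ordinary feed-forward pass; a toy sketch in pure Python, with invented layer sizes and weights (the slides' systems use sigmoid or ReLU hidden layers and a linear output layer):

```python
import math

def dnn_forward(x, layers):
    """Feed linguistic features x through a list of
    (weights, biases, activation) layers; sigmoid hidden layers,
    linear output layer."""
    h = x
    for W, b, act in layers:
        z = [sum(w * v for w, v in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        h = [1.0 / (1.0 + math.exp(-v)) for v in z] if act == "sigmoid" else z
    return h
```

At synthesis time this function is evaluated once per frame, turning each frame's linguistic feature vector into the mean of the acoustic feature vector for that frame.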
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features for frames 1 … T) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral, excitation, and V/UV features) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced
System          Config    MOS
HMM             1 mix     3.537 ± 0.113
HMM             2 mix     3.397 ± 0.115
DNN             4×1024    3.635 ± 0.127
DNN             5×1024    3.681 ± 0.109
DNN             6×1024    3.652 ± 0.108
DNN             7×1024    3.637 ± 0.129
DMDN (4×1024)   1 mix     3.654 ± 0.117
DMDN (4×1024)   2 mix     3.796 ± 0.107
DMDN (4×1024)   4 mix     3.766 ± 0.113
DMDN (4×1024)   8 mix     3.805 ± 0.113
DMDN (4×1024)   16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: unrolled RNN — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with tanh input/output squashing; input gate i_t (write), forget gate (reset) and output gate (read), each a sigmoid over x_t, h_{t−1} and a bias, gating the cell to produce h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
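The gate arithmetic sketched in the figure can be written out directly. Below is a minimal NumPy sketch of one LSTM step; the stacked [input; forget; output; candidate] pre-activation layout, the shared weight matrix W, and all sizes are illustrative assumptions, not the talk's exact formulation.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: multiplicative gates around a linear memory cell.
    W maps [x; h_prev] to four stacked pre-activations (assumed layout)."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i = sigm(z[:n])               # input gate  (write)
    f = sigm(z[n:2 * n])          # forget gate (reset)
    o = sigm(z[2 * n:3 * n])      # output gate (read)
    g = np.tanh(z[3 * n:])        # candidate cell input
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```

Because the cell update c = f·c_prev + i·g is additive rather than squashed, information can persist across many steps — the "better memory" the slide refers to.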
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral     z      p
50.0       14.2        –           –            35.8       12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2        5.1    < 10⁻⁶
15.8       –           34.0        –            50.2       −6.2    < 10⁻⁹
28.4       –           –           33.6         38.0       −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Source-filter model
[Figure: pulse train & white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source (excitation) part. H(e^jω): vocal tract resonance part — should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
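The source-filter equation x(n) = h(n) ∗ e(n) is just a convolution. A minimal NumPy sketch with a toy voiced excitation and a toy impulse response (the f0 value and the decaying-exponential h(n) are illustrative assumptions, not a real vocal tract):

```python
import numpy as np

fs = 16000                          # sampling rate (Hz)
f0 = 200.0                          # fundamental frequency of the pulse train
n = np.arange(400)

# Source: periodic pulse train e(n) standing in for voiced excitation.
period = int(fs / f0)
e = np.zeros_like(n, dtype=float)
e[::period] = 1.0

# Filter: a toy impulse response h(n) standing in for the vocal tract.
h = 0.9 ** np.arange(50)

# Speech is the convolution x(n) = h(n) * e(n).
x = np.convolve(e, h)[:len(n)]
```

Each pulse excites a copy of h(n), so the output repeats at the pitch period — the time-domain picture of the H(e^jω)E(e^jω) factorization above.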
Parametric models of speech signal
Autoregressive (AR) model:    H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})
Exponential (EX) model:       H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
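For the exponential model, the log-magnitude spectrum follows directly from the cepstral coefficients, since |exp(C)| = exp(Re C). A minimal NumPy sketch (the cepstral coefficients are toy values; evaluation is by zero-padded FFT on the unit circle):

```python
import numpy as np

def ex_log_spectrum(c, n_fft=512):
    """Log-magnitude spectrum of the exponential model
    H(z) = exp(sum_m c(m) z^-m), evaluated on the unit circle.
    Zero-pad the cepstrum, FFT it; the real part of the frequency
    response C(e^jw) is log|H(e^jw)| because |exp(C)| = exp(Re C)."""
    buf = np.zeros(n_fft)
    buf[:len(c)] = c
    C = np.fft.rfft(buf)        # sum_m c(m) e^{-j w m}
    return C.real               # log|H| at n_fft/2 + 1 frequency bins

c = np.array([0.5, 0.3, -0.1])  # toy cepstral coefficients
log_mag = ex_log_spectrum(c)
```

At ω = 0 the log magnitude is simply the sum of the coefficients, which gives a quick sanity check on the implementation.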
Examples of speech spectra
[Figure: two log-magnitude spectra (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
[Figure: system diagram — Training part: speech database → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → synthesized speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
[Figure: state-output vector o_t = [c_t, ∆c_t, ∆²c_t, p_t, δp_t, δ²p_t] — spectrum part: mel-cepstral coefficients with ∆ and ∆∆; excitation part: log F0 with ∆ and ∆∆]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33 and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, …, oT aligned with state sequence Q = 1 1 1 1 2 2 3 3 …]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
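The quantity such a model assigns to an observation sequence is computed by the standard forward algorithm. A minimal NumPy sketch for a 3-state left-to-right HMM like the one in the figure (all probabilities are toy values; B[t, i] stands in for the state-output likelihood b_i(o_t)):

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: alpha[t, i] = p(o_1..o_t, q_t = i | model)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]   # sum over previous states
    return alpha

# Toy 3-state left-to-right HMM.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.1],      # b_i(o_t) for t = 1..4
              [0.8, 0.2, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.8]])
alpha = forward(pi, A, B)
likelihood = alpha[-1].sum()        # p(O | model)
```

The left-to-right structure shows up as zeros below the diagonal of A: probability mass can only stay in a state or move forward, matching the monotone state sequence in the figure.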
Multi-stream HMM structure
[Figure: observation vector o_t split into streams — stream 1: c_t, ∆c_t, ∆²c_t (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation) — each with its own output distribution b_j^s(o_t^s)]

b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
[Flowchart: data & labels →
Compute variance floor (HCompV) →
Initialize context-independent (monophone) HMMs by segmental k-means (HInit) →
Reestimate CI-HMMs by EM algorithm (HRest & HERest) →
Copy CI-HMMs to context-dependent (fullcontext) HMMs (HHEd CL) →
Reestimate CD-HMMs by EM algorithm (HERest) →
Decision tree-based clustering (HHEd TB) →
Reestimate CD-HMMs by EM algorithm (HERest) →
Untie parameter tying structure (HHEd UT) →
estimated HMMs.
Then: estimate CD duration models from forward-backward stats (HERest) → decision tree-based clustering (HHEd TB) → estimated duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

⇒ Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: binary decision tree over context questions (e.g., L=voiced?, L="w"?, R=silence?, L="gy"?) routing fullcontext states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n down yes/no branches to shared leaf nodes (synthesized states)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: trellis over state i and frames t = 1 … T = 8, entering state i at t0 and leaving at t1 + 1]

Probability of entering state i at t0 and leaving at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: system diagram (as before) — Training part: speech database → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → synthesized speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: the 3-state left-to-right HMM (as before) with observation sequence O and best state sequence Q = 1 1 1 1 2 2 3 3 …; state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state output means and variances over time]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t^T, ∆c_t^T]^T,    ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

[… c_{t−1}^T, ∆c_{t−1}^T, c_t^T, ∆c_t^T, c_{t+1}^T, ∆c_{t+1}^T …]^T = W [… c_{t−2}, c_{t−1}, c_t, c_{t+1} …]^T

where each static row of W contains a single I block selecting c_t, and each delta row contains the pair (−I, I) computing c_t − c_{t−1}.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
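The window matrix W above can be built explicitly. A minimal NumPy sketch for 1-dim static features and the simple difference delta ∆c_t = c_t − c_{t−1} (boundary handling at t = 1 is an assumption of this sketch):

```python
import numpy as np

def delta_matrix(T):
    """Stack static and delta rows so that o = W c, with
    delta(c_t) = c_t - c_{t-1} and 1-dim static features."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0           # static row: c_t
        W[2 * t + 1, t] = 1.0       # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 2.0])
o = delta_matrix(len(c)) @ c        # [c_1, dc_1, c_2, dc_2, ...]
```

Stacking statics and deltas this way is what lets the generation step treat the dynamic-feature constraint as a single linear equation o = Wc.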
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μ_q, Σ_q)

Setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0 gives

W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
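The normal equations above amount to a small (banded) linear solve for the static trajectory. A minimal NumPy sketch, under the assumptions of 1-dim features, a diagonal covariance, and the simple difference delta ∆c_t = c_t − c_{t−1}:

```python
import numpy as np

def mlpg(W, mu, var):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static
    trajectory c, with diagonal Sigma given by `var`."""
    P = W.T * (1.0 / var)            # W^T Sigma^-1 (column-wise scaling)
    return np.linalg.solve(P @ W, P @ mu)

# Static + delta window matrix (2 rows per frame, delta = c_t - c_{t-1}).
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 1.0, 3.0, 2.0, 2.0, -1.0])  # target means
var = np.ones(2 * T)                 # toy unit variances
c = mlpg(W, mu, var)
```

With consistent targets, the solve recovers the underlying static trajectory exactly; with conflicting static and delta means it returns the variance-weighted compromise, which is the smoothing effect discussed on the slides.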
Speech parameter generation algorithm [9]
[Figure: the linear system W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q written out — W is the banded matrix of I and (−I, I) blocks stacking static and delta rows, Σ_q^{−1} is block-diagonal, c = [c_1 … c_T] and μ_q = [μ_{q1}, μ_{q2}, …, μ_{qT}]]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated static and dynamic speech parameter trajectories c, shown against per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: system diagram (as before) — Training part: speech database → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → synthesized speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: new voice λ′ interpolated among HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
difficult to model mixes of voiced/unvoiced sounds (e.g., voiced fricatives)

[Figure: pulse train (voiced) and white noise (unvoiced) switching to form excitation e(n)]

• Spectral envelope extraction:
harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase:
important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics:
statistics do not vary within an HMM state

• Conditional independence assumption:
state output probability depends only on the current state

• Weak duration modeling:
state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
− Details of spectral (formant) structure disappear
− Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics (adaptation; interpolation via eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision trees partitioning the acoustic space through yes/no context questions]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
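The mapping this slide describes is a plain feed-forward pass from a linguistic feature vector to an acoustic feature vector. A minimal NumPy sketch with sigmoid hidden layers and a linear output layer; all sizes and weights are toy values, not the experimental configuration:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Feed-forward pass: sigmoid hidden layers, linear output layer,
    mapping a linguistic feature vector x to acoustic features."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layers
    return weights[-1] @ h + biases[-1]           # linear output layer

rng = np.random.default_rng(1)
sizes = [10, 16, 16, 5]   # toy: 10 linguistic -> 5 acoustic features
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), weights, biases)
```

The linear output layer is what the later slides replace with a mixture density layer: the forward pass stays the same, only the interpretation of the final activations changes.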
Framework
[Figure: framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1 … T; duration prediction feeds the frame-level features) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? — no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶     −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: an LSTM block; sigmoid input, forget, and output gates (each fed by x_t, h_{t−1}, and a bias) modulate a linear memory cell c_t, with tanh nonlinearities on the cell input and output, producing h_t]
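One LSTM step can be sketched as follows, matching the gate structure in the figure: sigmoid write/reset/read gates around a linear memory cell. Dimensions and random initialization are illustrative assumptions, not the configuration from the experiments.

```python
import numpy as np

# Sketch of one LSTM step: input/forget/output gates around a linear
# memory cell, tanh on the cell input and output. Toy dimensions.
rng = np.random.default_rng(1)
D_in, D_h = 4, 6

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Each gate / cell input sees [x_t, h_{t-1}] plus a bias
W_i, W_f, W_o, W_c = (rng.normal(0, 0.1, (D_h, D_in + D_h)) for _ in range(4))
b_i, b_f, b_o, b_c = (np.zeros(D_h) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigm(W_i @ z + b_i)                       # input gate: "write"
    f = sigm(W_f @ z + b_f)                       # forget gate: "reset"
    o = sigm(W_o @ z + b_o)                       # output gate: "read"
    c = f * c_prev + i * np.tanh(W_c @ z + b_c)   # linear memory cell
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape)  # (6,)
```

The additive cell update `c = f * c_prev + i * ...` is what lets information survive many steps instead of decaying through repeated matrix multiplications.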
LSTM-based SPSS [33, 34]

[Figure: pipeline from TEXT through text analysis, input feature extraction, and duration prediction to per-frame inputs x_1, x_2, …, x_T; an LSTM maps these to outputs y_1, y_2, …, y_T, followed by parameter generation and waveform synthesis to SPEECH]
Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral      z        p
  50.0       14.2        –           –            35.8       12.0   < 10^−10
  –          –           30.2        15.6         54.2        5.1   < 10^−6
  15.8       –           34.0        –            50.2       −6.2   < 10^−9
  28.4       –           –           33.6         38.0       −1.5     0.138
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Parametric models of speech signal

Autoregressive (AR) model:
  H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:
  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:
  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
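For the exponential model, the log-magnitude spectrum is simply the real part of the cepstrum's Fourier transform, since log|exp C(ω)| = Re C(ω). A sketch with toy cepstral coefficients (the values and FFT size are illustrative assumptions):

```python
import numpy as np

# Sketch: evaluate the exponential (EX) model's log-magnitude spectrum,
# log|H(e^{jw})| = Re{ sum_{m=0}^{M} c(m) e^{-jwm} }, on an FFT grid.
# The cepstral coefficients below are toy values (M = 3).
c = np.array([1.0, 0.5, 0.25, 0.125])
N = 16                                 # FFT size (illustrative)
C = np.fft.rfft(c, n=N)                # sum_m c(m) e^{-jwm} at w_k = 2*pi*k/N
log_mag = np.real(C)                   # log|H| for the EX model
print(log_mag[0])  # 1.875 (= c.sum(), the value at w = 0)
```

This is why cepstral analysis works directly in the log-spectral domain, whereas the AR model's spectrum involves the reciprocal of a polynomial in z^{−1}.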
Examples of speech spectra

[Figure: two log-magnitude spectra (dB) over 0–5 kHz: (a) ML-based cepstral analysis, (b) linear prediction]
HMM-based speech synthesis [4]

[Figure: system diagram. Training part: speech database → spectral & excitation parameter extraction; labels + parameters → training of HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → synthesized speech]
Structure of state-output (observation) vectors

[Figure: observation vector o_t consists of a spectrum part (mel-cepstral coefficients c_t, Δc_t, Δ²c_t) and an excitation part (log F0 p_t, Δ log F0 δp_t, ΔΔ log F0 δ²p_t)]
Hidden Markov model (HMM)

[Figure: a 3-state left-to-right HMM with initial probability π_1, self-transitions a_11, a_22, a_33, forward transitions a_12, a_23, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); an observation sequence O = o_1, o_2, …, o_T is aligned to a state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]
Multi-stream HMM structure

[Figure: observation vector o_t split into streams: stream 1 = (c_t, Δc_t, Δ²c_t) (spectrum); streams 2–4 = p_t, δp_t, δ²p_t (excitation)]

The state-output probability is a weighted product over the streams:

  b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}
Training process

1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)   [monophone, context-independent (CI)]
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)   [fullcontext, context-dependent (CD)]
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter-tying structure (HHEd UT)
9. Estimate CD duration models from forward–backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ estimated HMMs & duration models (from data & labels)
Context-dependent acoustic modeling

• Preceding & succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding, current, succeeding syllable
• Accent & stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding & succeeding stressed/accented syllables in phrase
• # of syllables from previous to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding & succeeding content words in current phrase
• # of words from previous to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models
Decision tree-based state clustering [7]

[Figure: a binary decision tree with yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"?; fullcontext states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) are pooled at the leaf nodes, which define the synthesized states]
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ build decision trees individually
State duration models [8]

[Figure: a state i occupied from time t_0 to t_1 within an alignment of T = 8 frames]

Probability to enter state i at t_0 and then leave at t_1 + 1:

  χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} Π_{t=t_0}^{t_1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models from these occupancy statistics
Stream-dependent tree-based clustering

[Figure: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models]
HMM-based speech synthesis [4]

[Figure: the training/synthesis system diagram shown earlier, repeated as a recap before the discussion of parameter generation]
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

  ô = arg max_o p(o | w, λ)
    = arg max_o Σ_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)
Best state sequence

[Figure: the 3-state left-to-right HMM aligned to o_1, …, o_T with best state sequence 1, 1, 1, 1, 2, 2, 3, 3, … and per-state durations (e.g. 4, 10, 5 frames)]
Best state outputs w/o dynamic features

[Figure: per-state output distributions (mean & variance); ô becomes a step-wise mean vector sequence]
Using dynamic features

State output vectors include static & dynamic features:

  o_t = [c_t^⊤, Δc_t^⊤]^⊤,  Δc_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc: stacking all frames, each static row of W picks out c_t (… 0 I 0 …) and each delta row computes c_t − c_{t−1} (… −I I 0 …):

  [… c_{t−1}^⊤ Δc_{t−1}^⊤ c_t^⊤ Δc_t^⊤ c_{t+1}^⊤ Δc_{t+1}^⊤ …]^⊤ = W [… c_{t−2}^⊤ c_{t−1}^⊤ c_t^⊤ c_{t+1}^⊤ …]^⊤
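The structure of W can be made concrete with a short sketch for 1-dimensional features (M = 1); the boundary convention c_0 = 0 is an assumption made here for simplicity.

```python
import numpy as np

# Sketch: build the matrix W that maps a static sequence c = [c_1..c_T]
# to o = [c_1, dc_1, ..., c_T, dc_T] with dc_t = c_t - c_{t-1}.
# 1-dimensional features; c_0 is taken as 0 at the boundary (assumption).
def build_W(T):
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row selects c_t
        W[2 * t + 1, t] = 1.0        # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # ... minus c_{t-1}
    return W

c = np.array([1.0, 3.0, 2.0])
o = build_W(3) @ c
print(o)  # -> [1, 1, 3, 2, 2, -1]
```

Each pair of rows reproduces the (0 I), (−I I) block pattern of the slide's W matrix.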
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

  ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

  p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

  W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂
Speech parameter generation algorithm [9]

[Figure: the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q written out element-wise; because W couples only adjacent frames, W^⊤ Σ_q^{−1} W is a banded matrix, and the static trajectory c = [c_1, …, c_T] is solved efficiently from the stacked means μ_q1, …, μ_qT]
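The generation step then amounts to solving that linear system. A sketch for 1-dimensional features with diagonal covariances; all numbers are illustrative, and a production system would exploit the banded structure rather than a dense solve.

```python
import numpy as np

# Sketch of solving W' S^{-1} W c = W' S^{-1} mu for the static
# trajectory c (1-D features, diagonal covariances, dc_t = c_t - c_{t-1}
# with c_0 taken as 0; all numbers are illustrative).
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                # static row selects c_t
    W[2 * t + 1, t] = 1.0            # delta row: c_t ...
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0   # ... minus c_{t-1}

# Stacked per-frame [static, delta] means and variances from the HMM
mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0])
var = np.array([1.0, 0.1, 1.0, 0.1, 1.0, 0.1, 1.0, 0.1])

P = np.diag(1.0 / var)               # Sigma^{-1} (diagonal)
A = W.T @ P @ W                      # banded normal-equation matrix
b = W.T @ P @ mu
c = np.linalg.solve(A, b)            # smooth static trajectory
print(c.shape)  # (4,)
```

Tight delta variances pull consecutive c_t values toward the delta means, which is exactly the smoothing effect shown in the generated-trajectory slide.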
Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences, with the generated trajectory c passing smoothly through them]
HMM-based speech synthesis [4]

[Figure: the training/synthesis system diagram shown earlier, repeated as a recap before the discussion of waveform reconstruction]
Waveform reconstruction

[Figure: source-filter model. Generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model; adaptation → target speakers]

• Train an average-voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style
  → small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: an interpolated model λ′ placed among HMM sets λ_1, …, λ_4 with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
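At its simplest, interpolation mixes the parameters of representative models with the ratios I(λ′, λ_k). A toy sketch at the mean-vector level; the models, dimensions, and ratios are illustrative assumptions.

```python
import numpy as np

# Sketch of voice interpolation at the parameter level: mix the mean
# vectors of representative models with ratios I(lambda', lambda_k).
# Three toy 2-dimensional "models" stand in for full HMM sets.
mus = np.array([[0.0, 1.0],    # means of model lambda_1
                [2.0, 3.0],    # means of model lambda_2
                [4.0, 5.0]])   # means of model lambda_3
ratios = np.array([0.5, 0.3, 0.2])   # interpolation ratios, sum to 1
mu_interp = ratios @ mus             # mean of the interpolated voice
print(mu_interp)  # -> [1.4, 2.4]
```

Varying the ratios moves λ′ continuously between the representative voices, which is why no adaptation data is needed.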
Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]
• Phase
  Important, but usually ignored
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State-output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated running spectra over 0–8 kHz; the generated spectra are visibly blurrier]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
HMM-based acoustic modeling for SPSS [4]

[Figure: a binary decision tree with yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]

[Figure: a feed-forward network with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric features, plus duration prediction and frame-position features) → per-frame input features at frames 1, …, T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Framework

Is this new? No:
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over ~500 frames for natural speech, HMM (α = 1), and DNN (4×512)]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  HMM (α)     DNN (layers × units)   Neutral   p value    z value
  15.8 (16)   38.5 (4 × 256)         45.7      < 10^−6    −9.9
  16.1 (4)    27.2 (4 × 512)         56.8      < 10^−6    −5.1
  12.7 (1)    36.6 (4 × 1024)        50.7      < 10^−6    −11.5
Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling

[Figure: scattered data samples in the (y_1, y_2) plane vs. the NN prediction sitting at the conditional mean]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]

NN + mixture model (GMM)
→ the NN outputs the GMM weights, means & variances

For a 1-dim, 2-mix MDN, the output layer produces pre-activations z_1, …, z_6 (z_j = Σ_i h_i w_ij), passed through per-parameter activation functions:

• Weights → softmax activation:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),  w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
• Means → linear activation:
  μ_1(x) = z_3,  μ_2(x) = z_4
• Variances → exponential activation:
  σ_1(x) = exp(z_5),  σ_2(x) = exp(z_6)
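The slide's 1-dim, 2-mix output layer can be sketched directly; the pre-activation values z below are toy inputs.

```python
import numpy as np

# Sketch of the 1-dim, 2-mix mixture density output layer: from the
# pre-activations z_1..z_6, weights come out of a softmax, means are
# linear, and variances go through exp. The z values are toy inputs.
def mdn_params(z):
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # mixture weights: softmax
    mu = z[2:4]                               # means: linear
    sigma = np.exp(z[4:6])                    # std devs: exponential
    return w, mu, sigma

z = np.array([0.0, 0.0, -1.0, 2.0, 0.0, 0.5])
w, mu, sigma = mdn_params(z)
print(w)  # [0.5 0.5]
```

The softmax guarantees the weights sum to one and the exponential guarantees positive variances, so every output of the network is a valid GMM.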
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x_1, x_2, …, x_T; a deep MDN outputs w_k(x_t), μ_k(x_t), σ_k(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM, 1 mix:              3.537 ± 0.113
  HMM, 2 mix:              3.397 ± 0.115
  DNN, 4×1024:             3.635 ± 0.127
  DNN, 5×1024:             3.681 ± 0.109
  DNN, 6×1024:             3.652 ± 0.108
  DNN, 7×1024:             3.637 ± 0.129
  DMDN (4×1024), 1 mix:    3.654 ± 0.117
  DMDN (4×1024), 2 mix:    3.796 ± 0.107
  DMDN (4×1024), 4 mix:    3.766 ± 0.113
  DMDN (4×1024), 8 mix:    3.805 ± 0.113
  DMDN (4×1024), 16 mix:   3.791 ± 0.102
Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Examples of speech spectra
0 1 2 3 4 5Frequency (kHz)
-20
0
20
40
60
80
Log
mag
nitu
de (d
B)
0 1 2 3 4 5Frequency (kHz)
-20
0
20
40
60
80
Log
mag
nitu
de (d
B)
(a) ML-based cepstral analysis (b) Linear prediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
∆ Mel-cepstral coefficients
log F0
∆ log F0
∆∆ log F0
Spectrum part
Excitation part
∆ct
∆2ct
pt
δpt
δ2pt
ct
ot
∆∆ Mel-cepstral coefficients
Mel-cepstral coefficients
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
Spe
ctru
mE
xcita
tion
∆ct
∆2ct
pt
δ pt
δ 2pt
ct
ot
Stream
1
o1t
o2t
o3t
o4t
bj(ot)
b1j(o
1t )
b2j(o
2t )
b3j(o
3t )
b4j(o
4t )
bj(ot)Sprod
s=1
(bsj(o
st ))ws=
23
4
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest) using data & labels
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)

→ Estimated HMMs & estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Diagram: a binary decision tree splits context-dependent states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) with yes/no questions such as L=voice?, L=w?, R=silence?, L=gy?; the leaf nodes hold the clustered (synthesized) states.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Diagram: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Diagram: state i occupied from t₀ to t₁ on a time axis t = 1 … T = 8]

Probability to enter state i at t₀ then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} a_{ii}^{t₁−t₀} ∏_{t=t₀}^{t₁} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Diagram: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Diagram: Training part — spectral and excitation parameters are extracted from a SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part — TEXT is converted to labels by text analysis; parameters generated from the HMMs drive excitation generation and a synthesis filter to produce SYNTHESIZED SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Diagram: the 3-state left-to-right HMM as before, with observation sequence O = o₁ … o_T, best state sequence Q̂ = 1 1 1 1 2 2 3 3 …, and state durations D̂ = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Plot: per-state mean (line) and variance (band) over time]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
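A tiny illustration of that step-wise behaviour: without dynamic features, the ML output within each state is just the state mean repeated for the state's duration (toy mean values below; the durations D̂ = 4, 10, 5 follow the previous slide):

```python
import numpy as np

means = np.array([0.0, 1.0, -0.5])  # one static mean per state (toy values)
durations = [4, 10, 5]              # state durations from the duration model
traj = np.repeat(means, durations)  # step-wise mean vector sequence
```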
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,  ∆c_t = c_t − c_{t−1}

[Diagram: M-dimensional statics c_{t−2} … c_{t+2} and their M-dimensional deltas stacked into 2M-dimensional output vectors o_{t−1}, o_t, o_{t+1}]

The relationship between static and dynamic features can be arranged as o = Wc, where W is a block-banded matrix: each static row is [ … 0 I 0 0 … ] (it picks c_t) and each delta row is [ … −I I 0 0 … ] (it computes c_t − c_{t−1}).
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
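A minimal sketch of building W for a 1-dimensional static feature (so the I/−I blocks are scalars), under the assumption that c_{−1} = 0 at the boundary; the function name is illustrative:

```python
import numpy as np

def build_W(T):
    # Maps statics c (length T) to o = [c_t; delta c_t] per frame (length 2T),
    # with delta c_t = c_t - c_{t-1} and c_{-1} taken as 0
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: picks c_t
        W[2 * t + 1, t] = 1.0           # delta row: +c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row: -c_{t-1}
    return W
```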
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
[Diagram: the banded linear system W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂ written out element-wise — W built from 1/−1 entries, block-diagonal Σ_q̂⁻¹, stacked means μ_q̂₁ … μ_q̂T on the right-hand side, unknown static trajectory c₁ … c_T]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
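The generation step can be sketched as solving that banded system with a diagonal Σ. A toy implementation for 1-dimensional features (names illustrative; same c_{−1} = 0 boundary assumption as before; real implementations exploit the band structure instead of a dense solve):

```python
import numpy as np

def build_W(T):
    # Static row picks c_t; delta row computes c_t - c_{t-1} (c_{-1} = 0)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var, W):
    # Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu with diagonal Sigma = diag(var)
    prec = 1.0 / var
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```
If the stacked means are exactly consistent (μ = Wc for some c), the solve recovers that c; with real state means it returns the smooth compromise trajectory.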
Generated speech parameter trajectory
[Plot: generated static and dynamic speech parameter trajectories c, shown against the per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Diagram: Training part — spectral and excitation parameters are extracted from a SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part — TEXT is converted to labels by text analysis; parameters generated from the HMMs drive excitation generation and a synthesis filter to produce SYNTHESIZED SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Diagram: generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) or white noise (unvoiced) to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
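A toy sketch of this source-filter reconstruction, assuming a naive pulse-train/noise excitation and an LPC-style all-pole synthesis filter (function names are illustrative; real systems use MLSA or similar filters as listed on the next slide):

```python
import numpy as np

def excitation(f0, fs, n):
    # Pulse train at f0 Hz for a voiced frame; white noise when unvoiced (f0 == 0)
    if f0 > 0:
        e = np.zeros(n)
        e[::int(fs / f0)] = 1.0
        return e
    return np.random.randn(n)

def all_pole_filter(e, a):
    # x[n] = e[n] - sum_k a[k] * x[n-1-k]  (LPC synthesis filter 1/A(z))
    x = np.zeros(len(e))
    for n in range(len(e)):
        x[n] = e[n] - sum(a[k] * x[n - 1 - k] for k in range(min(len(a), n)))
    return x
```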
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Diagram: adaptive training of an average-voice model over training speakers; adaptation of that model to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Diagram: a new HMM set λ′ interpolated among representative HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
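As one minimal sketch of interpolating the Gaussian output distributions of several HMM sets with ratios I(λ′, λ_k): a simple weighted combination of means and variances. This is just one of several interpolation schemes covered by [14–17], and all names below are illustrative:

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    # Weighted combination of representative models' Gaussians:
    # normalized ratios play the role of I(lambda', lambda_k)
    r = np.asarray(ratios, dtype=float)
    r = r / r.sum()
    mu = sum(ri * m for ri, m in zip(r, means))
    var = sum(ri * v for ri, v in zip(r, variances))  # one simple choice
    return mu, var
```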
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Diagram: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction:
  harmonic effects often cause problems

[Plot: power (dB) vs. frequency (0–8 kHz) showing harmonic structure]

• Phase:
  important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics:
  statistics do not vary within an HMM state
• Conditional independence assumption:
  state output probability depends only on the current state
• Weak duration modeling:
  state duration probability decreases exponentially with time

None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training:
  learn the relationship between linguistic & acoustic features
• Synthesis:
  map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Diagram: decision trees with yes/no splits partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Diagram: a DNN with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
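The frame-level mapping can be sketched as a plain feed-forward pass with sigmoid hidden layers and a linear output layer, as in the experimental setup later in the deck (an illustrative sketch, not the exact trained model; all names are assumptions):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    # Sigmoid hidden layers, linear output: maps linguistic features x
    # to acoustic feature statistics y, replacing trees + GMMs
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h @ weights[-1] + biases[-1]
```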
Framework
[Diagram: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) for frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral, excitation, and V/UV features) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame count]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? — no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training/test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Plot: 5-th mel-cepstrum coefficient vs. frame (0–500) for natural speech, HMM (α = 1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Plot: data samples in the (y₁, y₂) plane vs. the NN prediction]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Diagram: a 1-dimensional, 2-mixture MDN; the network's output activations z₁ … z₆ parameterize the mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x)]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

• Weights → softmax activation function:
  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)
• Means → linear activation function:
  μ₁(x) = z₃,  μ₂(x) = z₄
• Variances → exponential activation function:
  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
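The output-layer computation on this slide can be sketched directly: split the activations z into weights (softmax), means (linear), and standard deviations (exponential). Names are illustrative, and a max-subtraction is added to the softmax for numerical stability:

```python
import numpy as np

def mdn_params(z, n_mix):
    # z holds [weight logits | means | log std devs], n_mix entries each
    zw, zmu, zsig = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())   # softmax over mixture weights
    w /= w.sum()
    return w, zmu, np.exp(zsig)  # means linear, std devs via exp
```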
DMDN-based SPSS [27]
[Diagram: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁ … x_T → DMDN outputs per frame: w₁(x_t), w₂(x_t), μ₁(x_t), μ₂(x_t), σ₁(x_t), σ₂(x_t) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes/syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Diagram: an RNN unrolled over time; recurrent connections carry the hidden state across frames from inputs x_{t−1}, x_t, x_{t+1} to outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Diagram: an LSTM block with memory cell c_t; input gate (write), output gate (read), and forget gate (reset), each a sigmoid of (x_t, h_{t−1}) plus a bias; tanh input and output nonlinearities produce h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
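One LSTM step with the three gates above can be sketched as follows — a basic formulation without peephole connections or the projection layer used later in the deck; parameter names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    # One step: gates are sigmoids of (x_t, h_{t-1}); the memory cell is linear
    z = np.concatenate([x, h_prev])
    i = sigmoid(p['Wi'] @ z + p['bi'])  # input gate (write)
    f = sigmoid(p['Wf'] @ z + p['bf'])  # forget gate (reset)
    o = sigmoid(p['Wo'] @ z + p['bo'])  # output gate (read)
    c = f * c_prev + i * np.tanh(p['Wc'] @ z + p['bc'])
    h = o * np.tanh(c)
    return h, c
```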
LSTM-based SPSS [33 34]
[Diagram: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁ … x_T → LSTM → outputs y₁ … y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train/dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79
Structure of state-output (observation) vectors
∆ Mel-cepstral coefficients
log F0
∆ log F0
∆∆ log F0
Spectrum part
Excitation part
∆ct
∆2ct
pt
δpt
δ2pt
ct
ot
∆∆ Mel-cepstral coefficients
Mel-cepstral coefficients
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
Spe
ctru
mE
xcita
tion
∆ct
∆2ct
pt
δ pt
δ 2pt
ct
ot
Stream
1
o1t
o2t
o3t
o4t
bj(ot)
b1j(o
1t )
b2j(o
2t )
b3j(o
3t )
b4j(o
4t )
bj(ot)Sprod
s=1
(bsj(o
st ))ws=
23
4
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
Compute variancefloor (HCompV)
Initialize CI-HMMs bysegmental k-means (HInit)
Reestimate CI-HMMs byEM algorithm
(HRest amp HERest)
Copy CI-HMMs to CD-HMMs (HHEd CL)
Reestimate CD-HMMs byEM algorithm (HERest)
Decision tree-basedclustering (HHEd TB)
Reestimate CD-HMMsby EM algorithm (HERest)
Untie parameter tyingstructure (HHEd UT)
monophone(context-independent CI)
fullcontext(context-dependent CD)
EstimatedHMMs
data amp labels
Estimate CD-dur modelsfrom FB stats (HERest)
Decision tree-basedclustering (HHEd TB)
Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
bull preceding succeeding two phonemes
bull Position of current phoneme in current syllable
bull of phonemes at preceding current succeeding syllable
bull accent stress of preceding current succeeding syllable
bull Position of current syllable in current word
bull of preceding succeeding stressed accented syllables in phrase
bull of syllables from previous to next stressed accented syllable
bull Guess at part of speech of preceding current succeeding word
bull of syllables in preceding current succeeding word
bull Position of current word in current phrase
bull of preceding succeeding content words in current phrase
bull of words from previous to next content word
bull of syllables in preceding current succeeding phrase
Impossible to have all possible modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
L=voice
L=w R=silence yes
yes yes
no
no no
w-a+sil w-a+sh gy-a+pau
g-a+silgy-a+silw-a+t
k-a+b
t-a+n
leaf nodes
yes no yes no
synthesized states
R=silence L=gy
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
Decision treesfor
mel-cepstrum
Decision treesfor F0
Spectrum amp excitation can have different context dependencyrarr Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
i
1 2 3 4 5 6 7 T=8t
t0 t1
Probability to enter state i at t0 then leave at t1 + 1
χt0t1(i) propsum
j 6=i
αt0minus1(j)ajiat1minust0ii
t1prod
t=t0
bi(ot)sum
k 6=i
aikbk(ot1+1)βt1+1(k)
rarr estimate state duration modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
State durationmodel
Decision treesfor
mel-cepstrum
Decision treesfor F0
Decision treefor state durmodels
HMM
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

    o_t = [c_t^T, Δc_t^T]^T,  Δc_t = c_t − c_{t−1}

(the static features c_t are M-dimensional, so o_t is 2M-dimensional)

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

    [ ⋮, c_{t−1}, Δc_{t−1}, c_t, Δc_t, c_{t+1}, Δc_{t+1}, ⋮ ]^T = W [ ⋮, c_{t−2}, c_{t−1}, c_t, c_{t+1}, ⋮ ]^T

where W is a sparse band matrix whose rows are [⋯ 0 I 0 0 ⋯] for each static feature and [⋯ −I I 0 0 ⋯] for each delta feature.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

    ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

    p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

    W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
    W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂

[Figure: the linear system visualized with sparse banded matrices — the band-structured matrix W^T Σ_q̂^{−1} W multiplying c = (c₁, c₂, …, c_T) on the left equals W^T Σ_q̂^{−1} applied to the mean sequence (μ_q̂₁, μ_q̂₂, …, μ_q̂_T) on the right]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
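The banded linear system on this slide, WᵀΣ⁻¹Wc = WᵀΣ⁻¹μ, can be sketched in a few lines of numpy. This is a toy 1-dimensional illustration with static + delta features (Δc_t = c_t − c_{t−1}, as defined earlier), using a dense solve rather than the efficient banded/recursive solvers used in practice; the `mlpg` helper and the step-shaped mean sequence are assumptions for the demo, not the author's implementation.

```python
import numpy as np

def mlpg(mu, var):
    """Maximum-likelihood parameter generation, 1-dim static + delta sketch.

    mu, var: (T, 2) arrays of per-frame means/variances for [c_t, dc_t],
    where dc_t = c_t - c_{t-1}. Returns the static trajectory c solving
    W^T Sigma^-1 W c = W^T Sigma^-1 mu.
    """
    T = mu.shape[0]
    # Build W: maps (c_1 .. c_T) -> (c_1, dc_1, ..., c_T, dc_T)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0          # static row: c_t
        W[2 * t + 1, t] = 1.0      # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / var.reshape(-1))   # diagonal Sigma^-1
    A = W.T @ P @ W                      # band-structured normal equations
    b = W.T @ P @ mu.reshape(-1)
    return np.linalg.solve(A, b)

# Step-wise static means (as on the "w/o dynamic features" slide) plus
# zero-mean delta constraints -> the generated trajectory is smoothed.
mu = np.zeros((10, 2))
mu[:5, 0], mu[5:, 0] = 0.0, 1.0    # static means jump from 0 to 1
var = np.ones((10, 2))
var[:, 1] = 0.25                   # moderately tight delta variances
c = mlpg(mu, var)
```

Because the delta rows couple neighboring frames, the solution interpolates smoothly through the step instead of reproducing the step-wise mean sequence.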
Generated speech parameter trajectory
[Figure: generated speech parameter trajectory — static and dynamic per-state means/variances and the smooth generated static trajectory c]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system. Training part: speech database → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: text → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → synthesized speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: source-filter waveform reconstruction — generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
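The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched directly: a pulse train for voiced frames, white noise for unvoiced ones, convolved with an impulse response. A real system derives h(n) from the generated cepstrum/LSP (e.g. via an MLSA filter); the decaying-exponential h used here is just an assumed stand-in for illustration.

```python
import numpy as np

fs = 16000
f0 = 125.0                        # one fixed F0 value for the voiced part
period = int(fs / f0)

# Excitation e(n): white noise (unvoiced), then a pulse train (voiced)
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(fs // 10) * 0.05   # 100 ms of noise
voiced = np.zeros(fs // 10)                        # 100 ms of pulses
voiced[::period] = 1.0
e = np.concatenate([unvoiced, voiced])

# h(n): toy impulse response of a linear time-invariant synthesis filter
h = 0.95 ** np.arange(200)

# x(n) = h(n) * e(n): convolution, truncated to the excitation length
x = np.convolve(e, h)[: len(e)]
```

The voiced half of x shows one decaying response per pitch pulse, spaced `period` samples apart, while the unvoiced half is filtered noise.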
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]
[Figure: average-voice model — adaptive training on training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ₁–λ₄ with ratios I(λ′, λ₁), …, I(λ′, λ₄) yielding a new HMM set λ′]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
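The simplest form of the interpolation above is a ratio-weighted combination of the corresponding Gaussian parameters of the representative HMM sets; the published schemes [14–17] differ in exactly what is combined and how, so treat this numpy sketch (with made-up means and ratios) only as the basic idea.

```python
import numpy as np

# Mean vectors of one corresponding state in four representative HMM sets
mus = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [2.0, 2.0],
                [-1.0, 0.5]])

# Interpolation ratios I(lambda', lambda_k); they sum to 1
ratios = np.array([0.4, 0.3, 0.2, 0.1])

# Interpolated mean of the new voice lambda'
mu_interp = ratios @ mus
```

Sliding the ratios between the sets moves the synthesized voice continuously between the corresponding speakers/styles.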
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model a mix of voiced/unvoiced sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum in dB over 0–8 kHz]
• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Linguistic rarr acoustic mapping
• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
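Linguistic features of this kind are typically turned into a fixed-length vector of binary (one-hot) and numeric values before being fed to a model. The feature names, toy phone inventory, and `encode` helper below are assumptions for illustration only — real systems use the ~50 feature types listed above.

```python
# Toy phone inventory (assumed; a real system has a full phone set)
PHONES = ["sil", "a", "t", "n"]

def encode(phone, stressed, num_syllables, pos_in_phrase):
    """Encode categorical features as one-hot binaries and append
    numeric features, giving one fixed-length input vector."""
    onehot = [1.0 if phone == p else 0.0 for p in PHONES]  # binary features
    numeric = [float(stressed), float(num_syllables), float(pos_in_phrase)]
    return onehot + numeric

# e.g. phone "a", stressed, in a 3-syllable word, halfway through the phrase
x = encode("a", stressed=1, num_syllables=3, pos_in_phrase=0.5)
```

Each frame's input vector concatenates such encodings for the current and surrounding linguistic units.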
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with input linguistic features x, hidden layers h₁–h₃, and output acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
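The linguistic-to-acoustic mapping of the DNN can be sketched as a plain feed-forward pass — sigmoid hidden layers and a linear output layer, matching the architecture described on the experimental-setup slide. The sizes and random weights below are placeholders; a trained system would learn them by backpropagation on frame-level pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_forward(x, weights, biases):
    """Forward pass: linguistic feature vector x -> acoustic feature vector.
    Sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]           # linear output layer

# Toy sizes: 7 linguistic inputs, two hidden layers, 3 acoustic outputs
sizes = [7, 16, 16, 3]
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

y = dnn_forward(rng.standard_normal(7), weights, biases)
```

At synthesis time this pass is run once per frame, and the outputs (with fixed global variances) feed the same parameter generation algorithm as in the HMM case.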
Framework
[Figure: DNN-based framework. Text → text analysis → input feature extraction → input features, including binary & numeric features (duration feature, frame position feature), at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → duration prediction, parameter generation, waveform synthesis → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum trajectories over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

    HMM (α)      DNN (layers × units)    Neutral    p value    z value
    15.8 (16)    38.5 (4 × 256)          45.7       < 10⁻⁶     −9.9
    16.1 (4)     27.2 (4 × 512)          56.8       < 10⁻⁶     −5.1
    12.7 (1)     36.6 (4 × 1024)         50.7       < 10⁻⁶     −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples from a bimodal distribution over (y₁, y₂) vs. the NN prediction falling between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
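The unimodality problem is easy to demonstrate numerically: for one-to-many data, the MSE-optimal prediction is the conditional mean, which can land between the modes where no real data lies. The bimodal toy targets below are an assumption for the demo.

```python
import numpy as np

# One-to-many data: for the same input, targets come from two modes (+1 / -1)
rng = np.random.default_rng(1)
targets = np.concatenate([rng.normal(+1.0, 0.1, 500),
                          rng.normal(-1.0, 0.1, 500)])

# For a constant input, the MSE-optimal constant prediction is the sample
# mean -- near 0, i.e. far from both modes the data actually occupies.
pred = float(targets.mean())
```

A mixture density output layer avoids this by predicting a full mixture distribution instead of a single point estimate.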
Mixture density network [26]
[Figure: 1-dim, 2-mixture MDN — network outputs z₁ … z₆ mapped to mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions:

    z_j = Σ_{i=1}^{4} h_i w_{ij}

For the 1-dim, 2-mix MDN:

    w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),   w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)
    μ₁(x) = z₃,   μ₂(x) = z₄
    σ₁(x) = exp(z₅),   σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
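The 1-dim, 2-mixture output layer above maps six network outputs z₁ … z₆ to mixture parameters via softmax, linear, and exponential activations; training then minimizes the negative log-likelihood of the target under the predicted mixture. A minimal numpy sketch (helper names are mine, not from the slides):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 outputs of a 1-dim, 2-mixture MDN to mixture parameters."""
    w = np.exp(z[:2]) / np.sum(np.exp(z[:2]))   # softmax -> weights
    mu = z[2:4]                                  # linear -> means
    sigma = np.exp(z[4:6])                       # exponential -> std devs
    return w, mu, sigma

def mdn_nll(y, z):
    """Negative log-likelihood of target y under the predicted mixture."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) \
             / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(np.sum(comp))

# z = (z1..z6): equal weights, means -1 and +1, unit std devs
w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
```

Because the exponential keeps σ positive and the softmax keeps the weights on the simplex, the network can be trained end-to-end with this loss by ordinary gradient descent.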
DMDN-based SPSS [27]
[Figure: DMDN-based framework — text → text analysis → input feature extraction → duration prediction → per-frame inputs x₁, x₂, …, x_T → DMDN outputs mixture weights w₁(x_t), w₂(x_t), means μ₁(x_t), μ₂(x_t), and variances σ₁(x_t), σ₂(x_t) for each frame → parameter generation → waveform synthesis → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

    HMM            1 mix     3.537 ± 0.113
                   2 mix     3.397 ± 0.115
    DNN            4×1024    3.635 ± 0.127
                   5×1024    3.681 ± 0.109
                   6×1024    3.652 ± 0.108
                   7×1024    3.637 ± 0.129
    DMDN (4×1024)  1 mix     3.654 ± 0.117
                   2 mix     3.796 ± 0.107
                   4 mix     3.766 ± 0.113
                   8 mix     3.805 ± 0.113
                   16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN with recurrent connections in the hidden layer, unrolled over inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with tanh input squashing; sigmoid input gate (write), forget gate (reset), and output gate (read), each fed by x_t and h_{t−1} with biases b_i, b_f, b_o, b_c; output h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
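The block diagram above corresponds to the standard LSTM cell update: input, forget, and output gates around a linear memory cell. A minimal numpy sketch of one time step (the tiny sizes and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell: gates computed from [x, h_prev]."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b   # all gate pre-activations
    i = sigmoid(z[0 * n:1 * n])               # input gate  (write)
    f = sigmoid(z[1 * n:2 * n])               # forget gate (reset)
    o = sigmoid(z[2 * n:3 * n])               # output gate (read)
    g = np.tanh(z[3 * n:4 * n])               # candidate cell update
    c = f * c_prev + i * g                    # linear memory cell
    h = o * np.tanh(c)                        # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(5):                            # run over a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the cell update c = f·c_prev + i·g is linear, gradients can flow across many time steps, which is what lets the network model long-range context.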
LSTM-based SPSS [33 34]
[Figure: LSTM-based framework — text → text analysis → input feature extraction → duration prediction → per-frame inputs x₁, x₂, …, x_T → LSTM → outputs y₁, y₂, …, y_T → parameter generation → waveform synthesis → speech]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

    DNN                  LSTM                 Stats
    w/ Δ     w/o Δ       w/ Δ     w/o Δ       Neutral    z       p
    50.0     14.2        –        –           35.8       12.0    < 10⁻¹⁰
    –        –           30.2     15.6        54.2       5.1     < 10⁻⁶
    15.8     –           34.0     –           50.2       −6.2    < 10⁻⁹
    28.4     –           –        33.6        38.0       −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Structure of state-output (observation) vectors
∆ Mel-cepstral coefficients
log F0
∆ log F0
∆∆ log F0
Spectrum part
Excitation part
∆ct
∆2ct
pt
δpt
δ2pt
ct
ot
∆∆ Mel-cepstral coefficients
Mel-cepstral coefficients
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
Hidden Markov model (HMM)
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure
Spe
ctru
mE
xcita
tion
∆ct
∆2ct
pt
δ pt
δ 2pt
ct
ot
Stream
1
o1t
o2t
o3t
o4t
bj(ot)
b1j(o
1t )
b2j(o
2t )
b3j(o
3t )
b4j(o
4t )
bj(ot)Sprod
s=1
(bsj(o
st ))ws=
23
4
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
Training process
Compute variancefloor (HCompV)
Initialize CI-HMMs bysegmental k-means (HInit)
Reestimate CI-HMMs byEM algorithm
(HRest amp HERest)
Copy CI-HMMs to CD-HMMs (HHEd CL)
Reestimate CD-HMMs byEM algorithm (HERest)
Decision tree-basedclustering (HHEd TB)
Reestimate CD-HMMsby EM algorithm (HERest)
Untie parameter tyingstructure (HHEd UT)
monophone(context-independent CI)
fullcontext(context-dependent CD)
EstimatedHMMs
data amp labels
Estimate CD-dur modelsfrom FB stats (HERest)
Decision tree-basedclustering (HHEd TB)
Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
bull preceding succeeding two phonemes
bull Position of current phoneme in current syllable
bull of phonemes at preceding current succeeding syllable
bull accent stress of preceding current succeeding syllable
bull Position of current syllable in current word
bull of preceding succeeding stressed accented syllables in phrase
bull of syllables from previous to next stressed accented syllable
bull Guess at part of speech of preceding current succeeding word
bull of syllables in preceding current succeeding word
bull Position of current word in current phrase
bull of preceding succeeding content words in current phrase
bull of words from previous to next content word
bull of syllables in preceding current succeeding phrase
Impossible to have all possible modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
L=voice
L=w R=silence yes
yes yes
no
no no
w-a+sil w-a+sh gy-a+pau
g-a+silgy-a+silw-a+t
k-a+b
t-a+n
leaf nodes
yes no yes no
synthesized states
R=silence L=gy
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
Decision treesfor
mel-cepstrum
Decision treesfor F0
Spectrum amp excitation can have different context dependencyrarr Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
i
1 2 3 4 5 6 7 T=8t
t0 t1
Probability to enter state i at t0 then leave at t1 + 1
χt0t1(i) propsum
j 6=i
αt0minus1(j)ajiat1minust0ii
t1prod
t=t0
bi(ot)sum
k 6=i
aikbk(ot1+1)βt1+1(k)
rarr estimate state duration modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
State durationmodel
Decision treesfor
mel-cepstrum
Decision treesfor F0
Decision treefor state durmodels
HMM
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: pulse train (voiced) and white noise (unvoiced) switched to form the excitation e(n)]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) up to 8 kHz showing harmonic structure]

• Phase
  Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectra, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    Adaptation, Interpolation / eigenvoice / CAT / multiple regression
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree with yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
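At its core, the model above is a frame-level regression from a linguistic feature vector to an acoustic feature vector trained with an MSE criterion. A toy numpy sketch of that mapping (the one-hidden-layer net, sizes, and data are made up for illustration; this is not the architecture evaluated later in the slides):

```python
# Toy frame-level regression: "linguistic" vector x -> "acoustic" vector y.
import numpy as np

rng = np.random.default_rng(0)

def init(d_in, d_hid, d_out):
    return {"W1": rng.normal(0, 0.1, (d_in, d_hid)), "b1": np.zeros(d_hid),
            "W2": rng.normal(0, 0.1, (d_hid, d_out)), "b2": np.zeros(d_out)}

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h @ p["W2"] + p["b2"], h

def mse_step(p, x, y, lr=0.1):
    yhat, h = forward(p, x)
    loss = ((yhat - y) ** 2).mean()
    g = 2.0 * (yhat - y) / len(x)                 # gradient w.r.t. yhat
    gh = (g @ p["W2"].T) * (1.0 - h ** 2)         # backprop through tanh
    p["W2"] -= lr * h.T @ g
    p["b2"] -= lr * g.sum(axis=0)
    p["W1"] -= lr * x.T @ gh
    p["b1"] -= lr * gh.sum(axis=0)
    return loss

x = rng.normal(size=(64, 4))                      # toy "linguistic" inputs
y = x @ rng.normal(size=(4, 2))                   # toy "acoustic" targets
p = init(4, 16, 2)
losses = [mse_step(p, x, y) for _ in range(300)]
```

Because the loss is squared error, the trained network approximates the conditional mean of y given x — which is exactly the limitation the mixture density extension addresses later.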
Framework

[Figure: TEXT → text analysis → input feature extraction → input features (binary & numeric, duration and frame-position features) for frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame positions]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? → no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over ~500 frames; natural speech vs. HMM (α = 1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)     36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples of a one-to-many mapping vs. the NN prediction (the conditional mean)]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
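The unimodality problem can be seen with a two-line toy example (numbers purely illustrative, not from the slides): for a bimodal target, the MSE-optimal constant prediction is the conditional mean — a value the data never actually takes.

```python
# Bimodal targets for the same input: modes at -1 and +1 with equal probability.
import numpy as np

y = np.array([-1.0, 1.0] * 500)          # the "one-to-many" target distribution
pred = y.mean()                          # what an MSE-trained net converges to
mse_at_mean = ((y - pred) ** 2).mean()   # MSE of predicting the mean (0.0)
mse_at_mode = ((y - 1.0) ** 2).mean()    # MSE of predicting one of the modes
```

Predicting the mean (0.0) minimizes MSE even though no sample is near 0.0; predicting a mode doubles the MSE. A mixture density output layer avoids this by modeling both modes explicitly.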
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; network outputs z1 … z6 become w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Weights → Softmax activation function
Means → Linear activation function
Variances → Exponential activation function

Inputs of the activation functions:

z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
µ1(x) = z3                                µ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
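The output-layer recipe above can be sketched directly: one linear layer emits 3·M numbers per frame, and softmax / identity / exp turn them into valid mixture parameters. The 1-dim, 2-mix layout follows the slide; the helper names are my own.

```python
# Sketch of a mixture density output layer for a 1-dim, M-component GMM.
import numpy as np

def mdn_params(z, M):
    """Split raw activations z (length 3*M) into GMM weights/means/sigmas."""
    logits, mu, log_sigma = z[:M], z[M:2 * M], z[2 * M:]
    w = np.exp(logits - logits.max())
    w = w / w.sum()                      # softmax -> weights sum to 1
    sigma = np.exp(log_sigma)            # exp -> strictly positive
    return w, mu, sigma

def mdn_nll(z, y, M):
    """Negative log-likelihood of target y under the predicted mixture."""
    w, mu, sigma = mdn_params(z, M)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])   # two equal components at -1, +1
w, mu, sigma = mdn_params(z, 2)
nll = mdn_nll(z, 0.0, 2)
```

Training minimizes this negative log-likelihood instead of MSE, so the network can keep both modes of a one-to-many mapping rather than collapsing to their mean.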
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN; for each frame x_t the network outputs w1(x_t), w2(x_t), µ1(x_t), µ2(x_t), σ1(x_t), σ2(x_t) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM       1 mix     3.537 ± 0.113
          2 mix     3.397 ± 0.115
DNN       4×1024    3.635 ± 0.127
          5×1024    3.681 ± 0.109
          6×1024    3.652 ± 0.108
          7×1024    3.637 ± 0.129
DMDN      1 mix     3.654 ± 0.117
(4×1024)  2 mix     3.796 ± 0.107
          4 mix     3.766 ± 0.113
          8 mix     3.805 ± 0.113
          16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: unrolled RNN; recurrent connections carry hidden state across x_{t−1}, x_t, x_{t+1} to y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; x_t and h_{t−1} feed the input gate (write), forget gate (reset) and output gate (read) via sigmoids, and the memory cell c_t via tanh; h_t is the gated output]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
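One step of the cell above can be written out in a few lines. The weight layout is an illustrative assumption (all four gate pre-activations come from a single linear map of the concatenated [x_t; h_{t−1}]); the gate equations themselves follow the standard LSTM.

```python
# Sketch of a single LSTM time step: gates around a linear memory cell.
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps [x; h_prev] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(c_prev)
    i = sigm(z[:n])              # input gate:  write
    f = sigm(z[n:2 * n])         # forget gate: reset
    o = sigm(z[2 * n:3 * n])     # output gate: read
    g = np.tanh(z[3 * n:])       # candidate cell input
    c = f * c_prev + i * g       # linear memory cell update
    h = o * np.tanh(c)           # gated output
    return h, c

n_in, n_cell = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (4 * n_cell, n_in + n_cell))
b = np.zeros(4 * n_cell)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_cell), np.zeros(n_cell), W, b)
```

The additive cell update `f * c_prev + i * g` is what lets gradients survive over long time spans, unlike the plain RNN's repeated squashing.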
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x1, x2, …, xT to y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2       5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, …, oT aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
Multi-stream HMM structure

[Figure: observation vector o_t split into streams: spectrum stream o_t^1 = [c_t, ∆c_t, ∆²c_t] and excitation streams o_t^2 = p_t, o_t^3 = δp_t, o_t^4 = δ²p_t, each with its own distribution b_sj(o_t^s)]

b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
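The stream-weighted output probability above is usually computed in the log domain, where the product of powered stream likelihoods becomes a weighted sum. A small sketch, assuming diagonal-Gaussian streams and made-up stream weights:

```python
# Sketch of b_j(o_t) = prod_s b_sj(o_t^s)^{w_s} in the log domain.
import numpy as np

def log_gauss(o, mu, var):
    """Log density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def multistream_logprob(streams, models, weights):
    """streams: list of observation vectors; models: list of (mu, var)."""
    return sum(w * log_gauss(o, mu, var)
               for o, (mu, var), w in zip(streams, models, weights))

o = [np.zeros(2), np.zeros(1)]                  # e.g. spectrum and log F0 streams
models = [(np.zeros(2), np.ones(2)), (np.zeros(1), np.ones(1))]
lp = multistream_logprob(o, models, weights=[1.0, 1.0])
```

With unit stream weights this reduces to an ordinary joint Gaussian log-likelihood over all stream dimensions.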
Training process

data & labels
→ Compute variance floor (HCompV)
→ Initialize CI-HMMs by segmental k-means (HInit)        [monophone (context-independent, CI)]
→ Reestimate CI-HMMs by EM algorithm (HRest & HERest)
→ Copy CI-HMMs to CD-HMMs (HHEd CL)                      [fullcontext (context-dependent, CD)]
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Untie parameter tying structure (HHEd UT)
→ Estimated HMMs
→ Estimate CD duration models from FB stats (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]

[Figure: binary decision tree over context questions (L=voice?, L="w"?, R=silence?, L="gy"?) routing context-dependent states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n to leaf nodes; leaf distributions are shared by the synthesized states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]

[Figure: state i occupied from t0 to t1 on a lattice of T = 8 frames]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]

[Figure: Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence

[Figure: 3-state left-to-right HMM; observation sequence O = o1 … oT aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3; state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state means and variances]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = W c:

⎡ …        ⎤   ⎡  …  …  …  … ⎤
⎢ c_{t−1}  ⎥   ⎢  0  I  0  0 ⎥ ⎡ c_{t−2} ⎤
⎢ ∆c_{t−1} ⎥   ⎢ −I  I  0  0 ⎥ ⎢ c_{t−1} ⎥
⎢ c_t      ⎥ = ⎢  0  0  I  0 ⎥ ⎢ c_t     ⎥
⎢ ∆c_t     ⎥   ⎢  0 −I  I  0 ⎥ ⎢ c_{t+1} ⎥
⎢ c_{t+1}  ⎥   ⎢  0  0  0  I ⎥ ⎣ …       ⎦
⎢ ∆c_{t+1} ⎥   ⎢  0  0 −I  I ⎥
⎣ …        ⎦   ⎣  …  …  …  … ⎦
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
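For the simple delta ∆c_t = c_t − c_{t−1} (1-dim static features, the predecessor of the first frame taken as 0), W can be built and checked in a few lines. This is a sketch, not the exact windowing of real systems, which typically also stacks ∆² rows:

```python
# Build the W matrix mapping a static sequence c to o = [c_t, delta c_t] pairs.
import numpy as np

def build_W(T):
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0             # static row: c_t
        W[2 * t + 1, t] = 1.0         # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([1.0, 3.0, 2.0])
o = build_W(3) @ c                    # [c_1, Δc_1, c_2, Δc_2, c_3, Δc_3]
```

For c = [1, 3, 2] this stacks statics and first differences: o = [1, 1, 3, 2, 2, −1].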
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = W c

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} µ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
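The closed-form solution can be sketched with numpy: build a small delta-only W, plug in toy step-wise state means with unit variances, and solve the normal equations for the static trajectory (all numbers are illustrative):

```python
# Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static trajectory c.
import numpy as np

def generate_trajectory(mu, var, W):
    """Most likely static sequence c under o = W c, o ~ N(mu, diag(var))."""
    P = W.T * (1.0 / var)                 # W^T Sigma^{-1} for diagonal Sigma
    return np.linalg.solve(P @ W, P @ mu)

T = 3
W = np.zeros((2 * T, T))                  # rows alternate: c_t, then c_t - c_{t-1}
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # step-wise static means, zero deltas
var = np.ones(2 * T)
c = generate_trajectory(mu, var, W)
```

The solution rises smoothly (3/13, 9/13, 11/13) toward the step-wise means [0, 1, 1] instead of jumping between them, which is exactly the smoothing the dynamic-feature constraint buys.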
Speech parameter generation algorithm [9]

[Figure: banded structure of W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} µ_q; the band comes from the (I, 0) and (−I, I) rows of W, with per-state means µ_{q,1} … µ_{q,T} on the right-hand side and the static trajectory c_1 … c_T as the unknown]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory

[Figure: static and dynamic feature tracks; the generated trajectory c follows the state means within the state variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Figure: Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
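The simple excitation scheme in the figure is easy to sketch: voiced frames get a pulse train at the F0-derived period, unvoiced frames get white noise. Frame length, sample rate, and F0 values below are made up for the example:

```python
# Sketch of pulse/noise excitation generation from per-frame F0 (0 = unvoiced).
import numpy as np

def make_excitation(f0_per_frame, frame_len=80, fs=16000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    e = []
    next_pulse = 0                          # sample index of the next pulse
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_len
        frame = np.zeros(frame_len)
        if f0 > 0:                          # voiced: pulse train at period fs/F0
            period = int(fs / f0)
            while next_pulse < start + frame_len:
                if next_pulse >= start:
                    frame[next_pulse - start] = 1.0
                next_pulse += period
        else:                               # unvoiced: white noise
            frame = rng.normal(0, 1, frame_len)
            next_pulse = start + frame_len  # restart pulse phase after unvoiced
        e.append(frame)
    return np.concatenate(e)

e = make_excitation([200.0, 200.0, 0.0])    # two voiced frames, then unvoiced
```

Feeding e(n) through the synthesis filter h(n) built from the generated spectral parameters then yields the output waveform x(n) = h(n) ∗ e(n).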
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework

[Figure: DNN-based SPSS pipeline — TEXT → text analysis → input feature extraction (binary & numeric input features, plus duration and frame-position features from duration prediction, at frames 1 … T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrated feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ Feature extraction integrated into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? — no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples in (y1, y2) space with the NN prediction falling between the modes]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
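The unimodality problem above can be seen with almost no machinery: for one-to-many data, the value minimizing MSE is the conditional mean, which lands between the modes rather than on either of them. A toy illustration (the two-mode data here is invented for the demonstration):

```python
import numpy as np

# One-to-many data: for the same input, outputs cluster around +1 or -1.
# An MSE-trained predictor outputs a single value per input, and the MSE
# minimizer is the conditional mean -- near 0 here, far from both modes.
rng = np.random.default_rng(0)
targets = np.concatenate([np.full(500, 1.0), np.full(500, -1.0)])
targets += 0.05 * rng.standard_normal(1000)

mse_optimal = targets.mean()  # the constant minimizing mean squared error
```

A mixture density output layer sidesteps this by predicting the full conditional distribution (weights, means, variances) instead of a single point.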
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network's last hidden layer h1 … h4 feeds six output units z1 … z6, which parameterize weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function: zj = Σ_{i=1}^{4} hi wij

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)
µ1(x) = z3                               µ2(x) = z4
σ1(x) = exp(z5)                          σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
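The three activation choices on this slide can be written out directly. A minimal sketch of the 1-dim, 2-mix MDN output layer: softmax for weights, identity for means, exponential for standard deviations (function name and the sample pre-activations are illustrative):

```python
import numpy as np

def mdn_output_layer(z):
    """Turn the 6 pre-activations z1..z6 of a 1-dim, 2-mixture MDN into
    GMM parameters: softmax -> weights, identity -> means,
    exponential -> positive standard deviations."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # mixture weights, sum to 1
    mu = z[2:4]                              # component means (linear)
    sigma = np.exp(z[4:6])                   # std. deviations, always > 0
    return w, mu, sigma

w, mu, sigma = mdn_output_layer([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
```

The softmax and exponential guarantee valid GMM parameters (weights summing to one, strictly positive variances) for any real-valued network outputs, which is why no constraints are needed during training.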
DMDN-based SPSS [27]

[Figure: deep MDN unrolled over frames — TEXT → text analysis → input feature extraction → duration prediction → DMDN outputs w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt) for each frame input x1, x2, … xT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: RNN with recurrent connections in the hidden layer, unrolled over inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with tanh squashing on input and output; input gate (write), forget gate (reset), and output gate (read), each a sigmoid of x_t and h_{t−1} with biases b_i, b_f, b_o, b_c, gating the cell multiplicatively to produce h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
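The gate structure in the block diagram corresponds to the standard LSTM forward equations, which can be sketched in a few lines. This is a generic LSTM step under the usual formulation, not the exact parameterization used in the talk's systems; sizes and parameter names are toy/illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One forward step of a standard LSTM block: sigmoid gates control
    writing to (input gate), resetting (forget gate) and reading from
    (output gate) the linear memory cell, with tanh squashing."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["bi"])  # input gate (write)
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["bf"])  # forget gate (reset)
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["bo"])  # output gate (read)
    g = np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])  # cell input
    c = f * c_prev + i * g                                   # linear memory cell
    h = o * np.tanh(c)                                       # block output
    return h, c

# Toy sizes: 3-dim input, 4-unit LSTM block
rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
p = {}
for gate in "ifoc":
    p["Wx" + gate] = 0.1 * rng.standard_normal((n_cell, n_in))
    p["Wh" + gate] = 0.1 * rng.standard_normal((n_cell, n_cell))
    p["b" + gate] = np.zeros(n_cell)

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_cell), np.zeros(n_cell), p)
```

Because the cell update `c = f * c_prev + i * g` is additive rather than squashed through a nonlinearity, information can persist over many frames when the forget gate stays near one, which is exactly the long-range-context ability the previous slide says plain RNNs lack.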
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping inputs x1, x2, … xT to outputs y1, y2, … yT frame by frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z value   p value
50.0       14.2        –           –            35.8      12.0      < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1       < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2      < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5      0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Multi-stream HMM structure

[Figure: observation vector ot split into S = 4 streams — stream 1 (spectrum): o¹t = [cᵀt, ∆cᵀt, ∆²cᵀt]ᵀ; streams 2–4 (excitation): o²t = pt, o³t = δpt, o⁴t = δ²pt]

bj(ot) = Π_{s=1}^{S} ( b^s_j(o^s_t) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
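The state-output probability above is just a weighted product of per-stream likelihoods. A minimal sketch (the function name and the toy likelihood/weight values are illustrative):

```python
import numpy as np

def multi_stream_likelihood(stream_likelihoods, stream_weights):
    """State-output probability of a multi-stream HMM:
    b_j(o_t) = prod over streams s of b_j^s(o_t^s) ** w_s."""
    b = 1.0
    for bs, ws in zip(stream_likelihoods, stream_weights):
        b *= bs ** ws
    return b

# Toy example: 4 streams (spectrum + 3 excitation streams), equal weights
b = multi_stream_likelihood([0.9, 0.5, 0.6, 0.7], [1.0, 1.0, 1.0, 1.0])
```

Separating streams this way is what later lets spectrum and excitation be clustered with independent decision trees, since each stream keeps its own output distribution.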
Training process
Compute variancefloor (HCompV)
Initialize CI-HMMs bysegmental k-means (HInit)
Reestimate CI-HMMs byEM algorithm
(HRest amp HERest)
Copy CI-HMMs to CD-HMMs (HHEd CL)
Reestimate CD-HMMs byEM algorithm (HERest)
Decision tree-basedclustering (HHEd TB)
Reestimate CD-HMMsby EM algorithm (HERest)
Untie parameter tyingstructure (HHEd UT)
monophone(context-independent CI)
fullcontext(context-dependent CD)
EstimatedHMMs
data amp labels
Estimate CD-dur modelsfrom FB stats (HERest)
Decision tree-basedclustering (HHEd TB)
Estimated dur models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling

• Preceding, succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding, current, succeeding syllable
• Accent, stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding, succeeding stressed/accented syllables in phrase
• # of syllables from previous/to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous/to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]

[Figure: binary decision tree with context questions such as L=voice, L=w, R=silence, L=gy at internal nodes (yes/no branches); context-dependent models such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n are mapped to leaf nodes to form synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]

[Figure: state occupancy over t = 1 … T = 8 — entering state i at t0 and leaving at t1 + 1]

Probability to enter state i at t0 then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum and for F0, plus a decision tree for state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence

[Figure: 3-state left-to-right HMM (initial probability π1; transitions a11, a12, a22, a23, a33; output distributions b1(ot), b2(ot), b3(ot)) aligned to observation sequence O = o1 o2 … oT, with state sequence Q = 1 1 1 1 2 2 3 3 and state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state mean and variance over time]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features

State output vectors include static & dynamic features:

ot = [cᵀt, ∆cᵀt]ᵀ,   ∆ct = ct − ct−1

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W is a sparse band matrix: each static row selects ct ([… 0 I 0 0 …]) and each delta row computes ∆ct = ct − ct−1 ([… −I I 0 0 …]), mapping the static sequence c = [… ct−2 ct−1 ct ct+1 …] to the stacked static + dynamic sequence o = [… ct−1 ∆ct−1 ct ∆ct ct+1 ∆ct+1 …].

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
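The band matrix W described above is easy to construct explicitly. A minimal sketch for 1-dim static features with only a first-order delta window, taking c0 = 0 for the first frame's delta (a simplifying boundary assumption; real systems also include ∆² rows and handle boundaries more carefully):

```python
import numpy as np

def build_w(num_frames):
    """Window matrix W with one static row and one delta row per frame,
    so that o = W @ c, where o = [c1, dc1, c2, dc2, ...] and
    dct = ct - c(t-1) (c0 taken as 0)."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0          # static row: picks out c_t
        W[2 * t + 1, t] = 1.0      # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # ... minus c_{t-1}
    return W

W = build_w(3)
o = W @ np.array([1.0, 3.0, 2.0])  # statics 1, 3, 2 -> deltas 1, 2, -1
```

Stacking statics and deltas into one linear map is what lets the generation step on the next slide be solved in closed form.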
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0:

WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹µ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]

[Figure: the banded linear system WᵀΣq⁻¹W c = WᵀΣq⁻¹µq written out element-wise — the window matrix W (0/1/−1 band structure), state means µq1 … µqT, and the solved static feature sequence c1 … cT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
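Solving this banded system is the whole of maximum-likelihood parameter generation for single-Gaussian states. A minimal 1-dim sketch with diagonal covariances and only a first-order delta window (real implementations use ∆² windows and banded Cholesky solvers; the dense solve here is for clarity, and the toy means/variances are invented):

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c.
    mu, var: shape (2T,), ordered [static_1, delta_1, static_2, delta_2, ...];
    delta c_t = c_t - c_{t-1}, with c_0 taken as 0."""
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row
        W[2 * t + 1, t] = 1.0           # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / np.asarray(var))  # Sigma^-1 (diagonal covariance)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ np.asarray(mu))

# Step-wise static means 0, 0, 1, 1 with zero-mean deltas:
# the delta constraints pull the solution toward a smooth rising track
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0])
c = mlpg(mu, np.full(8, 0.1))
```

This is why the generated trajectory on the next slide is smooth rather than the step-wise mean sequence: the delta statistics couple neighboring frames through W.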
Generated speech parameter trajectory

[Figure: static and dynamic feature tracks — per-state means and variances, with the smooth generated static trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Figure: the full training/synthesis pipeline again — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

[Figure: source-filter model — pulse train (voiced) / white noise (unvoiced) excitation e(n), driven by the generated excitation parameters (log F0 with V/UV), passed through a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP): x(n) = h(n) ∗ e(n) → synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
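The source-filter picture x(n) = h(n) ∗ e(n) can be sketched in a few lines: build an excitation (pulse train when voiced, white noise when unvoiced) and convolve it with an impulse response. The decaying-exponential impulse response below is a made-up stand-in, not a real vocal-tract filter like the MLSA filter on the next slide.

```python
import numpy as np

def excitation(num_samples, f0, fs, voiced, rng):
    """Pulse train at the fundamental period for voiced frames,
    white noise for unvoiced frames."""
    if voiced:
        e = np.zeros(num_samples)
        period = int(fs / f0)
        e[::period] = 1.0                    # one pulse per pitch period
        return e
    return rng.standard_normal(num_samples)  # white noise

rng = np.random.default_rng(0)
fs = 16000                                   # 16 kHz sampling rate, as in the talk
e = excitation(400, f0=200.0, fs=fs, voiced=True, rng=rng)
h = np.exp(-np.arange(32) / 8.0)             # toy decaying impulse response h(n)
x = np.convolve(e, h)                        # x(n) = h(n) * e(n)
```

Because h(n) here is time-invariant, this toy produces a static timbre; actual synthesis updates the filter every 5-ms frame from the generated spectral parameters.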
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ obtained by interpolating HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ1) … I(λ′, λ4)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation:
difficult to model mixes of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) — white noise in unvoiced segments, pulse train in voiced segments]

• Spectral envelope extraction:
harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]

• Phase:
important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
statistics do not vary within an HMM state

• Conditional independence assumption:
state output probability depends only on the current state

• Weak duration modeling:
state duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms over 0–8 kHz]

• Why?

− Details of spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics
(adaptation; interpolation via eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer yields w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)   μ1(x) = z3   σ1(x) = exp(z5)
w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)   μ2(x) = z4   σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
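The three activation rules can be written out directly; this is a minimal sketch of the 1-dim, 2-mix case on the slide (function and variable names are mine):

```python
import math

def mdn_outputs(z):
    """Map raw outputs z1..z6 of a 1-dim, 2-mix MDN to GMM parameters:
    softmax for weights, linear for means, exponential for std devs."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)  # softmax: w1 + w2 = 1
    mu = (z3, z4)                                     # linear: unconstrained means
    sigma = (math.exp(z5), math.exp(z6))              # exp: strictly positive
    return w, mu, sigma
```

With all-zero activations this gives equal weights (0.5, 0.5), zero means, and unit standard deviations.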
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w, μ, σ for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS:

HMM       1 mix    3.537 ± 0.113
          2 mix    3.397 ± 0.115
DNN       4×1024   3.635 ± 0.127
          5×1024   3.681 ± 0.109
          6×1024   3.652 ± 0.108
          7×1024   3.637 ± 0.129
DMDN      1 mix    3.654 ± 0.117
(4×1024)  2 mix    3.796 ± 0.107
          4 mix    3.766 ± 0.113
          8 mix    3.805 ± 0.113
          16 mix   3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1}; outputs y_{t−1}, y_t, y_{t+1}; recurrent connections between hidden states]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
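The decay through recurrent connections can be illustrated with a scalar toy recurrence (not the slide's network): with a recurrent weight of magnitude below one, the contribution of an input fades geometrically.

```python
# Toy linear recurrence h_t = w * h_{t-1}: how long does an input survive?
w = 0.5          # recurrent weight with |w| < 1 (illustrative value)
h = 1.0          # information injected at t = 0
for t in range(20):
    h = w * h

# After 20 steps the original signal has shrunk to 0.5**20, roughly 1e-6.
print(h)
```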
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); sigmoid gates and tanh input/output non-linearities, each fed by x_t, h_{t−1}, and a bias (b_i, b_f, b_o, b_c)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
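One step of the block above might be sketched as follows (scalar toy; the parameter names and the (w_x, w_h, b) packing are assumptions, and a real layer uses weight matrices):

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step: a linear memory cell c guarded by multiplicative
    input (write), forget (reset), and output (read) gates.
    p maps gate names 'i', 'f', 'c', 'o' to (w_x, w_h, b) triples (assumed)."""
    act = lambda name, fn: fn(p[name][0] * x + p[name][1] * h_prev + p[name][2])
    i = act("i", sigm)        # input gate: write
    f = act("f", sigm)        # forget gate: reset
    g = act("c", math.tanh)   # cell candidate
    o = act("o", sigm)        # output gate: read
    c = f * c_prev + i * g    # linear memory cell update
    h = o * math.tanh(c)      # gated, squashed cell output
    return h, c
```

With all parameters zero, each gate sits at 0.5, so half the previous cell content is carried over unchanged.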
LSTM-based SPSS [33, 34]

[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → linguistic features x1, x2, …, xT → LSTM → acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Training process
[Flowchart: data & labels drive the following steps]

• Compute variance floor (HCompV)
• Initialize CI-HMMs by segmental k-means (HInit) — monophone (context-independent, CI)
• Reestimate CI-HMMs by EM algorithm (HRest & HERest)
• Copy CI-HMMs to CD-HMMs (HHEd CL) — fullcontext (context-dependent, CD)
• Reestimate CD-HMMs by EM algorithm (HERest)
• Decision tree-based clustering (HHEd TB)
• Reestimate CD-HMMs by EM algorithm (HERest)
• Untie parameter tying structure (HHEd UT)
• Estimate CD-dur models from FB stats (HERest)
• Decision tree-based clustering (HHEd TB)

→ Estimated HMMs & estimated duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: decision tree clustering fullcontext models (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) with yes/no questions such as L=voice?, L=w?, R=silence?, L=gy?; the leaf nodes hold the clustered (synthesized) states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: a state i occupied from t0 to t1 within a T = 8 frame observation sequence]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ Estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Diagram — Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter (driven by the generated spectral parameters) → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with transition probabilities a11, a12, a22, a23, a33 and output distributions b1(o_t), b2(o_t), b3(o_t), aligned to an observation sequence O = o1, o2, …, oT with state sequence Q = 1,1,1,1,2,2,…,3,3 and state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state means and variances; ô becomes a step-wise mean vector sequence]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,   ∆c_t = c_t − c_{t−1}

(c_t is M-dimensional, so o_t is 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc:

[…, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, …]⊤ = W […, c_{t−2}, c_{t−1}, c_t, c_{t+1}, …]⊤

where W is a sparse band matrix whose block rows are
[… 0 I 0 0 …], [… −I I 0 0 …], [… 0 0 I 0 …], [… 0 −I I 0 …], [… 0 0 0 I …], [… 0 0 −I I …]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
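For a 1-dimensional static feature (so each block I above is the scalar 1), W can be built directly; this sketch uses the simple delta ∆c_t = c_t − c_{t−1} from the slide and keeps the boundary delta as c_1 (a convention I chose):

```python
def build_window_matrix(T):
    """Return the (2T x T) matrix W with rows alternating between the static
    value c_t and the delta c_t - c_{t-1}, as nested lists."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0               # static row: picks out c_t
        W[2 * t + 1][t] = 1.0           # delta row: +c_t ...
        if t > 0:
            W[2 * t + 1][t - 1] = -1.0  # ... minus c_{t-1}
    return W

W = build_window_matrix(3)
print(W[5])  # delta row for the last frame: [0.0, -1.0, 1.0], i.e. c_3 - c_2
```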
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
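The normal equations above can be solved numerically; here is a minimal NumPy sketch for one 1-dimensional stream with diagonal Σ (the function name and the interleaved [static, delta] layout are my assumptions):

```python
import numpy as np

def mlpg(mu, var):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for the static trajectory c.
    mu, var: length-2T sequences, interleaved [static, delta] per frame,
    with delta c_t = c_t - c_{t-1} (boundary delta kept as c_1)."""
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row
        W[2 * t + 1, t] = 1.0           # delta row: +c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row: -c_{t-1}
    P = np.diag(1.0 / var)              # Sigma^-1 (diagonal covariance assumed)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
```

With huge delta variances the delta rows carry no weight and c collapses to the static means; small delta variances couple neighbouring frames, which is the smoothing effect behind the generated trajectories.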
Speech parameter generation algorithm [9]
[Figure: band structure of the linear system W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂ — the banded 1/−1 pattern of W, the diagonal Σ_q̂⁻¹, the stacked means μ_q̂1, μ_q̂2, …, μ_q̂T, and the solution c1, c2, …, cT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
[Figure: generated static and dynamic parameter trajectories c, with the per-state means and variances they follow]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Diagram — Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter (driven by the generated spectral parameters) → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Diagram: generated excitation parameters (log F0 with V/UV) switch between a pulse train (voiced) and white noise (unvoiced) to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
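The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched in a few lines (toy values: the impulse response and the 100 Hz / 16 kHz settings are illustrative, and zeros stand in for the unvoiced white-noise branch):

```python
def excitation(f0, fs, n):
    """Pulse train at f0 Hz for voiced frames; zeros stand in for the
    white-noise branch used when f0 is unset (unvoiced)."""
    if f0 <= 0:
        return [0.0] * n
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def lti_filter(h, e):
    """x(n) = h(n) * e(n): convolve the excitation with the impulse response."""
    x = [0.0] * len(e)
    for i in range(len(e)):
        for k, hk in enumerate(h):
            if i >= k:
                x[i] += hk * e[i - k]
    return x

e = excitation(100.0, 16000, 480)  # 100 Hz pulses over 30 ms at 16 kHz
h = [1.0, 0.8, 0.3]                # toy impulse response (assumed)
x = lti_filter(h, e)
```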
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB), 0–8 kHz]
• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree over the acoustic space — yes/no context questions partition the space into clustered states]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature) for frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame-level features]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Context-dependent acoustic modeling
• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes in {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79
Decision tree-based state clustering [7]
[Figure: binary decision tree clustering context-dependent states. Questions such as L=voiced?, L="w"?, R=silence?, L="gy"? split the states; models such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n are pooled at the leaf nodes, which are shared by the synthesized states.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
[Figure: trellis over states i and frames t = 1 ... T = 8, with the state occupied from t0 to t1]

Probability of entering state i at t0 and then leaving at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
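The occupancy statistic above can be computed directly once the forward/backward variables are available. A toy numpy sketch (not the HTS implementation; `alpha`, `beta`, `a` and `b` are assumed to be precomputed forward variables, backward variables, transition probabilities and per-frame state-output probabilities):

```python
import numpy as np

def chi(alpha, beta, a, b, t0, t1):
    """Unnormalized probability of entering each state i at t0 and
    leaving at t1 + 1. alpha, beta, b have shape (T, n_states); a has
    shape (n_states, n_states)."""
    n_states = a.shape[0]
    out = np.zeros(n_states)
    for i in range(n_states):
        # enter state i at t0 from some other state j
        enter = sum(alpha[t0 - 1, j] * a[j, i]
                    for j in range(n_states) if j != i)
        # stay in i for t1 - t0 self-transitions, emitting o_{t0}..o_{t1}
        stay = a[i, i] ** (t1 - t0) * np.prod(b[t0:t1 + 1, i])
        # leave to some other state k at t1 + 1
        leave = sum(a[i, k] * b[t1 + 1, k] * beta[t1 + 1, k]
                    for k in range(n_states) if k != i)
        out[i] = enter * stay * leave
    return out
```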
Stream-dependent tree-based clustering
[Figure: stream-dependent clustering — separate decision trees for mel-cepstrum, for F0, and for the state duration model of each HMM]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: block diagram of HMM-based speech synthesis.
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → labels + spectral/excitation parameters → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_{11}, a_{12}, a_{22}, a_{23}, a_{33}, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t)]

Observation sequence O: o_1, o_2, o_3, o_4, o_5, ..., o_T
State sequence Q: 1 1 1 1 2 2 ... 3 3
State durations D: 4, 10, 5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: step-wise trajectory following each state's mean, with state variances]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
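Without dynamic features, maximizing p(o | q̂, λ) frame by frame simply copies each state's mean for the duration of that state. A minimal sketch with made-up means and durations:

```python
import numpy as np

# Hypothetical 3-state HMM: per-state means and durations (in frames).
state_means = np.array([1.0, 3.0, 2.0])
state_durations = [4, 10, 5]

# With no dynamic feature constraints, the most probable output is each
# state's mean repeated for that state's duration: a step-wise sequence.
o_hat = np.repeat(state_means, state_durations)
```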
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t^⊤, Δc_t^⊤]^⊤,  Δc_t = c_t − c_{t−1}

(c_t is the M-dimensional static feature vector; o_t is 2M-dimensional)

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where each frame contributes two block rows to W:

  c_t row:   [ ⋯  0  I  0  ⋯ ]   (picks out c_t)
  Δc_t row:  [ ⋯ −I  I  0  ⋯ ]   (computes c_t − c_{t−1})

so the stacked vector [⋯, c_{t−1}^⊤, Δc_{t−1}^⊤, c_t^⊤, Δc_t^⊤, c_{t+1}^⊤, Δc_{t+1}^⊤, ⋯]^⊤ equals W applied to [⋯, c_{t−2}^⊤, c_{t−1}^⊤, c_t^⊤, c_{t+1}^⊤, ⋯]^⊤.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
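For 1-dimensional features, the window matrix W above can be built explicitly. A sketch (boundary handling simplified: Δc_1 is taken as c_1, which real toolkits treat differently):

```python
import numpy as np

def delta_window_matrix(T):
    """Build the (2T x T) matrix W mapping static features c_1..c_T to
    stacked [c_t, delta c_t] pairs, with delta c_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c_t
        W[2 * t + 1, t] = 1.0        # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # ... minus c_{t-1}
    return W

c = np.array([0.0, 1.0, 3.0, 2.0])
o = delta_window_matrix(4) @ c       # o = Wc stacks [c_t, delta c_t]
```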
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

[Figure: band structure of the linear system — the delta-window matrix W^⊤ and the diagonal inverse covariances Σ_q̂^{−1} yield a banded matrix W^⊤ Σ_q̂^{−1} W on the left-hand side and the stacked terms W^⊤ Σ_q̂^{−1} [μ_q̂,1^⊤, μ_q̂,2^⊤, ..., μ_q̂,T^⊤]^⊤ on the right; solving gives the static features c_1, c_2, ..., c_T]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
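The banded normal equations above reduce to a small linear solve. A self-contained 1-dimensional sketch with made-up means and variances (a toy version of the generation algorithm, not a production implementation):

```python
import numpy as np

def delta_window_matrix(T):
    """(2T x T) window matrix stacking [c_t, delta c_t] per frame."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var):
    """Solve W' P W c = W' P mu with P = diag(1/var); mu and var are
    the stacked static/delta means and variances, shape (2T,)."""
    T = len(mu) // 2
    W = delta_window_matrix(T)
    P = np.diag(1.0 / var)                 # diagonal inverse covariance
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# Made-up per-frame statistics: step-wise static means plus small delta
# means pull the generated trajectory toward a smooth curve.
mu = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 1.0, 0.0])
var = np.ones(8)
c = mlpg(mu, var)
```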
Generated speech parameter trajectory
[Figure: generated speech parameter trajectory — static and dynamic components of c, with per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: block diagram of HMM-based speech synthesis.
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → labels + spectral/excitation parameters → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: source-filter waveform reconstruction]

Generated excitation parameters (log F0 with V/UV) → excitation e(n): pulse train (voiced) or white noise (unvoiced)
Generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n)
Synthesized speech: x(n) = h(n) ∗ e(n)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
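The source-filter reconstruction above can be sketched in a few lines; the impulse response used here is an arbitrary decaying exponential stand-in, not a real vocal-tract filter:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
f0 = 200.0                       # fundamental frequency, voiced case
n = 800                          # 50 ms of samples

voiced = True
if voiced:
    e = np.zeros(n)
    period = int(fs / f0)
    e[::period] = 1.0            # pulse train at the pitch period
else:
    e = np.random.randn(n)       # white noise for unvoiced speech

h = 0.9 ** np.arange(64)         # toy impulse response h(n)
x = np.convolve(e, h)[:n]        # x(n) = h(n) * e(n)
```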
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: average-voice model — adaptive training over training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1 ... λ4 with ratios I(λ′, λi), yielding a new voice λ′]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model mixes of voiced/unvoiced sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz]

• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
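The weak-duration-modeling point is easy to see numerically: an HMM's implicit state-duration distribution is geometric, so the most likely duration is always a single frame, unlike real phone durations:

```python
import numpy as np

# Probability of staying in a state for exactly d frames with
# self-transition probability a_ii: P(d) = a_ii^(d-1) * (1 - a_ii).
a_ii = 0.9
d = np.arange(1, 200)
p = a_ii ** (d - 1) * (1 - a_ii)   # geometric duration distribution
mode = d[np.argmax(p)]             # most probable duration
```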
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs generated spectra, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree splitting the acoustic space with yes/no questions down to leaf distributions]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
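A minimal sketch of the forward pass that replaces the tree/GMM lookup: a feed-forward network regresses acoustic features from a linguistic feature vector. Layer sizes and weights below are arbitrary placeholders, not the configuration used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [347, 1024, 1024, 1024, 1024, 127]      # in, 4 hidden, out
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)                  # hidden: non-linear
    return h @ weights[-1] + biases[-1]         # output: linear regression

y = forward(rng.standard_normal(347))           # predicted acoustic features
```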
Framework
[Figure: DNN-based SPSS framework.
TEXT → text analysis → input feature extraction (with duration prediction): per-frame input features including binary & numeric features, duration feature and frame position feature for frames 1 ... T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence: spectral features, excitation features, V/UV feature → parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5-th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α = 1) and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples from a one-to-many mapping and the NN prediction running through their conditional mean]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mixture mixture density network — the output layer yields w_1(x), w_2(x), μ_1(x), μ_2(x), σ_1(x), σ_2(x), which define a GMM over y]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

• Weights → softmax activation function:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),  w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
• Means → linear activation function:
  μ_1(x) = z_3,  μ_2(x) = z_4
• Variances → exponential activation function:
  σ_1(x) = exp(z_5),  σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
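The activation mapping on this slide, for the 1-dimensional 2-mixture case, can be written directly; the negative log-likelihood shown is the usual MDN training criterion (a sketch, with `z` taken as the already-computed last-layer activations z_1..z_6):

```python
import numpy as np

def mdn_params(z):
    """Map activations z_1..z_6 to GMM weights, means, std devs."""
    w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()   # weights: softmax
    mu = z[2:4]                                  # means: linear
    sigma = np.exp(z[4:6])                       # std devs: exponential
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of target y under the predicted GMM."""
    w, mu, sigma = mdn_params(z)
    pdf = w / (np.sqrt(2 * np.pi) * sigma) \
        * np.exp(-0.5 * ((y - mu) / sigma) ** 2)
    return -np.log(pdf.sum())

w, mu, sigma = mdn_params(np.array([0.1, -0.2, 1.0, 3.0, 0.0, 0.5]))
```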
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → DMDN maps each input x_t to mixture parameters w_i(x_t), μ_i(x_t), σ_i(x_t) for t = 1 ... T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113, 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127, 5×1024 3.681 ± 0.109, 6×1024 3.652 ± 0.108, 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117, 2 mix 3.796 ± 0.107, 4 mix 3.766 ± 0.113, 8 mix 3.805 ± 0.113, 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of {preceding, succeeding} contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in the hidden layers loops through the recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
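A minimal Elman-style RNN in numpy makes the recurrence explicit: the hidden state carries context from previous frames through the recurrent weights. All weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h, dim_y = 8, 16, 4
Wxh = rng.standard_normal((dim_x, dim_h)) * 0.1   # input -> hidden
Whh = rng.standard_normal((dim_h, dim_h)) * 0.1   # recurrent connections
Why = rng.standard_normal((dim_h, dim_y)) * 0.1   # hidden -> output

def rnn_forward(xs):
    h = np.zeros(dim_h)
    ys = []
    for x in xs:                        # left-to-right over frames
        h = np.tanh(x @ Wxh + h @ Whh)  # hidden: input + recurrence
        ys.append(h @ Why)              # per-frame output
    return np.array(ys)

ys = rnn_forward(rng.standard_normal((20, dim_x)))
```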
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — the memory cell c_t is written through a tanh input non-linearity gated by the input gate (write), reset by the forget gate, and read out through a tanh and the output gate (read); each gate is a sigmoid fed by x_t, h_{t−1} and a bias (b_i, b_f, b_o, b_c), producing the hidden output h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
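The gating equations in the figure, for a single step of a vanilla LSTM cell (no peepholes or projection; random placeholder weights):

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh = 8, 16

def rand_mat(m, n):
    return rng.standard_normal((m, n)) * 0.1

Wi, Wf, Wo, Wc = (rand_mat(dx + dh, dh) for _ in range(4))
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigm(z @ Wi)                      # input gate: write
    f = sigm(z @ Wf)                      # forget gate: reset
    o = sigm(z @ Wo)                      # output gate: read
    c = f * c_prev + i * np.tanh(z @ Wc)  # linear memory cell update
    h = o * np.tanh(c)                    # gated read-out
    return h, c

h, c = lstm_step(rng.standard_normal(dx), np.zeros(dh), np.zeros(dh))
```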
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x_1, x_2, ..., x_T to y_1, y_2, ..., y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ Δ | DNN w/o Δ | LSTM w/ Δ | LSTM w/o Δ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Decision tree-based state clustering [7]
L=voice
L=w R=silence yes
yes yes
no
no no
w-a+sil w-a+sh gy-a+pau
g-a+silgy-a+silw-a+t
k-a+b
t-a+n
leaf nodes
yes no yes no
synthesized states
R=silence L=gy
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
Stream-dependent tree-based clustering
Decision treesfor
mel-cepstrum
Decision treesfor F0
Spectrum amp excitation can have different context dependencyrarr Build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
i
1 2 3 4 5 6 7 T=8t
t0 t1
Probability to enter state i at t0 then leave at t1 + 1
χt0t1(i) propsum
j 6=i
αt0minus1(j)ajiat1minust0ii
t1prod
t=t0
bi(ot)sum
k 6=i
aikbk(ot1+1)βt1+1(k)
rarr estimate state duration modelsHeiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
State durationmodel
Decision treesfor
mel-cepstrum
Decision treesfor F0
Decision treefor state durmodels
HMM
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training HMMs → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → generated spectral and excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

    pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)

The generated excitation parameter (log F0 with V/UV) drives the excitation; the generated spectral parameter (cepstrum, LSP) defines the filter.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
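As a toy illustration of this pulse/noise excitation model, here is a sketch under simplified assumptions (fixed frame length, unit pulses, a made-up impulse response; not the deck's vocoder):

```python
import numpy as np

def make_excitation(f0, vuv, frame_len=80, fs=16000, seed=0):
    """Per-frame excitation: periodic pulse train when voiced,
    white noise when unvoiced (the simple pulse/noise model)."""
    rng = np.random.default_rng(seed)
    e = np.zeros(len(f0) * frame_len)
    next_pulse = 0.0
    for i, (hz, voiced) in enumerate(zip(f0, vuv)):
        start = i * frame_len
        if voiced:  # pulses spaced fs / F0 samples apart
            period = fs / hz
            while next_pulse < start + frame_len:
                e[int(next_pulse)] = 1.0
                next_pulse += period
        else:       # unvoiced: low-level white noise
            e[start:start + frame_len] = rng.standard_normal(frame_len) * 0.1
            next_pulse = start + frame_len
    return e

# Speech = excitation filtered by the (toy) impulse response:
# x(n) = h(n) * e(n)
h = np.array([1.0, 0.8, 0.5, 0.2])          # toy impulse response
e = make_excitation([100.0] * 4, [1] * 4)   # 4 voiced frames at 100 Hz
x = np.convolve(e, h)
```

At 16 kHz and 100 Hz the voiced pulses land 160 samples apart, and the convolution with h(n) plays the role of the synthesis filter on the previous slide.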
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
(Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers)

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
(Figure: target model λ′ obtained by interpolating among HMM sets λ1 … λ4 with ratios I(λ′, λ1) … I(λ′, λ4))

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  (Figure: unvoiced → white noise, voiced → pulse train, switched into the excitation e(n))

• Spectral envelope extraction: harmonic effects often cause problems
  (Figure: power spectrum in dB over 0–8 kHz)

• Phase: important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state

• Conditional independence assumption: state output probability depends only on the current state

• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

(Figure: natural vs. generated spectra, 0–8 kHz)

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
(Figure: decision tree of yes/no context questions partitioning the acoustic space)

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
(Figure: feedforward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y)

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
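The replacement is just a regression network from linguistic to acoustic features. A minimal sketch of such a forward pass (toy layer sizes and random weights of my choosing; a real system trains these on aligned linguistic/acoustic pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic features:
    sigmoid hidden layers, linear output layer (as in this setup)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden units
    return h @ weights[-1] + biases[-1]          # linear output layer

# Toy sizes: 300 linguistic inputs -> 2 x 256 hidden -> 127 acoustic outputs
sizes = [300, 256, 256, 127]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

y = dnn_forward(rng.standard_normal(300), weights, biases)
```

The output vector plays the role the clustered Gaussian means played in the HMM system: per-frame statistics of the speech parameter sequence, later fed to the parameter generation step.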
Framework
(Figure: TEXT → text analysis → input feature extraction; input features, including binary & numeric features at frames 1 … T plus duration and frame-position features, feed the input layer; hidden layers map them to the output layer, which gives statistics (mean & var) of the speech parameter vector sequence — spectral features, excitation features, and the V/UV feature; duration prediction supplies the frame alignment; parameter generation and waveform synthesis then produce SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? ... no:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

(Figure: 5th mel-cepstrum over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512))
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling

(Figure: scattered data samples in the (y1, y2) plane; the NN prediction falls between the modes)

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
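The conditional-mean problem is easy to demonstrate numerically: for a one-to-many target, the MSE-optimal constant prediction is the mean of the modes, a value the data itself never takes. A tiny sketch (toy data of my choosing):

```python
import numpy as np

# One-to-many data: for the same input, targets cluster at -1 and +1.
y = np.array([-1.0] * 50 + [1.0] * 50)

# The constant prediction minimizing mean squared error is the mean...
best = y.mean()

# ...which lies between the two modes, where no data actually lives;
# predicting either mode gives a higher MSE, so MSE training picks the mean.
mse_at_mean = np.mean((y - best) ** 2)
mse_at_mode = np.mean((y - 1.0) ** 2)
```

An MSE-trained network therefore collapses both modes to one averaged output, which is precisely why the mixture density output layer on the next slide is introduced.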
Mixture density network [26]
(Figure: network whose outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) define a mixture density over y)

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions (1-dim, 2-mix MDN):

    z_j = Σ_{i=1}^{4} h_i w_{ij}

    w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)        w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
    μ_1(x) = z_3                                    μ_2(x) = z_4
    σ_1(x) = exp(z_5)                               σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
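The output-layer transforms above are straightforward to sketch. A minimal version for a 1-dimensional target and a 2-component mixture (function and variable names are mine):

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split the network's raw output z into GMM parameters:
    softmax -> weights, identity -> means, exp -> std devs."""
    zw, zm, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())  # shift for numerical stability
    w /= w.sum()               # mixture weights sum to 1
    return w, zm, np.exp(zs)   # means unconstrained, scales positive

# Raw outputs z_1..z_6 as on the slide: [z_w1, z_w2, z_mu1, z_mu2, z_s1, z_s2]
w, mu, sigma = mdn_params(np.array([0.0, 1.0, -2.0, 3.0, 0.5, -0.5]), n_mix=2)
```

The softmax guarantees valid mixture weights and the exponential guarantees positive variances, so any raw network output defines a proper density over y.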
DMDN-based SPSS [27]
(Figure: TEXT → text analysis → input feature extraction → duration prediction; at each input x_t the network outputs mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t) defining a density over y; parameter generation and waveform synthesis produce SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113, 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127, 5×1024 3.681 ± 0.109, 6×1024 3.652 ± 0.108, 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117, 2 mix 3.796 ± 0.107, 4 mix 3.766 ± 0.113, 8 mix 3.805 ± 0.113, 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
(Figure: RNN with recurrent connections, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1})

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block — the input gate (write), forget gate (reset), and output gate (read), each a sigmoid of x_t and h_{t−1}, control a linear memory cell c_t; tanh nonlinearities shape the cell input and the output h_t)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
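A minimal sketch of one LSTM step with these three gates (the standard formulation; the parameter names and toy sizes are mine, not the deck's):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input/forget/output gates (sigmoids of x_t and
    h_{t-1}) write to, reset, and read from the linear memory cell."""
    i = sigmoid(x @ p["Wi"] + h_prev @ p["Ri"] + p["bi"])  # write gate
    f = sigmoid(x @ p["Wf"] + h_prev @ p["Rf"] + p["bf"])  # reset gate
    o = sigmoid(x @ p["Wo"] + h_prev @ p["Ro"] + p["bo"])  # read gate
    g = np.tanh(x @ p["Wc"] + h_prev @ p["Rc"] + p["bc"])  # candidate
    c = f * c_prev + i * g       # linear memory cell update
    h = o * np.tanh(c)           # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_cell = 8, 4
p = {}
for gate in "ifoc":
    p["W" + gate] = rng.standard_normal((n_in, n_cell)) * 0.1
    p["R" + gate] = rng.standard_normal((n_cell, n_cell)) * 0.1
    p["b" + gate] = np.zeros(n_cell)

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_cell), np.zeros(n_cell), p)
```

Because the cell update is additive (f·c + i·g) rather than repeatedly squashed, gradients survive across many time steps, which is the property the slide's "better memory" refers to.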
LSTM-based SPSS [33 34]
(Figure: TEXT → text analysis → input feature extraction → duration prediction; the LSTM maps input features x_1 … x_T directly to acoustic features y_1 … y_T; parameter generation and waveform synthesis produce SPEECH)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Stream-dependent tree-based clustering
(Figure: separate decision trees for mel-cepstrum and for F0)

Spectrum & excitation can have different context dependency → build decision trees individually
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79
State duration models [8]
(Figure: trellis over states i and frames t = 1 … T = 8, entering state i at t0 and leaving at t1 + 1)

Probability to enter state i at t0 and then leave at t1 + 1:

    χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
(Figure: HMM with separate decision trees for mel-cepstrum and for F0, plus a decision tree for state duration models)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training HMMs → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → generated spectral and excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

    ô = arg max_o p(o | w, λ)
      = arg max_o Σ_{∀q} p(o, q | w, λ)
      ≈ arg max_o max_q p(o, q | w, λ)
      = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and the outputs sequentially:

    q̂ = arg max_q P(q | w, λ)
    ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
(Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and output distributions b1(o_t), b2(o_t), b3(o_t); the observation sequence O = o1 o2 … o_T aligns to the state sequence Q with state durations D = 4, 10, 5)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

(Figure: step-wise mean vector sequence with state variances)

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

    o_t = [c_tᵀ, Δc_tᵀ]ᵀ,   Δc_t = c_t − c_{t−1}

(c_t is M-dimensional, so o_t is 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc:

    [ … c_{t−1}ᵀ Δc_{t−1}ᵀ c_tᵀ Δc_tᵀ c_{t+1}ᵀ Δc_{t+1}ᵀ … ]ᵀ = W [ … c_{t−2}ᵀ c_{t−1}ᵀ c_tᵀ c_{t+1}ᵀ … ]ᵀ

where each frame t contributes two block rows to W: [ … 0 I 0 0 … ] for the static feature and [ … −I I 0 0 … ] for the delta feature, shifted one block column per frame.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees (yes/no questions at each node) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
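The mapping the slide describes can be sketched as a small feed-forward pass. This is an illustrative, untrained toy (random weights, made-up dimensions), not the deck's actual model:

```python
import numpy as np

# Minimal sketch of DNN-based acoustic modeling: a feed-forward
# network maps a linguistic feature vector x to an acoustic feature
# vector y through stacked sigmoid hidden layers and a linear output.
# Weights here are random placeholders, purely for illustration.

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights, biases):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers: sigmoid
        h = sigmoid(h @ W + b)
    return h @ weights[-1] + biases[-1]          # linear output layer

dims = [8, 16, 16, 4]   # linguistic dim -> two hidden layers -> acoustic dim
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]

y = dnn_forward(rng.standard_normal(8), Ws, bs)
```

Training would fit the weights by backpropagation on aligned linguistic/acoustic pairs; synthesis is just this forward pass per frame.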
Framework
[Figure: TEXT → text analysis → input feature extraction → input features including binary & numeric features at frames 1…T → DNN (input, hidden and output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the duration and frame-position features]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α = 1), and DNN (4×512) trajectories]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples in (y1, y2) space with two modes; the NN prediction falls between them, at the conditional mean]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
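The unimodality problem on this slide can be shown numerically with a toy example (the data values below are made up for illustration): when the same input maps to two target modes, the MSE-optimal prediction is their mean, which lies where no data exists.

```python
import numpy as np

# Toy illustration of the unimodality limitation: for a single fixed
# input, the data has two modes (y = -1 and y = +1). The constant
# prediction minimising mean-squared error is the conditional mean
# (0 here), a value in a region with no data at all.

targets = np.array([-1.0, -1.0, 1.0, 1.0])   # one-to-many: two modes

candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])
best = candidates[int(np.argmin(mse))]        # ends up at ~0, between the modes
```

A mixture density output layer avoids this by predicting a multimodal distribution instead of a single point.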
Mixture density network [26]
[Figure: 1-dim, 2-mix mixture density network; the output layer yields w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x), defining a GMM over y]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  µ1(x) = z3,  µ2(x) = z4

Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
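The output-layer mapping on this slide is easy to sketch directly: the six pre-activations z1…z6 of a 1-dim, 2-mix MDN are turned into weights (softmax), means (identity) and standard deviations (exp). The function name and example values are illustrative:

```python
import numpy as np

# Sketch of a 1-dim, 2-mix mixture density output layer, following the
# slide: weights via softmax, means via the identity, standard
# deviations via exp (guaranteeing positivity).

def mdn_params(z):
    zw, zmu, zsigma = z[:2], z[2:4], z[4:6]
    w = np.exp(zw) / np.exp(zw).sum()   # softmax -> weights sum to 1
    mu = zmu                            # linear -> means
    sigma = np.exp(zsigma)              # exp -> positive std deviations
    return w, mu, sigma

w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.5]))
```

With equal weight logits the two components get weight 0.5 each; training maximizes the resulting GMM's likelihood of the target.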
DMDN-based SPSS [27]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → DMDN predicting per-frame GMM parameters w_m(x_t), µ_m(x_t), σ_m(x_t) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time; inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through recurrent hidden connections]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t and tanh input/output non-linearities; input gate (write), forget gate (reset) and output gate (read), each a sigmoid over x_t and h_{t−1}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
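The gate structure in the block diagram can be written out as one recurrence step. This is a generic, minimal LSTM step with illustrative random weights (not the deck's configuration, which also used a projection layer):

```python
import numpy as np

# Minimal single LSTM step: input gate i (write), forget gate f
# (reset) and output gate o (read) multiplicatively control the
# linear memory cell c. Weights below are random placeholders.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps [x; h_prev] to the 4 stacked pre-activations (i, f, o, g)
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])
    c = f * c_prev + i * g    # gated write into / reset of the memory cell
    h = o * np.tanh(c)        # gated read out of the cell
    return h, c

rng = np.random.default_rng(0)
x, h0, c0 = rng.standard_normal(3), np.zeros(2), np.zeros(2)
W = rng.standard_normal((8, 5)) * 0.1
h1, c1 = lstm_step(x, h0, c0, W, np.zeros(8))
```

Because the cell update is additive (f·c + i·g), gradients decay far more slowly than in a plain RNN, which is why the LSTM keeps long-range context.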
LSTM-based SPSS [33 34]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
State duration models [8]
[Figure: state-occupancy lattice over frames t = 1…T (T = 8); state i is occupied from t0 to t1]

Probability of entering state i at t0, then leaving at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · ∏_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
Stream-dependent tree-based clustering
[Figure: an HMM with stream-dependent clustering — separate decision trees for mel-cepstrum, for F0, and for the state duration model]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]
[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs on labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33 and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1 o2 … oT aligned to state sequence Q = 1 1 1 1 2 2 … 3 3, with state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state mean and variance bands over time]

ô becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = Wc, where W stacks, for each frame t, an identity block selecting c_t and a difference block computing ∆c_t = c_t − c_{t−1}:

[Figure: band matrix W with repeating blocks (0 I 0 0), (−I I 0 0), (0 0 I 0), (0 −I I 0), (0 0 0 I), (0 0 −I I) mapping (…, c_{t−1}, c_t, c_{t+1}, …) to (…, o_{t−1}, o_t, o_{t+1}, …); static features are M-dimensional, each o_t is 2M-dimensional]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
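The band matrix W above can be constructed explicitly. A minimal sketch for the slide's first-order delta (∆c_t = c_t − c_{t−1}, treating c_0 as zero; helper name is ours):

```python
import numpy as np

# Build W such that o = W c stacks static and delta features per frame:
# rows alternate between selecting c_t and computing c_t - c_{t-1}.
# T frames of M-dim static features -> 2*T*M-dim observation vector.

def build_W(T, M):
    W = np.zeros((2 * T * M, T * M))
    I = np.eye(M)
    for t in range(T):
        W[2*t*M:(2*t+1)*M, t*M:(t+1)*M] = I            # static row: c_t
        W[(2*t+1)*M:(2*t+2)*M, t*M:(t+1)*M] = I        # delta row: +c_t
        if t > 0:
            W[(2*t+1)*M:(2*t+2)*M, (t-1)*M:t*M] = -I   # delta row: -c_{t-1}
    return W

W = build_W(T=3, M=1)
c = np.array([1.0, 3.0, 2.0])
o = W @ c   # interleaved [c_1, Δc_1, c_2, Δc_2, c_3, Δc_3]
```

Practical systems also append ∆² rows (a second difference) in the same banded pattern.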
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0,

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ µ_q̂
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ µ_q̂

[Figure: banded structure of W⊤ Σ_q̂⁻¹ W and of W⊤ Σ_q̂⁻¹ µ_q̂, assembled from the static/delta selector blocks of W and the per-state means µ_{q̂,1} … µ_{q̂,T}; the linear system is solved for the static trajectory c_1 … c_T]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
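The generation step on this slide reduces to solving one banded linear system. A minimal numeric sketch for a 1-dim static feature with diagonal Σ (the means and variances below are made-up placeholders, not model output):

```python
import numpy as np

# Sketch of speech parameter generation: solve
#   W^T Σ^{-1} W c = W^T Σ^{-1} μ
# for the smooth static trajectory c, given per-frame means μ and
# (diagonal) variances over stacked static + delta features.

def build_W(T):
    # 1-dim static feature; rows alternate static c_t and delta c_t - c_{t-1}
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2*t, t] = 1.0
        W[2*t + 1, t] = 1.0
        if t > 0:
            W[2*t + 1, t - 1] = -1.0
    return W

T = 5
W = build_W(T)
mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # [c, Δc] per frame
inv_var = np.ones(2 * T)                   # Σ^{-1} (diagonal, illustrative)
A = W.T @ (inv_var[:, None] * W)
b = W.T @ (inv_var * mu)
c = np.linalg.solve(A, b)                  # maximum-likelihood static trajectory
```

Because A is banded, production implementations solve it in O(T) with a Cholesky band solver instead of a dense `solve`.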
Generated speech parameter trajectory
[Figure: generated static and dynamic speech parameter trajectories c, overlaid on the per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs on labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
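The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched in a few lines. All numbers below (F0, filter shape, lengths) are illustrative assumptions, not the deck's vocoder:

```python
import numpy as np

# Toy source-filter synthesis: a voiced excitation (pulse train at an
# assumed 200 Hz fundamental) or an unvoiced excitation (white noise)
# is convolved with an illustrative impulse response h(n),
# giving x(n) = h(n) * e(n).

fs = 16000
N = 800

period = fs // 200                  # samples per pitch period (assumed F0)
e_voiced = np.zeros(N)
e_voiced[::period] = 1.0            # pulse train for voiced frames

e_unvoiced = np.random.default_rng(0).standard_normal(N) * 0.1  # noise for unvoiced

h = np.exp(-np.arange(50) / 10.0)   # toy decaying impulse response
x = np.convolve(e_voiced, h)[:N]    # x(n) = h(n) * e(n)
```

Real vocoders replace the toy h(n) with a filter derived from the generated spectral parameters (e.g. the MLSA filter for mel-cepstra) and switch e(n) per frame using the V/UV decision.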
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model from training speakers; adaptation of that model to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new voice λ′ obtained by interpolating among HMM sets λ1–λ4 with ratios I(λ′, λ1) … I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction:
  harmonic effects often cause problems

[Figure: power spectrum (0–80 dB) up to 8 kHz showing harmonic ripple]

• Phase:
  important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics:
  statistics do not vary within an HMM state
• Conditional independence assumption:
  state-output probability depends only on the current state
• Weak duration modeling:
  state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra (0–8 kHz); the generated spectra lose fine formant structure]

• Why?
  − Details of spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]

[Figure: TEXT → text analysis / input feature extraction → duration prediction → frame inputs x1, x2, ..., xT; for each frame the DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture   4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture  4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization       AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix   3.537 ± 0.113
               2 mix   3.397 ± 0.115
DNN            4×1024  3.635 ± 0.127
               5×1024  3.681 ± 0.109
               6×1024  3.652 ± 0.108
               7×1024  3.637 ± 0.129
DMDN (4×1024)  1 mix   3.654 ± 0.117
               2 mix   3.796 ± 0.107
               4 mix   3.766 ± 0.113
               8 mix   3.805 ± 0.113
               16 mix  3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: input xt → hidden layer with recurrent connections → output yt, unrolled over frames t−1, t, t+1]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
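The recurrence in the figure can be sketched in a few lines; all dimensions and weights below are illustrative toys, not the talk's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 3-dim input, 4-dim hidden, 2-dim output.
Wx = rng.normal(size=(4, 3))
Wh = rng.normal(size=(4, 4))   # the recurrent connection
Wy = rng.normal(size=(2, 4))
bh = np.zeros(4)
by = np.zeros(2)

def rnn_forward(xs):
    """Unidirectional RNN: h_t depends on x_t and h_{t-1}, so each
    output y_t can use all previous contexts, but no future ones."""
    h = np.zeros(4)
    ys = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + bh)   # hidden state recurrence
        ys.append(Wy @ h + by)
    return np.array(ys)

xs = rng.normal(size=(5, 3))                 # 5 frames of input features
ys = rnn_forward(xs)
```

A bidirectional RNN would run a second pass over the reversed sequence and concatenate the two hidden states per frame.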
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — inputs xt and ht−1 (with biases bi, bf, bc, bo) feed sigmoid gate units and a tanh input unit; the input gate it writes to the linear memory cell ct, the forget gate resets it, and the output gate reads it through tanh to produce ht]

Input gate → Write
Output gate → Read
Forget gate → Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
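The gating scheme above can be sketched as one forward step; the parameter names (Wi, Wf, Wc, Wo and biases) follow the slide's gates, but the shapes and values are illustrative:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input gate (write), forget gate (reset), and
    output gate (read) around a linear memory cell."""
    z = np.concatenate([x, h_prev])
    i = sigm(p["Wi"] @ z + p["bi"])      # input gate: how much to write
    f = sigm(p["Wf"] @ z + p["bf"])      # forget gate: how much to reset
    g = np.tanh(p["Wc"] @ z + p["bc"])   # candidate cell input
    c = f * c_prev + i * g               # linear memory cell update
    o = sigm(p["Wo"] @ z + p["bo"])      # output gate: how much to read
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4                       # toy sizes
p = {k: rng.normal(scale=0.5, size=(n_hid, n_in + n_hid))
     for k in ("Wi", "Wf", "Wc", "Wo")}
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bc", "bo")})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run 5 frames through the cell
    h, c = lstm_step(x, h, c, p)
```

Because the cell update `c = f * c_prev + i * g` is linear in `c_prev`, information can persist over many frames when the forget gate stays near 1, which is the "better memory" the slide refers to.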
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis / input feature extraction → duration prediction → frame inputs x1, x2, ..., xT → LSTM → outputs y1, y2, ..., yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database              US English female speaker
Train / dev set data  34632 & 100 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   DNN: 449, LSTM: 289
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN                   4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                  1 forward LSTM layer; 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ Δ  DNN w/o Δ  LSTM w/ Δ  LSTM w/o Δ  Neutral    z     p
50.0      14.2       –          –           35.8     12.0   < 10⁻¹⁰
–         –          30.2       15.6        54.2      5.1   < 10⁻⁶
15.8      –          34.0       –           50.2     −6.2   < 10⁻⁹
28.4      –          –          33.6        38.0     −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum, for F0, and for the state duration model, all attached to the HMM]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79
HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_∀q p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1, o2, o3, o4, o5, ..., oT aligned to state sequence Q = 1, 1, 1, 1, 2, 2, ..., 3, 3; state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: per-state means and variances — ô becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features

State output vectors include static & dynamic features:

ot = [ct⊤, Δct⊤]⊤,  Δct = ct − ct−1

(ct: M-dim static features; ot: 2M-dim)

The relationship between static and dynamic features can be arranged as o = Wc:

⎡  ...  ⎤   ⎡ ...              ... ⎤ ⎡ ...  ⎤
⎢ ct−1  ⎥   ⎢ ...  0   I  0  0 ... ⎥ ⎢ ct−2 ⎥
⎢ Δct−1 ⎥   ⎢ ... −I   I  0  0 ... ⎥ ⎢ ct−1 ⎥
⎢ ct    ⎥ = ⎢ ...  0   0  I  0 ... ⎥ ⎢ ct   ⎥
⎢ Δct   ⎥   ⎢ ...  0  −I  I  0 ... ⎥ ⎢ ct+1 ⎥
⎢ ct+1  ⎥   ⎢ ...  0   0  0  I ... ⎥ ⎣ ...  ⎦
⎢ Δct+1 ⎥   ⎢ ...  0   0 −I  I ... ⎥
⎣  ...  ⎦   ⎣ ...              ... ⎦
    o                  W                c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
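For 1-dim static features, the band matrix W above can be constructed directly. A sketch assuming the boundary convention c0 = 0 for the first delta (the slide leaves boundaries implicit):

```python
import numpy as np

def delta_window_matrix(T):
    """Build W mapping a static sequence c = (c_1, ..., c_T) to
    o = (c_1, Δc_1, ..., c_T, Δc_T) with Δc_t = c_t − c_{t−1}.
    1-dim static features; Δc_1 uses c_0 = 0 as a boundary assumption."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row: c_t
        W[2 * t + 1, t] = 1.0          # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # ... minus c_{t-1}
    return W

c = np.array([1.0, 3.0, 2.0])
o = delta_window_matrix(3) @ c
# o interleaves statics and deltas: [c1, Δc1, c2, Δc2, c3, Δc3]
```

Each block row of W reproduces the 0/I and −I/I pattern shown in the matrix above.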
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μq̂, Σq̂)

By setting ∂ log N(Wc; μq̂, Σq̂) / ∂c = 0:

W⊤ Σq̂⁻¹ W c = W⊤ Σq̂⁻¹ μq̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
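The normal equations above can be solved with a dense linear solve for a toy sequence; real implementations exploit the band structure of W⊤Σ⁻¹W, and the per-frame means and variances below are made up for illustration:

```python
import numpy as np

def delta_window_matrix(T):
    """W for 1-dim statics with Δc_t = c_t − c_{t−1} (c_0 = 0 assumed)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

T = 4
W = delta_window_matrix(T)

# Per-frame [c, Δc] means of the single-Gaussian state outputs (toy values)
mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0])
var = np.full(2 * T, 0.1)        # diagonal covariance Σ
P = np.diag(1.0 / var)           # Σ⁻¹

# Solve Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ for the smooth static trajectory c
A = W.T @ P @ W
b = W.T @ P @ mu
c = np.linalg.solve(A, b)
```

Because A carries the delta constraints, the resulting `c` is a smooth compromise between the static means and the dynamic-feature means, rather than the step-wise sequence of the no-delta case.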
Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W⊤Σq⁻¹W c = W⊤Σq⁻¹μq — the banded matrix W⊤Σq⁻¹W (entries built from the 0/±1 pattern of W) multiplies the static feature sequence c = (c1, c2, ..., cT), matching W⊤Σq⁻¹ applied to the mean sequence (μq1, μq2, ..., μqT)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory

[Figure: static and dynamic means/variances and the generated smooth static trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
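The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched as follows; the exponential impulse response stands in for a real synthesis filter (e.g., MLSA) and is purely illustrative:

```python
import numpy as np

def excitation(f0, fs, n, voiced):
    """Pulse train at F0 when voiced, white noise when unvoiced."""
    if voiced:
        e = np.zeros(n)
        period = int(fs / f0)     # samples per pitch period
        e[::period] = 1.0         # one pulse per period
        return e
    return np.random.default_rng(0).normal(scale=0.1, size=n)

fs, n = 16000, 400                # 25 ms at 16 kHz
e = excitation(100.0, fs, n, voiced=True)

# Toy impulse response standing in for the synthesis filter h(n)
h = np.exp(-np.arange(64) / 8.0)

x = np.convolve(e, h)[:n]         # x(n) = h(n) * e(n)
```

The V/UV decision from the generated excitation parameters selects which branch of `excitation` runs, frame by frame, in an actual vocoder.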
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ obtained by interpolating among HMM sets λ1, λ2, λ3, λ4 with interpolation ratios I(λ′, λ)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation
− Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
− Harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz]

• Phase
− Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
− Statistics do not vary within an HMM state

• Conditional independence assumption
− State output probability depends only on the current state

• Weak duration modeling
− State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz]

• Why?
− Details of the spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, far more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
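A miniature sketch of how such linguistic features become a network input vector; the phone and POS inventories here are hypothetical stand-ins for the roughly 50 feature types used in practice:

```python
import numpy as np

# Hypothetical miniature inventories; real systems are far larger.
PHONES = ["sil", "a", "b", "k"]
POS_TAGS = ["noun", "verb", "other"]

def linguistic_features(phone, pos, stressed, syl_pos):
    """Encode one phone's context as binary (one-hot) plus numeric
    inputs — the kind of vector fed to a DNN (or asked about by an
    HMM decision-tree question set)."""
    v = np.zeros(len(PHONES) + len(POS_TAGS) + 2)
    v[PHONES.index(phone)] = 1.0                  # phone identity (binary)
    v[len(PHONES) + POS_TAGS.index(pos)] = 1.0    # POS of word (binary)
    v[-2] = 1.0 if stressed else 0.0              # lexical stress (binary)
    v[-1] = syl_pos                               # position in syllable (numeric)
    return v

x = linguistic_features("a", "noun", stressed=True, syl_pos=0.5)
```

Preceding and succeeding phones would each get their own one-hot block, which is why real input vectors run to hundreds of dimensions.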
HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree with yes/no splits partitioning the acoustic space — decision tree-clustered HMM with GMM state-output distributions]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: linguistic features x → hidden layers h1, h2, h3 → acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
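The x → y mapping in the figure can be sketched as a small feed-forward network with sigmoid hidden layers and a linear output; the sizes and weights below are illustrative toys, not the talk's configuration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sizes (hypothetical): 9 linguistic inputs → 3 hidden layers → 5 acoustic outputs
sizes = [9, 16, 16, 16, 5]
Ws = [rng.normal(scale=0.3, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def dnn(x):
    """Feed-forward mapping from linguistic features x to acoustic
    features y: sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    return Ws[-1] @ h + bs[-1]                    # linear output layer

y = dnn(rng.normal(size=9))
```

Run once per frame, this replaces the decision-tree traversal plus Gaussian lookup of the HMM system with a single shared non-linear function.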
Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature; duration prediction supplies frame timing) → DNN (input layer, hidden layers, output layer) applied to the input features of frames 1 ... T → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrated feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? ... no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database              US English female speaker
Training / test data  33000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer; sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)    DNN (layers × units)  Neutral  p value   z value
15.8 (16)  38.5 (4 × 256)        45.7     < 10⁻⁶    −9.9
16.1 (4)   27.2 (4 × 512)        56.8     < 10⁻⁶    −5.1
12.7 (1)   36.6 (4 × 1024)       50.7     < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μq̂, Σq̂)

By setting ∂ log N(Wc; μq̂, Σq̂) / ∂c = 0,

W⊤ Σq̂⁻¹ W c = W⊤ Σq̂⁻¹ μq̂
Speech parameter generation algorithm [9]
[Figure: the system W⊤Σq̂⁻¹W c = W⊤Σq̂⁻¹μq̂ written out elementwise: the band-structured W (interleaved 1 and −1/1 entries), the diagonal Σq̂⁻¹, the state mean sequence μq̂1, μq̂2, ..., μq̂T, and the solution trajectory c1, c2, ..., cT.]
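The derivation above can be sketched numerically. The following toy NumPy example (1-dimensional static features, T = 4 frames, diagonal covariances, illustrative means — not the system in the slides) builds W with interleaved static and delta rows and solves the normal equations for the smooth trajectory c:

```python
import numpy as np

T = 4  # number of frames (toy example, 1-dimensional static feature)

# Build W: maps static features c (T x 1) to stacked [c_t, delta c_t] pairs
# (2T x 1), with delta c_t = c_t - c_{t-1} as defined above.
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0          # static row: c_t
    W[2 * t + 1, t] = 1.0      # delta row:  c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Step-wise state statistics: interleaved static/delta means and variances
mu = np.array([1.0, 0.0, 1.0, 0.0, 3.0, 0.0, 3.0, 0.0])
var = np.full(2 * T, 0.5)
Sigma_inv = np.diag(1.0 / var)

# Normal equations: W' Sigma^-1 W c = W' Sigma^-1 mu
A = W.T @ Sigma_inv @ W
b = W.T @ Sigma_inv @ mu
c = np.linalg.solve(A, b)  # smooth trajectory (A is banded; O(T) in practice)
```

With the step means [1, 1, 3, 3] the solution rises smoothly instead of jumping, which is exactly the effect of the dynamic feature constraints.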
Generated speech parameter trajectory
[Figure: static and dynamic state means and variances, and the smooth generated trajectory c.]
HMM-based speech synthesis [4]
[Figure: system overview.
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral/excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]
Waveform reconstruction
[Figure: source-filter waveform reconstruction. Generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n).]
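A minimal sketch of this source-filter reconstruction (the 50 ms duration, 200 Hz F0, and decaying impulse response are illustrative stand-ins; a real system drives an MLSA-type filter with the generated cepstra):

```python
import numpy as np

fs, f0, dur = 16000, 200.0, 0.05           # sample rate, F0, 50 ms segment
N = int(fs * dur)

# Excitation e(n): pulse train for a voiced segment
period = int(fs / f0)
e = np.zeros(N)
e[::period] = 1.0                          # pulses at the F0 period
# e = np.random.randn(N)                   # unvoiced alternative: white noise

# Linear time-invariant system h(n): toy decaying impulse response standing
# in for the filter defined by the generated spectral parameters
h = 0.9 ** np.arange(64)

# x(n) = h(n) * e(n): convolve the excitation with the impulse response
x = np.convolve(e, h)[:N]
```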
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model from training speakers, followed by adaptation to target speakers.]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: interpolation among HMM sets λ1 ... λ4 with interpolation ratios I(λ′, λ1) ... I(λ′, λ4), yielding a new HMM set λ′.]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
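Interpolating HMM sets amounts to combining their distribution parameters with the ratios I(λ′, λ). One simple scheme (ratio-weighted combination of Gaussian means and variances; other definitions exist, e.g. KL-divergence-based, and all numbers here are illustrative) can be sketched as:

```python
import numpy as np

# Toy state-output Gaussian parameters from four representative voices
means = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 2.5]])
variances = np.array([[1.0, 1.0], [0.5, 2.0], [1.5, 1.0], [1.0, 0.5]])

# Interpolation ratios I(lambda', lambda_i); assumed to sum to one
ratios = np.array([0.4, 0.3, 0.2, 0.1])

# Ratio-weighted combination of the Gaussian parameters for the new voice
mu_interp = ratios @ means
var_interp = ratios @ variances
```

Sliding the ratios between the corners of the simplex morphs the resulting voice between the representative speakers.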
Vocoding issues
• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]
• Phase:
  important, but usually ignored
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics:
  statistics do not vary within an HMM state
• Conditional independence assumption:
  state output probability depends only on the current state
• Weak duration modeling:
  state duration probability decreases exponentially with time

None of these hold for real speech.
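The weak-duration point follows directly from the self-transition probability: an HMM state with self-transition a_ii implies a geometric duration distribution P(d) = a_ii^(d−1) (1 − a_ii), so short durations are always the most probable whatever a_ii is. A quick numeric check (a_ii = 0.8 is arbitrary):

```python
# Geometric state-duration distribution implied by an HMM self-transition.
# P(d) = a_ii**(d-1) * (1 - a_ii): exponentially decaying in d, so the mode
# is always d = 1 regardless of a_ii.
a_ii = 0.8
P = [a_ii ** (d - 1) * (1 - a_ii) for d in range(1, 11)]
```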
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated running spectra over 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Linguistic → acoustic mapping

• Training:
  learn the relationship between linguistic & acoustic features
• Synthesis:
  map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential.
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN mapping linguistic features x through hidden layers h1, h2, h3 to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
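The mapping x → y can be sketched as a plain feed-forward network. All sizes below are illustrative (300 binary + numeric linguistic inputs, three sigmoid hidden layers, a 40-dimensional linear regression output), not the configuration reported in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: linguistic input -> hidden layers -> acoustic output
sizes = [300, 256, 256, 256, 40]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_forward(x):
    """Sigmoid hidden layers; linear output layer (regression to acoustics)."""
    h = x
    for i, (Wl, b) in enumerate(params):
        z = h @ Wl + b
        h = z if i == len(params) - 1 else sigmoid(z)
    return h

x = rng.standard_normal(300)   # stand-in for extracted linguistic features
y = dnn_forward(x)             # stand-in for predicted acoustic statistics
```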
Framework
[Figure: DNN-based framework. TEXT → text analysis → input feature extraction (binary & numeric features, plus duration and frame-position features) → input features at frames 1 ... T → input layer → hidden layers → output layer → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment.]
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in the cepstrum domain [25]
Example of speech parameter trajectories
(w/o grouping questions and numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512)]
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Limitations of DNN-based acoustic modeling
[Figure: one-to-many data samples in the (y1, y2) plane vs. the single NN prediction at their mean]

• Unimodality
  − Humans can speak the same text in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]
[Figure: a 1-dimensional, 2-mixture MDN; the output-layer activations z1 ... z6 are mapped to mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x).]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions (1-dim, 2-mix MDN):

z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm),   w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)
μ1(x) = z3,   μ2(x) = z4
σ1(x) = exp(z5),   σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
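The output-layer mapping above (softmax for weights, identity for means, exponential for standard deviations) can be sketched directly; the helper name and the example activations are illustrative:

```python
import numpy as np

def mdn_split(z, n_mix):
    """Map raw output-layer activations z to GMM parameters, as on the slide:
    softmax -> mixture weights, linear -> means, exp -> standard deviations."""
    logits, means, log_sigmas = np.split(z, [n_mix, 2 * n_mix])
    w = np.exp(logits - logits.max())
    w /= w.sum()                   # softmax: weights are positive and sum to 1
    sigmas = np.exp(log_sigmas)    # exponential keeps the spread positive
    return w, means, sigmas

# 1-dim, 2-mix example: z = [z1, z2, z3, z4, z5, z6]
z = np.array([0.1, -0.3, 1.5, -0.5, 0.0, 0.2])
w, mu, sigma = mdn_split(z, n_mix=2)
```

Subtracting the max logit before exponentiating is the usual numerically stable softmax; it does not change the resulting weights.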
DMDN-based SPSS [27]
[Figure: DMDN-based framework. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, ..., xT → deep MDN producing w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) at every frame → parameter generation → waveform synthesis → SPEECH.]
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS:

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Basic RNN
[Figure: an RNN unrolled over time; recurrent connections carry the hidden state across frames, mapping inputs xt−1, xt, xt+1 to outputs yt−1, yt, yt+1.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block. Candidate updates tanh(Wc·[xt, ht−1] + bc) are written into the linear memory cell ct scaled by the input gate (write); the forget gate resets ct; the output gate reads it out as ht. All gates are sigmoid functions of (xt, ht−1) with biases bi, bf, bo.]
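The gating scheme in the figure can be sketched as a single NumPy step function (peephole connections and the projection layer used in the experiments are omitted; all sizes and the random toy parameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input/forget/output gates around a linear memory cell."""
    v = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ v + p["bi"])      # input gate: write
    f = sigmoid(p["Wf"] @ v + p["bf"])      # forget gate: reset
    o = sigmoid(p["Wo"] @ v + p["bo"])      # output gate: read
    g = np.tanh(p["Wc"] @ v + p["bc"])      # candidate cell update
    c = f * c_prev + i * g                  # linear memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
p = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1
     for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bo", "bc")})

h = c = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):    # run a length-5 toy sequence
    h, c = lstm_step(x, h, c, p)
```

Because the cell update is additive (f·c + i·g) rather than squashed at every step, gradients and information decay far more slowly than in the basic RNN above.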
LSTM-based SPSS [33, 34]

[Figure: LSTM-based framework. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, ..., xT → LSTM → outputs y1, y2, ..., yT → parameter generation → waveform synthesis → SPEECH.]
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in the cepstrum domain [25]
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Speech parameter generation algorithm [9]
Generate most probable state outputs given HMM and words
o = arg maxo
p(o | w λ)
= arg maxo
sum
forallqp(o q | w λ)
asymp arg maxo
maxq
p(o q | w λ)
= arg maxo
maxq
p(o | q λ)P (q | w λ)
Determine the best state sequence and outputs sequentially
q = arg maxq
P (q | w λ)
o = arg maxo
p(o | q λ)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79
Best state sequence
1 2 3π1
a11
a12
a22
a23
a33
b1(ot) b2(ot) b3(ot)
o1 o2 o3 o4 o5 oT
1 1 1 1 2 2 3 3
O
Q
Observation sequence
State sequence
State duration 4 10 5D
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
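The binary-plus-numeric input encoding of such linguistic features can be sketched in a few lines. The phone and POS inventories and the particular features below are illustrative, not the ones used in the experiments:

```python
import numpy as np

# Hypothetical context inventories (illustrative, not from the talk).
PHONES = ["sil", "a", "b", "k", "s"]          # toy phone inventory
POS_TAGS = ["noun", "verb", "adj", "other"]   # toy POS inventory

def linguistic_features(phone, pos, stressed, frame_pos_in_phone, phone_dur_frames):
    """Encode categorical contexts as one-hot (binary) features and
    append numeric features, as in the input layer of a DNN-based system."""
    phone_1hot = np.eye(len(PHONES))[PHONES.index(phone)]
    pos_1hot = np.eye(len(POS_TAGS))[POS_TAGS.index(pos)]
    numeric = np.array([float(stressed), frame_pos_in_phone, phone_dur_frames])
    return np.concatenate([phone_1hot, pos_1hot, numeric])

x = linguistic_features("a", "noun", True, 0.25, 12)
print(x.shape)  # (12,)
```

Real systems use far richer contexts (around 50 types, per the slide); the point is only that categorical contexts become binary features and positional/durational contexts become numeric ones.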
HMM-based acoustic modeling for SPSS [4]
[Figure: decision trees over yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMMs with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
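A minimal sketch of the forward pass such a DNN computes — sigmoid hidden layers, linear output layer — with random weights standing in for a trained model (layer sizes are illustrative):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Feed-forward acoustic model: sigmoid hidden layers, linear output
    layer predicting acoustic features y from linguistic features x."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    W, b = weights[-1], biases[-1]
    return W @ h + b                              # linear output layer

rng = np.random.default_rng(0)
sizes = [30, 64, 64, 64, 40]  # linguistic dim -> 3 hidden layers -> acoustic dim
weights = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.standard_normal(30), weights, biases)
print(y.shape)  # (40,)
```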
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction → input features, including binary & numeric features (with duration and frame-position features), at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction provides the frame alignment]
Advantages of NN-based acoustic modeling
• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure, as in speech production
  − concept → linguistic → articulatory → waveform
Framework
Is this new? No:
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, and computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstral coefficient over frames 0–500 — natural speech vs HMM (α = 1) vs DNN (4×512)]
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar numbers of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  HMM (α)      DNN (layers × units)   Neutral     p         z
  15.8 (16)    38.5 (4 × 256)          45.7     < 10⁻⁶    −9.9
  16.1 (4)     27.2 (4 × 512)          56.8     < 10⁻⁶    −5.1
  12.7 (1)     36.6 (4 × 1024)         50.7     < 10⁻⁶   −11.5
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples in (y_1, y_2) space vs the NN prediction — the NN predicts the conditional mean, which can fall between modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Mixture density network [26]
[Figure: mixture density network output layer producing w_1(x), w_2(x), μ_1(x), μ_2(x), σ_1(x), σ_2(x) for input x]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function
For a 1-dimensional, 2-mixture MDN, with z_j = Σ_{i=1}^{4} h_i w_ij the inputs of the activation functions:

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                μ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → the NN outputs the GMM weights, means & variances.
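The activation choices on this slide can be sketched for a 1-dimensional, n-mixture output layer. The layout of the raw activation vector z (weights, then means, then log-variance slots) is an illustrative convention:

```python
import numpy as np

def mdn_outputs(z, n_mix):
    """Map raw output-layer activations z of a 1-dim, n_mix-component MDN
    to GMM parameters: softmax for weights, identity for means, exp for
    standard deviations (the activation functions named on the slide)."""
    zw, zmu, zsigma = z[:n_mix], z[n_mix:2*n_mix], z[2*n_mix:3*n_mix]
    w = np.exp(zw - zw.max())
    w /= w.sum()                 # mixture weights sum to one
    mu = zmu                     # linear activation for means
    sigma = np.exp(zsigma)       # exponential activation keeps sigma > 0
    return w, mu, sigma

w, mu, sigma = mdn_outputs(np.array([0.2, -0.1, 1.0, -2.0, 0.0, 0.5]), n_mix=2)
print(w.sum())  # 1.0
```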
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → DMDN producing frame-wise GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for t = 1 … T → parameter generation → waveform synthesis → SPEECH]
Experimental setup
• Almost the same as the previous setup
• Differences:

  DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mixtures
  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM             1 mix    3.537 ± 0.113
                  2 mix    3.397 ± 0.115
  DNN             4×1024   3.635 ± 0.127
                  5×1024   3.681 ± 0.109
                  6×1024   3.652 ± 0.108
                  7×1024   3.637 ± 0.129
  DMDN (4×1024)   1 mix    3.654 ± 0.117
                  2 mix    3.796 ± 0.107
                  4 mix    3.766 ± 0.113
                  8 mix    3.805 ± 0.113
                  16 mix   3.791 ± 0.102
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Basic RNN
[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections in the hidden layer]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); gates use sigmoid activations of x_t and h_{t−1}, cell input and output use tanh]
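A minimal, illustrative LSTM step matching the gate structure above. The weight names and shapes are assumptions for the sketch, not taken from the talk:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input (i), forget (f), output (o) gates around a
    linear memory cell c; p holds weight matrices W*, R* and biases b*."""
    i = sigm(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])      # input gate: write
    f = sigm(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])      # forget gate: reset
    g = np.tanh(p["Wc"] @ x + p["Rc"] @ h_prev + p["bc"])   # cell input
    c = f * c_prev + i * g                                   # linear memory cell
    o = sigm(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])      # output gate: read
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
nx, nh = 8, 4
p = {f"W{g}": rng.standard_normal((nh, nx)) * 0.1 for g in "ifco"}
p.update({f"R{g}": rng.standard_normal((nh, nh)) * 0.1 for g in "ifco"})
p.update({f"b{g}": np.zeros(nh) for g in "ifco"})
h, c = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
print(h.shape)  # (4,)
```

Because c is updated additively (f·c_prev + i·g), gradients and information persist over many steps instead of decaying through repeated squashing, which is the property the slide motivates.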
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1 … x_T to y_1 … y_T → parameter generation → waveform synthesis → SPEECH]
Experimental setup
Database: US English female speaker
Train / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral     z        p
   50.0        14.2          –           –          35.8     12.0   < 10⁻¹⁰
    –           –           30.2        15.6        54.2      5.1   < 10⁻⁶
   15.8         –           34.0         –          50.2     −6.2   < 10⁻⁹
   28.4         –            –          33.6        38.0     −1.5     0.138
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Speech parameter generation algorithm [9]
Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)
Best state sequence
[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t), aligned to the observation sequence O = o_1, o_2, …, o_T; state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3; state durations D = 4, 10, 5]
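The duration-to-state-sequence expansion in the alignment above can be sketched as:

```python
def expand_durations(durations):
    """Expand per-state durations into a frame-level state sequence
    (states are 1-indexed, left-to-right, as in the figure)."""
    q = []
    for state, d in enumerate(durations, start=1):
        q.extend([state] * d)
    return q

q = expand_durations([4, 10, 5])
print(len(q), q[:5])  # 19 [1, 1, 1, 1, 2]
```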
Best state outputs w/o dynamic features

[Figure: step-wise trajectory following the state means, with state variances shown as bands]

ô becomes a step-wise mean vector sequence.
Using dynamic features
State output vectors include static & dynamic features:

o_t = [c_tᵀ, ∆c_tᵀ]ᵀ,  ∆c_t = c_t − c_{t−1}

(c_t is M-dimensional, so o_t is 2M-dimensional.)

The relationship between static and dynamic features can be arranged as o = Wc, where W is a sparse band matrix: for each frame t it stacks an identity row block selecting c_t and a difference row block [−I  I] producing ∆c_t = c_t − c_{t−1}.
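A minimal sketch of constructing W for a 1-dimensional static stream (treating c_0 = 0 at the first-frame boundary is an assumption of this sketch):

```python
import numpy as np

def delta_window_matrix(T, M=1):
    """Build the band matrix W mapping static features c to
    o = [c_t; delta c_t] stacked over frames, delta c_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T * M, T * M))
    I = np.eye(M)
    for t in range(T):
        W[2*t*M:(2*t+1)*M, t*M:(t+1)*M] = I           # static row block: c_t
        W[(2*t+1)*M:(2*t+2)*M, t*M:(t+1)*M] = I       # delta row block: +c_t
        if t > 0:
            W[(2*t+1)*M:(2*t+2)*M, (t-1)*M:t*M] = -I  # delta row block: -c_{t-1}
    return W

c = np.array([1.0, 3.0, 2.0])        # static features, T = 3, M = 1
o = delta_window_matrix(3) @ c       # interleaved static/delta: 1, 1, 3, 2, 2, -1
print(o)
```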
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

Wᵀ Σ_q̂⁻¹ W c = Wᵀ Σ_q̂⁻¹ μ_q̂
Speech parameter generation algorithm [9]
[Figure: band structure of the normal equations WᵀΣ_q⁻¹W c = WᵀΣ_q⁻¹μ_q — the banded positive-definite system relates the static feature sequence c_1 … c_T to the state mean sequence μ_q1 … μ_qT]
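A minimal MLPG sketch that solves these normal equations, assuming diagonal covariances, a 1-dimensional static/delta stream, and c_0 = 0 at the boundary (all simplifications of the general case):

```python
import numpy as np

def build_W(T):
    """Band matrix relating static features c to interleaved [static; delta]
    features, with delta c_t = c_t - c_{t-1} (zero before t = 0)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2*t, t] = 1.0              # static
        W[2*t + 1, t] = 1.0          # delta: +c_t
        if t > 0:
            W[2*t + 1, t-1] = -1.0   # delta: -c_{t-1}
    return W

def mlpg(mu, var):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c.
    mu, var: (2T,) interleaved [static, delta] means and variances."""
    T = len(mu) // 2
    W = build_W(T)
    P = np.diag(1.0 / var)           # Sigma^-1 for diagonal covariances
    A = W.T @ P @ W
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)

# Step-wise static means; zero delta means pull c toward a smooth trajectory.
mu = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # frames: (c, dc) x 3
var = np.ones(6)
c = mlpg(mu, var)
print(c.round(3))
```

Real implementations exploit the band structure (e.g., a Cholesky solve on the banded system) instead of forming dense matrices.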
Generated speech parameter trajectory
[Figure: generated static and dynamic speech parameter trajectories — the static trajectory c is smooth, following the state means within the state variances]
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis system.
Training part: speech database → spectral & excitation parameter extraction → training of HMMs on labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → synthesized speech.]
Waveform reconstruction
[Figure: source-filter waveform reconstruction — generated excitation parameters (log F0 with V/UV) drive the excitation e(n) (pulse train when voiced, white noise when unvoiced); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
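A toy pulse/noise excitation generator along the lines of the figure. The frame shift, noise gain, and phase-accumulator pulse placement are illustrative choices, not the talk's implementation:

```python
import numpy as np

def make_excitation(f0, vuv, fs=16000, frame_shift=0.005):
    """Generate e(n): a pulse train at F0 for voiced frames, white noise
    for unvoiced frames (the simple excitation scheme in the figure)."""
    hop = int(fs * frame_shift)               # samples per frame
    rng = np.random.default_rng(0)
    e = np.zeros(len(f0) * hop)
    phase = 0.0
    for t, (f, voiced) in enumerate(zip(f0, vuv)):
        if voiced:
            for n in range(hop):              # pulse every fs / f samples
                phase += f / fs
                if phase >= 1.0:
                    phase -= 1.0
                    e[t * hop + n] = 1.0
        else:
            e[t * hop:(t + 1) * hop] = rng.standard_normal(hop) * 0.1
            phase = 0.0
    return e

e = make_excitation(f0=[200.0] * 10 + [0.0] * 10, vuv=[True] * 10 + [False] * 10)
print(len(e))  # 1600
```

Passing e through the synthesis filter h(n) then yields x(n) = h(n) ∗ e(n); this pulse/noise switching is exactly what makes mixed voiced/unvoiced sounds hard, as noted later under vocoding issues.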
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]
[Figure: average-voice model — adaptive training on training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new voice λ′ obtained by interpolating among HMM sets λ_1 … λ_4 with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
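Mean interpolation, the simplest case of the scheme above, can be sketched as follows; the speaker labels and values are purely illustrative, and full schemes also combine covariances:

```python
import numpy as np

def interpolate_models(means, ratios):
    """Mix representative models by interpolation ratios I(lambda', lambda_k):
    the new voice's mean is the ratio-weighted sum of the models' means."""
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()              # normalize ratios to sum to one
    return sum(r * np.asarray(m) for r, m in zip(ratios, means))

mu_a = [1.0, 2.0]   # hypothetical "speaker A" mean vector
mu_b = [3.0, 6.0]   # hypothetical "speaker B" mean vector
mu_new = interpolate_models([mu_a, mu_b], ratios=[0.25, 0.75])
print(mu_new)       # 0.25 * mu_a + 0.75 * mu_b = [2.5, 5.0]
```

Varying the ratios moves the synthesized voice continuously between the representative models, which is how new voices are obtained without adaptation data.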
Outline
Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) — pulse train for voiced regions, white noise for unvoiced regions]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]
• Phase
  Important, but usually ignored
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State-output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech.
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs generated spectra (0–8 kHz) — the generated spectra lack fine formant structure]
• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combining multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
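A toy stand-in for the global-variance idea: rescale a generated trajectory so its intra-utterance variance matches a target GV while keeping the utterance mean. The actual GV method maximizes a GV-augmented likelihood rather than rescaling; this sketch only illustrates what "recovering variance" means:

```python
import numpy as np

def gv_postfilter(traj, target_gv):
    """Scale a trajectory about its utterance mean so that its global
    (intra-utterance) variance equals target_gv — an illustrative
    simplification of GV-based oversmoothing compensation."""
    mean = traj.mean()
    gv = traj.var()
    return mean + (traj - mean) * np.sqrt(target_gv / gv)

traj = np.array([0.0, 0.2, 0.1, 0.3, 0.2])        # an over-smoothed trajectory
out = gv_postfilter(traj, target_gv=4.0 * traj.var())
print(round(out.var() / traj.var(), 6))  # 4.0
```

Restoring the variance re-sharpens formant movement that the dynamic-feature smoothing flattened, which is why GV-style statistics reduce the muffled quality described above.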
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Best state sequence
[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1, o2, …, oT aligned to the best state sequence Q = 1 1 1 1 2 2 … 3 3 with state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
Best state outputs w/o dynamic features

[Figure: state-output means and variances over time]

ō becomes a step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot = [ct⊤, Δct⊤]⊤,   Δct = ct − ct−1

[Figure: M-dimensional static features …, ct−2, ct−1, ct, ct+1, ct+2, … and dynamic features …, Δct−2, …, Δct+2, … stacked into 2M-dimensional vectors ot]
Relationship between static and dynamic features can be arranged as
o = Wc, i.e.

[…, ct−1⊤, Δct−1⊤, ct⊤, Δct⊤, ct+1⊤, Δct+1⊤, …]⊤ = W […, ct−2⊤, ct−1⊤, ct⊤, ct+1⊤, …]⊤

where W is a sparse band matrix that stacks, for each frame t, a static row block [⋯ 0 I 0 0 ⋯] selecting ct and a dynamic row block [⋯ −I I 0 0 ⋯] forming Δct = ct − ct−1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ō = arg max_o p(o | q, λ)   subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μq, Σq)

By setting ∂ log N(Wc; μq, Σq) / ∂c = 0:

W⊤ Σq⁻¹ W c = W⊤ Σq⁻¹ μq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
W⊤ Σq⁻¹ W c = W⊤ Σq⁻¹ μq

[Figure: band structure of the linear system; W⊤ Σq⁻¹ W is a banded matrix of identity / −identity blocks acting on c = (c1, …, cT)⊤, and the right-hand side applies W⊤ Σq⁻¹ to μq = (μq1, …, μqT)⊤, so c can be obtained efficiently with a banded solver]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
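The banded system above can be sketched numerically. Below is a minimal 1-dimensional toy of the parameter generation step: step-wise state means, a single delta window, and diagonal variances are all illustrative placeholder values, not the values used in the talk.

```python
import numpy as np

# Toy MLPG: solve  W' Sigma^-1 W c = W' Sigma^-1 mu  for the static
# trajectory c, given step-wise static means and zero-mean deltas.
T = 8
mu = np.repeat([0.0, 1.0], [4, 4])     # step-wise static mean sequence
dmu = np.zeros(T)                      # delta means
var = np.full(2 * T, 0.1)              # diagonal Sigma (static & delta interleaved)

# Build W: for each frame t, one static row selecting c_t and one
# delta row forming c_t - c_{t-1}
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                  # static row: [... 0 1 0 ...]
    W[2 * t + 1, t] = 1.0              # delta row:  [... -1 1 0 ...]
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

o_mu = np.empty(2 * T)
o_mu[0::2] = mu
o_mu[1::2] = dmu
P = np.diag(1.0 / var)                 # Sigma^-1

# Normal equations: banded, symmetric positive definite
A = W.T @ P @ W
b = W.T @ P @ o_mu
c = np.linalg.solve(A, b)              # smooth trajectory, no longer step-wise
```

In practice the system is solved with a banded Cholesky factorization rather than a dense solve, since W⊤ Σq⁻¹ W has a narrow bandwidth determined by the delta windows.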
Generated speech parameter trajectory
[Figure: generated static and dynamic speech parameter trajectories c, with state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: system overview. Training part: spectral and excitation parameters are extracted from a SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part: TEXT is converted to labels by text analysis, spectral and excitation parameters are generated from the HMMs, and excitation generation plus a synthesis filter produce the SYNTHESIZED SPEECH signal]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: source-filter model. Generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) or white noise (unvoiced) excitation e(n); a linear time-invariant system h(n), built from the generated spectral parameters (cepstrum, LSP), filters it to give synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
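The source-filter reconstruction in the figure can be sketched in a few lines; the impulse response below is a toy placeholder, not a real MLSA filter, and F0 and frame values are illustrative.

```python
import numpy as np

# Minimal source-filter sketch: excitation e(n) (pulse train for voiced,
# white noise for unvoiced) filtered by an LTI system h(n):
#   x(n) = h(n) * e(n)
fs = 16000
f0 = 200.0
n = np.arange(fs // 10)                        # 100 ms of samples

period = int(fs / f0)
voiced = np.zeros(len(n))
voiced[::period] = 1.0                         # pulse train at F0

rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(len(n)) * 0.1   # white-noise excitation

h = 0.9 ** np.arange(50)                       # toy decaying impulse response

x_voiced = np.convolve(voiced, h)[: len(n)]    # x(n) = h(n) * e(n)
x_unvoiced = np.convolve(unvoiced, h)[: len(n)]
```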
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
 − Flexibility to change voice characteristics (adaptation, interpolation)
 − Small footprint [10, 11]
 − Robustness [12]
• Drawback
 − Quality
• Major factors for quality degradation [3]
 − Vocoder (speech analysis & synthesis)
 − Acoustic model (HMM)
 − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements
Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model from training speakers, followed by adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
 → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new HMM set λ′ interpolated among representative HMM sets λ1, …, λ4 with interpolation ratios I(λ′, λ1), …, I(λ′, λ4)]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
 → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements
Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse / noise excitation
 Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
 Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz]

• Phase
 Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
 Statistics do not vary within an HMM state
• Conditional independence assumption
 State output probability depends only on the current state
• Weak duration modeling
 State duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
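The weak-duration point above follows directly from the self-transition structure: with self-transition probability a, the probability of staying in a state for exactly d frames is geometric, p(d) = a^(d−1)(1 − a), so the most likely duration is always a single frame regardless of a. A tiny check (a = 0.9 is an illustrative value):

```python
# Geometric state-duration distribution implied by an HMM self-transition:
#   p(d) = a**(d - 1) * (1 - a)
# decays exponentially with d, with its mode always at d = 1.
a = 0.9
p = [a ** (d - 1) * (1 - a) for d in range(1, 51)]
```

Hidden semi-Markov models (next slide) replace this implicit geometric distribution with an explicitly modeled one.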
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
 − Trended HMM
 − Polynomial segment model
 − Trajectory HMM
• Conditional independence assumption → graphical model
 − Buried Markov model
 − Autoregressive HMM
 − Trajectory HMM
• Weak duration modeling → explicit duration model
 − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
 − Dynamic feature constraints make generated parameters smooth
 − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz]

• Why?
 − Details of spectral (formant) structure disappear
 − Use of a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
 − Mel-cepstrum
 − LSP
• Nonparametric approach
 − Conditional parameter generation
 − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
 − Global variance (intra-utterance variance)
 − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
 − Flexibility to change voice characteristics (adaptation; interpolation, eigenvoice, CAT, multiple regression)
 − Small footprint
 − Robustness
• Drawback
 − Quality
• Major factors for quality degradation [3]
 − Vocoder (speech analysis & synthesis)
 − Acoustic model (HMM) → neural networks
 − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements
Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS
Summary
Linguistic rarr acoustic mapping
• Training
 Learn the relationship between linguistic & acoustic features
• Synthesis
 Map linguistic features to acoustic ones
• Linguistic features used in SPSS
 − Phoneme, syllable, word, phrase, utterance-level features
 − e.g. phone identity, POS, stress of words in a phrase
 − Around 50 different types, much more than ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision tree over context questions (yes/no splits) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
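The mapping on this slide is a plain feed-forward regression. A minimal sketch, with layer sizes and random weights as placeholders (449 input and 127 output dimensions are assumed for illustration, loosely following the feature counts quoted later in the talk):

```python
import numpy as np

# Toy feed-forward DNN regressing acoustic features y from linguistic
# features x, in place of the decision tree + GMM.
rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

sizes = [449, 1024, 1024, 1024, 1024, 127]   # placeholder dimensions
params = [layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        # ReLU on hidden layers, linear output layer
        h = z if i == len(params) - 1 else np.maximum(z, 0.0)
    return h

x = rng.standard_normal((1, 449))            # one frame of linguistic features
y = forward(x)                               # predicted acoustic statistics
```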
Framework
[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features for frames 1…T) → input layer → hidden layers → output layer → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
 − Can model high-dimensional, highly correlated features efficiently
 − Layered architecture w/ non-linear operations
 → feature extraction integrated into acoustic modeling
• Distributed representation
 − Can be exponentially more efficient than a fragmented representation
 − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
 − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectories over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements
Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: one-to-many data samples in (y1, y2) vs. the NN prediction at their conditional mean]

• Unimodality
 − Humans can speak in different ways → one-to-many mapping
 − NN trained by MSE loss → approximates the conditional mean
• Lack of variance
 − DNN-based SPSS uses variances computed from all training data
 − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer yields w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), defining a mixture density over y]

Inputs of the activation function: zj = Σ_{i=1}^{4} hi wij

Weights → softmax activation function:
 w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)
Means → linear activation function:
 μ1(x) = z3,  μ2(x) = z4
Variances → exponential activation function:
 σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
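The three activation functions above can be sketched directly for the 1-dim, 2-mix case on the slide; the hidden activations and weights below are random placeholders.

```python
import numpy as np

# Toy MDN output layer (1-dim target, 2 mixture components):
# softmax -> weights, identity -> means, exp -> variances (std devs here).
rng = np.random.default_rng(0)
h = rng.standard_normal(4)          # last hidden layer (4 units, as in the figure)
W = rng.standard_normal((4, 6))     # 6 outputs: z1..z6
z = h @ W                           # z_j = sum_i h_i w_ij

w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # w1(x), w2(x): softmax
mu = z[2:4]                                # mu1(x), mu2(x): linear
sigma = np.exp(z[4:6])                     # sigma1(x), sigma2(x): exponential

def mdn_density(y):
    """p(y | x) as a 2-component Gaussian mixture."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(w @ comp)
```

Training would minimize the negative log of this density over the data; the softmax/exp parameterizations keep the weights on the simplex and the variances positive by construction.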
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → per-frame linguistic features x1, x2, …, xT; for each frame the DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
 DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
 DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
 Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
 HMM-based statistical parametric speech synthesis (SPSS)
 Flexibility
 Improvements
Statistical parametric speech synthesis with neural networks
 Deep neural network (DNN)-based SPSS
 Deep mixture density network (DMDN)-based SPSS
 Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
 − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes/syllable stress) is used as input
 − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
 − Each frame is mapped independently
 − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: unrolled RNN; recurrent connections carry the hidden state across frames, mapping inputs xt−1, xt, xt+1 to outputs yt−1, yt, yt+1]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
 − Information in hidden layers loops through recurrent connections → quickly decays over time
 − Prone to being overwritten by new information arriving from the inputs
 → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
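The basic recurrence in the figure is just a hidden state fed back frame by frame. A minimal sketch, with layer sizes and random weights as placeholders:

```python
import numpy as np

# Toy unidirectional RNN: h_t = tanh(x_t Wx + h_{t-1} Wh), y_t = h_t Wy.
# The hidden state carries context from previous frames only.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 4
Wx = rng.standard_normal((d_in, d_h)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1
Wy = rng.standard_normal((d_h, d_out)) * 0.1

def rnn(xs):
    h = np.zeros(d_h)
    ys = []
    for x in xs:                        # frame-by-frame, left to right
        h = np.tanh(x @ Wx + h @ Wh)    # recurrent connection
        ys.append(h @ Wy)
    return np.array(ys)

ys = rnn(rng.standard_normal((10, d_in)))   # 10 frames
```

Repeated multiplication by Wh inside the tanh is exactly why information decays or gets overwritten over long spans, motivating the LSTM on the next slide.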
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A linear memory cell ct is surrounded by an input gate it = sigm(·; xt, ht−1, bi) (write), a forget gate sigm(·; xt, ht−1, bf) (reset), and an output gate sigm(·; xt, ht−1, bo) (read); tanh non-linearities with bias bc are applied to the cell input and to the cell state before producing ht]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
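One step of the block in the figure can be written out directly; all weights below are random placeholders, and this sketch omits peepholes and the projection layer used in the experiments that follow.

```python
import numpy as np

# Toy LSTM step: sigmoid input/forget/output gates around a linear
# memory cell c_t, tanh on cell input and cell output.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate over the concatenation [x_t, h_{t-1}]
Wi, Wf, Wo, Wc = (rng.standard_normal((d_in + d_h, d_h)) * 0.1 for _ in range(4))
bi = bf = bo = bc = np.zeros(d_h)

def lstm_step(x, h_prev, c_prev):
    v = np.concatenate([x, h_prev])
    i = sigm(v @ Wi + bi)                        # input gate: write
    f = sigm(v @ Wf + bf)                        # forget gate: reset
    o = sigm(v @ Wo + bo)                        # output gate: read
    c = f * c_prev + i * np.tanh(v @ Wc + bc)    # linear memory cell
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h))
```

Because the cell update is additive (f·c + i·input) rather than a repeated matrix multiplication, gradients and stored information survive over much longer spans than in the basic RNN.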
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → per-frame linguistic features x1, x2, …, xT → LSTM → acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449; LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN               LSTM              Stats
w/ Δ    w/o Δ     w/ Δ    w/o Δ     Neutral   z      p
50.0    14.2      –       –         35.8      12.0   < 10⁻¹⁰
–       –         30.2    15.6      54.2      5.1    < 10⁻⁶
15.8    –         34.0    –         50.2      −6.2   < 10⁻⁹
28.4    –         –       33.6      38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Best state outputswo dynamic features
Variance Mean
o becomes step-wise mean vector sequence
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
Using dynamic features
State output vectors include static amp dynamic features
ot =[cgtt ∆cgtt
]gt ∆ct = ct minus ctminus1
M Mctminus1 ct+1ctminus2 ct+2ct
∆ctminus1 ∆ct+1∆ctminus2 ∆ct+2∆ct
2M
Relationship between static and dynamic features can be arranged as
o c
ctminus1
∆ ctminus1
ct∆ ctct+1
∆ ct+1
=
middot middot middot
middot middot middotmiddot middot middot 0 I 0 0 middot middot middotmiddot middot middot minusI I 0 0 middot middot middotmiddot middot middot 0 0 I 0 middot middot middotmiddot middot middot 0 minusI I 0 middot middot middotmiddot middot middot 0 0 0 I middot middot middotmiddot middot middot 0 0 minusI I middot middot middotmiddot middot middot
middot middot middot
ctminus2
ctminus1
ctct+1
W
o t
ot+1
otminus1
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  Statistics do not vary within an HMM state
• Conditional independence assumption:
  State output probability depends only on the current state
• Weak duration modeling:
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]
• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combining multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation; eigenvoice; CAT; multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Linguistic → acoustic mapping

• Training:
  Learn the relationship between linguistic & acoustic features
• Synthesis:
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Diagram: decision tree of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Diagram: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
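A forward pass of such an acoustic model can be sketched as follows. Layer sizes and random weights here are purely illustrative assumptions; the real systems described later use sigmoid or ReLU hidden layers and predict the speech-parameter statistics listed on the framework slide:

```python
import numpy as np

def dnn_forward(x, layers):
    """Feed-forward acoustic model: sigmoid hidden layers, linear
    output, mapping a linguistic feature vector x to acoustic
    parameters.  `layers` is a list of (W, b) pairs."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    W, b = layers[-1]
    return W @ h + b                             # linear output layer

rng = np.random.default_rng(1)
dims = [10, 16, 16, 5]     # linguistic dim -> hidden -> acoustic dim (illustrative)
layers = [(rng.normal(scale=0.1, size=(dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
y = dnn_forward(rng.normal(size=10), layers)
print(y.shape)  # acoustic feature vector for one frame
```

In the DNN systems of this talk, one such pass is run per frame, so the linguistic input also carries frame-position and duration features.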
Framework

[Diagram: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature) → DNN (input layer, hidden layers, output layer), applied to input features of frames 1 … T supplied by duration prediction → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representational ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? No:
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers; 256/512/1024/2048 units/layer; sigmoid; continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶  | −9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶  | −5.1
12.7 (1)  | 36.6 (4 × 1,024)     | 50.7    | < 10⁻⁶  | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in the (y1, y2) plane vs. the NN prediction]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
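The conditional-mean behaviour under an MSE loss is easy to demonstrate numerically. The bimodal target below is a hypothetical stand-in for a one-to-many mapping, not data from this work:

```python
import numpy as np

# With a one-to-many mapping, the MSE-optimal constant prediction is
# the conditional mean -- which may lie where no data actually is.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 0.1, 500),   # "way of speaking" 1
                    rng.normal(+2.0, 0.1, 500)])  # "way of speaking" 2
pred = y.mean()          # what an MSE-trained NN approximates
print(round(pred, 1))    # near 0: between the modes, far from either
```

The prediction sits between the two modes, which is exactly the muffled-average failure the slide describes.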
Mixture density network [26]

[Diagram: 1-dim, 2-mix MDN whose output-layer activations z₁ … z₆ parameterize a 2-component mixture density over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m)    w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)
µ₁(x) = z₃                                µ₂(x) = z₄
σ₁(x) = exp(z₅)                           σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
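The output-layer transforms above can be sketched directly. This hypothetical 1-dim, 2-mix example applies the softmax / linear / exponential activations and evaluates the mixture negative log-likelihood that an MDN is trained to minimize:

```python
import numpy as np

def mdn_params(z):
    """Map the 6 linear outputs z of a 1-dim, 2-mix MDN to GMM
    parameters: z[0:2] -> weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> standard deviations (exponential), as on the slide."""
    w = np.exp(z[0:2] - z[0:2].max())
    w = w / w.sum()                      # softmax: w1(x), w2(x)
    mu = z[2:4]                          # mu1(x), mu2(x)
    sigma = np.exp(z[4:6])               # sigma1(x), sigma2(x)
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of target y under the predicted mixture."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])   # hypothetical pre-activations
w, mu, sigma = mdn_params(z)
print(w, mu, sigma)   # equal weights, means at -1 and +1, unit std devs
```

Unlike the MSE case, the two mixture components can sit on the two modes of a one-to-many mapping instead of averaging them.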
DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → deep MDN mapping each frame input x₁ … x_T to mixture parameters w_k(x_t), µ_k(x_t), σ_k(x_t) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mixtures
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM:           1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN:           4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Diagram: recurrent network unrolled over time, mapping x_{t−1}, x_t, x_{t+1} to y_{t−1}, y_t, y_{t+1} via recurrent connections]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Diagram: LSTM block — input gate (write), output gate (read), and forget gate (reset), each a sigmoid of x_t and h_{t−1} with biases b_i, b_o, b_f, control a linear memory cell c_t; the cell input and output pass through tanh to produce h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
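A minimal numpy sketch of one step of the block described above (standard LSTM without peepholes; sizes and the packed weight layout are illustrative assumptions):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: input, forget and output gates around a linear
    memory cell.  W maps [x; h_{t-1}] to the 4 stacked gate
    pre-activations (input, forget, output, cell candidate)."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b       # (4n,)
    i = sigm(z[0 * n:1 * n])                      # input gate (write)
    f = sigm(z[1 * n:2 * n])                      # forget gate (reset)
    o = sigm(z[2 * n:3 * n])                      # output gate (read)
    g = np.tanh(z[3 * n:4 * n])                   # candidate cell input
    c = f * c_prev + i * g                        # linear memory cell
    h = o * np.tanh(c)                            # gated output
    return h, c

rng = np.random.default_rng(0)
d, n = 3, 4                                       # input dim, number of cells
W = rng.normal(scale=0.1, size=(4 * n, d + n))
b = np.zeros(4 * n)
h = np.zeros(n); c = np.zeros(n)
for t in range(5):                                # run a short sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, b)
print(h.shape)
```

Because the cell update c = f·c_prev + i·g is linear, information can persist across many steps instead of decaying through repeated squashing, which is the point of the slide.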
LSTM-based SPSS [33, 34]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping input features x₁ … x_T to acoustic features y₁ … y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10⁻¹⁰
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10⁻⁶
15.8     | –         | 34.0      | –          | 50.2    | −6.2 | < 10⁻⁹
28.4     | –         | –         | 33.6       | 38.0    | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,   ∆c_t = c_t − c_{t−1}

(c_t of dimension M, o_t of dimension 2M.)  The relationship between static and dynamic features can be arranged in matrix form, o = Wc, where W stacks a static row and a delta row per frame:

⎡ c_{t−1}  ⎤   ⎡ ⋯  0   I   0   0  ⋯ ⎤
⎢ ∆c_{t−1} ⎥   ⎢ ⋯ −I   I   0   0  ⋯ ⎥  ⎡ c_{t−2} ⎤
⎢ c_t      ⎥ = ⎢ ⋯  0   0   I   0  ⋯ ⎥  ⎢ c_{t−1} ⎥
⎢ ∆c_t     ⎥   ⎢ ⋯  0  −I   I   0  ⋯ ⎥  ⎢ c_t     ⎥
⎢ c_{t+1}  ⎥   ⎢ ⋯  0   0   0   I  ⋯ ⎥  ⎣ c_{t+1} ⎦
⎣ ∆c_{t+1} ⎦   ⎣ ⋯  0   0  −I   I  ⋯ ⎦
      o                  W                   c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
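The relationship o = Wc above can be made concrete. This sketch builds W for a 1-dimensional static sequence with the first-order delta ∆c_t = c_t − c_{t−1} shown on the slide (the boundary choice c_0 = 0 is an assumption for the example):

```python
import numpy as np

def delta_window_matrix(T):
    """Build W mapping a static sequence c = [c_1 .. c_T] to
    o = [c_1, dc_1, c_2, dc_2, ...] with dc_t = c_t - c_{t-1}
    (c_0 taken as 0).  Shown for 1-dimensional c; higher dimensions
    apply the same pattern per coefficient."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static row: picks out c_t
        W[2 * t + 1, t] = 1.0             # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([1.0, 3.0, 2.0])
o = delta_window_matrix(3) @ c
print(o)  # interleaved statics and deltas: [1. 1. 3. 2. 2. -1.]
```

Each pair of rows of W reproduces one [c_t, ∆c_t] block of the matrix on the slide.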
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian,

p(o | q, λ) = N(o; µ_q, Σ_q)

By setting ∂ log N(Wc; µ_q, Σ_q) / ∂c = 0,

W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ µ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]

W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ µ_q

[Figure: the banded structure of W⊤ Σ_q⁻¹ W and of W⊤ Σ_q⁻¹ µ_q — band matrices built from the static/delta window coefficients (1, −1) and the state means µ_{q,1} … µ_{q,T} — from which the trajectory c₁ … c_T is obtained by solving the linear system]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
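The closed-form solution above is just a banded linear system. A minimal 1-dimensional sketch (toy means and unit variances are assumptions for illustration; the delta window is the first-order ∆c_t = c_t − c_{t−1} from the earlier slide) solves W⊤Σ⁻¹Wc = W⊤Σ⁻¹µ for the static trajectory:

```python
import numpy as np

# Build W for T frames: per frame, one static row and one delta row
# (dc_t = c_t - c_{t-1}, with c_0 taken as 0).
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Toy per-frame means of [c_t, dc_t] and unit variances (illustrative).
mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0])
var = np.ones(2 * T)
P = np.diag(1.0 / var)                 # Sigma^{-1}

A = W.T @ P @ W                        # W^T Sigma^{-1} W  (banded)
b = W.T @ P @ mu                       # W^T Sigma^{-1} mu
c = np.linalg.solve(A, b)              # ML static trajectory
print(c)                               # smooth trajectory near the static means
```

Real implementations exploit the band structure (e.g. Cholesky on the banded system) rather than a dense solve, but the result is the same trajectory.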
Generated speech parameter trajectory

[Figure: generated static and dynamic feature trajectories c, with per-state means and variances overlaid]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]

[Diagram — Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters (excitation generation) & spectral parameters (synthesis filter) → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
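The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched end to end. The decaying impulse response below is a toy assumption, not one of the synthesis filters listed on the next slide, and the constant F0 and fixed voiced/unvoiced split are likewise illustrative:

```python
import numpy as np

fs = 16000
n = np.arange(fs // 10)                       # 100 ms of samples
f0 = 100.0                                    # constant F0 for the voiced part
period = int(fs / f0)

# Excitation: pulse train in the voiced half, white noise in the rest.
e = np.zeros(n.size)
voiced = n < n.size // 2
e[np.where(voiced)[0][::period]] = 1.0        # pulse train (voiced)
rng = np.random.default_rng(0)
e[~voiced] = 0.1 * rng.normal(size=(~voiced).sum())   # white noise (unvoiced)

# Toy LTI synthesis filter h(n): simple exponential decay.
h = 0.97 ** np.arange(64)
x = np.convolve(e, h)[: n.size]               # x(n) = h(n) * e(n)
print(x.shape)
```

In a real system, h(n) would be updated frame by frame from the generated spectral parameters instead of being fixed.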
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; network outputs z1…z6 feed the mixture weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x) of a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

For the 1-dim, 2-mix MDN:

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
µ_1(x) = z_3                                µ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
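The activation choices above can be sketched in a few lines of Python. This is a minimal illustration of a 1-dim, 2-mix MDN output layer; the input vector z is hypothetical:

```python
import numpy as np

def mdn_params(z):
    """Map raw network outputs z = (z1..z6) to the parameters of a
    1-dimensional, 2-mixture GMM, using the activations on the slide:
    softmax for weights, linear for means, exponential for variances."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z[0:2] - z[0:2].max())   # numerically stable softmax
    w = e / e.sum()                     # w1(x), w2(x)
    mu = z[2:4]                         # mu1(x) = z3, mu2(x) = z4
    sigma = np.exp(z[4:6])              # sigma_m(x) = exp(z_m) > 0
    return w, mu, sigma

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 2.0, 0.0, 1.0])
```

The exponential keeps every variance strictly positive without constrained optimization.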
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline; TEXT → text analysis / input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w_m(x_t), µ_m(x_t), σ_m(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced
System                   MOS
HMM, 1 mix               3.537 ± 0.113
HMM, 2 mix               3.397 ± 0.115
DNN, 4×1024              3.635 ± 0.127
DNN, 5×1024              3.681 ± 0.109
DNN, 6×1024              3.652 ± 0.108
DNN, 7×1024              3.637 ± 0.129
DMDN (4×1024), 1 mix     3.654 ± 0.117
DMDN (4×1024), 2 mix     3.796 ± 0.107
DMDN (4×1024), 4 mix     3.766 ± 0.113
DMDN (4×1024), 8 mix     3.805 ± 0.113
DMDN (4×1024), 16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time; inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through hidden layers with recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); the gates take x_t and h_{t−1} through sigmoid units, the cell input and output use tanh, and the block emits h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
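The gating structure can be sketched as a single LSTM step. This is a minimal sketch with hypothetical random weights; real systems use trained parameters, and the deck's LSTM also uses a projection layer:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (no peephole connections).
    W maps the concatenated [x; h_prev] to the stacked pre-activations
    of input gate i, forget gate f, cell candidate g, output gate o."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = 1.0 / (1.0 + np.exp(-z[0:n]))       # input gate: write
    f = 1.0 / (1.0 + np.exp(-z[n:2 * n]))   # forget gate: reset
    g = np.tanh(z[2 * n:3 * n])             # candidate cell update
    o = 1.0 / (1.0 + np.exp(-z[3 * n:]))    # output gate: read
    c = f * c_prev + i * g                  # linear memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
W = 0.1 * rng.standard_normal((8, 5))       # 4 gates x 2 units, 3+2 inputs
h, c = lstm_step(x, np.zeros(2), np.zeros(2), W, np.zeros(8))
```

The cell state update is additive, which is what lets information survive many time steps instead of decaying through repeated squashing.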
LSTM-based SPSS [33, 34]

[Figure: LSTM-based SPSS pipeline; TEXT → text analysis / input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Training / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced
Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
ô = arg max_o p(o | q, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; µ_q, Σ_q)

By setting ∂ log N(Wc; µ_q, Σ_q) / ∂c = 0:

W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} µ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
[Figure: band structure of the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} µ_q; W stacks static rows (… 0 1 0 …) and delta rows (… −1 1 0 …) for each frame, so a banded system relates c_1 … c_T to the means µ_{q1} … µ_{qT}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
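The linear system above can be solved directly for a toy sequence. This is a sketch with made-up static means and identity precision; real systems exploit the band structure with specialized solvers over long sequences:

```python
import numpy as np

T = 5
# W stacks, per frame t, a static row (c_t) and a delta row (c_t - c_{t-1}).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0          # static coefficient
    W[2 * t + 1, t] = 1.0      # delta coefficient
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.zeros(2 * T)
mu[0::2] = [0.0, 1.0, 1.0, 1.0, 0.0]   # toy static means; delta means are 0
prec = np.eye(2 * T)                   # Sigma_q^{-1}, identity for the toy

A = W.T @ prec @ W                     # W^T Sigma^{-1} W
b = W.T @ prec @ mu                    # W^T Sigma^{-1} mu
c = np.linalg.solve(A, b)              # ML trajectory c_1..c_T
```

Because the delta means are zero, the solution trades off hitting the static means against keeping consecutive frames close, which is exactly the smoothing effect shown on the next slide.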
Generated speech parameter trajectory
[Figure: generated trajectory c; static and dynamic feature means and variances per state, with the smooth trajectory c passing through them]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: HMM-based speech synthesis. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: waveform reconstruction; generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n), and generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n), giving synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
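The source-filter reconstruction can be sketched numerically. This is a toy excitation and impulse response, not an actual vocoder:

```python
import numpy as np

fs = 16000                   # sampling rate (Hz)
f0 = 100.0                   # fundamental frequency of the voiced segment
n = fs // 10                 # 100 ms of samples

# Excitation e(n): a pulse train for voiced speech
# (white noise would be used for unvoiced frames).
e = np.zeros(n)
e[::int(fs / f0)] = 1.0      # one pulse per pitch period

# Toy linear time-invariant system h(n): a decaying impulse response
# standing in for the filter built from spectral parameters.
h = 0.95 ** np.arange(200)

x = np.convolve(e, h)[:n]    # x(n) = h(n) * e(n)
```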
Synthesis filter
• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: interpolation among HMM sets λ1…λ4 with ratios I(λ′, λk), giving a new set λ′; λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices without adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
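At its simplest, the interpolation reduces to a convex combination of corresponding model parameters. A sketch with hypothetical mean vectors and ratios:

```python
import numpy as np

# Hypothetical mean vectors of one corresponding Gaussian in four
# representative HMM sets (lambda_1 .. lambda_4).
means = np.stack([np.full(3, v) for v in (0.0, 1.0, 2.0, 3.0)])

# Interpolation ratios I(lambda', lambda_k), summing to one.
ratios = np.array([0.1, 0.2, 0.3, 0.4])

# Mean of the interpolated voice lambda' as the convex combination.
mu_new = ratios @ means
```

Changing the ratios moves the synthetic voice continuously between the representative voices, with no adaptation data from the target.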
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: switched excitation e(n); pulse train for voiced segments, white noise for unvoiced]

• Spectral envelope extraction: harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase: important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
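The weak duration modeling follows directly from the self-transition structure; a quick numeric check with a hypothetical self-transition probability:

```python
# With self-transition probability a, an HMM state's duration follows
# the geometric distribution P(d) = a**(d - 1) * (1 - a): it decays
# exponentially with d, so the most probable duration is always d = 1,
# unlike real phone durations, which peak at some typical length.
a = 0.8
p = [a ** (d - 1) * (1 - a) for d in range(1, 6)]
```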
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs generated spectra over 0–8 kHz; the generated spectrum loses fine formant structure]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
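Replacing the trees and GMMs amounts to a single feed-forward pass per frame. A minimal sketch with hypothetical random weights and layer sizes (the actual systems use trained networks and the configurations listed later):

```python
import numpy as np

def dnn_forward(x, layers):
    """Feed-forward acoustic model: sigmoid hidden layers and a
    linear output layer, mapping linguistic to acoustic features."""
    h = x
    for i, (W, b) in enumerate(layers):
        a = W @ h + b
        h = a if i == len(layers) - 1 else 1.0 / (1.0 + np.exp(-a))
    return h

rng = np.random.default_rng(1)
x = rng.standard_normal(10)   # toy linguistic features (binary & numeric)
layers = [(0.1 * rng.standard_normal((256, 10)), np.zeros(256)),
          (0.1 * rng.standard_normal((256, 256)), np.zeros(256)),
          (0.1 * rng.standard_normal((40, 256)), np.zeros(40))]
y = dnn_forward(x, layers)    # toy acoustic feature vector
```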
Framework
[Figure: DNN-based SPSS framework; TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features at each frame 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral, excitation, and V/UV features) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, and computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints
o = arg maxo
p(o | q λ) subject to o = Wc
If state-output distribution is single Gaussian
p(o | q λ) = N (o microq Σq)
By setting part logN (Wc microq Σq)partc = 0
WgtΣminus1q Wc = WgtΣminus1
q microq
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
Wgt Σminus1q W c
=
Wgt Σminus1q microq
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
1 0 0
0
10 0
-1 1
10
0
-1 1
1
0
0
0
-1 1
0 0
0
0
100 0 0
0
10
0
-11
10 0
-11
1
0
00
-11
00
00
microq1
microq2
microqT
c1
c2
cT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: pulse train (voiced) and white noise (unvoiced) switched to form excitation e(n)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple in the extracted envelope]
• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
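The simple pulse/noise excitation criticized above can be sketched as follows; this is an illustrative toy (real vocoders interpolate F0 per sample and handle frame boundaries more carefully).

```python
import numpy as np

# Classic pulse/noise excitation: a pulse train at F0 during voiced
# frames, white noise during unvoiced ones. Because each frame is
# either all-pulse or all-noise, mixed V/UV sounds cannot be modeled.
def make_excitation(f0_per_frame, fs=16000, frame_shift=0.005, rng=None):
    rng = rng or np.random.default_rng(0)
    spf = int(fs * frame_shift)                 # samples per frame
    e = np.zeros(len(f0_per_frame) * spf)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start = i * spf
        if f0 > 0:                              # voiced: pulses at the pitch period
            period = fs / f0
            if next_pulse < start:
                next_pulse = start
            while next_pulse < start + spf:
                e[int(next_pulse)] = 1.0
                next_pulse += period
        else:                                   # unvoiced: white noise
            e[start:start + spf] = rng.standard_normal(spf)
            next_pulse = start + spf
    return e

f0 = [0, 0, 200, 200, 200, 0]                   # per-frame F0 in Hz (0 = unvoiced)
e = make_excitation(f0)
print(len(e))  # 6 frames x 80 samples = 480
```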
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time
None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
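The weak-duration point can be made concrete: a standard HMM's implicit state-duration distribution is geometric, so longer durations are exponentially less likely — a poor fit for phone durations, which motivates explicit-duration models (HSMMs).

```python
# With self-transition probability a, an HMM state's duration
# distribution is P(d) = a**(d-1) * (1 - a): monotone exponential decay.
a = 0.9  # illustrative self-transition probability
p = [a**(d - 1) * (1 - a) for d in range(1, 6)]
print([round(x, 4) for x in p])  # [0.1, 0.09, 0.081, 0.0729, 0.0656]
```

Real phone durations peak at some typical length rather than at d = 1, which is why the geometric shape is a bad model.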
Better acoustic modeling
• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
[Figure: spectra (0–8 kHz) of natural vs. generated speech; the generated spectrum lacks fine formant detail]
• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
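The global-variance idea above is easy to see numerically: generated trajectories have a smaller per-utterance variance than natural ones, and GV-based generation penalizes that mismatch. A toy comparison (not real speech data):

```python
import numpy as np

# Global variance (GV): per-utterance variance of each speech-parameter
# dimension. Oversmoothing shows up as GV shrinkage.
rng = np.random.default_rng(1)
natural = rng.standard_normal(500)   # stand-in for a natural trajectory
generated = 0.5 * natural            # oversmoothed: detail shrunk toward the mean
gv_nat = np.var(natural)
gv_gen = np.var(generated)
print(gv_gen < gv_nat)  # True: the generated trajectory is less variable
```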
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    Adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Linguistic rarr acoustic mapping
• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree of yes/no context questions clustering states in the acoustic space]
• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
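The mapping the DNN performs can be sketched as a plain forward pass: a linguistic feature vector in, a frame of acoustic statistics out. Layer sizes, the 425/127 dimensions, and the random weights below are illustrative placeholders, not the talk's trained configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Random weights stand in for trained parameters.
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

hidden = [init_layer(425, 512), init_layer(512, 512), init_layer(512, 512)]
W_out, b_out = init_layer(512, 127)  # static + delta + delta-delta acoustic features

def forward(x):
    h = x
    for W, b in hidden:              # sigmoid hidden layers
        h = sigmoid(h @ W + b)
    return h @ W_out + b_out         # linear output layer (regression)

x = rng.standard_normal(425)         # one frame's binary + numeric linguistic features
y = forward(x)
print(y.shape)  # (127,)
```

Each frame is mapped independently; the parameter generation step afterwards turns the predicted statistics into smooth trajectories.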
Framework
[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features for frames 1…T) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame-level features]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? … no
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts, silence frames removed
[Figure: 5th mel-cepstrum coefficient over ~500 frames: natural speech vs. HMM (α = 1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | -9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | -5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | -11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: scatter of data samples in (y1, y2); the NN prediction collapses onto the mean of the modes]
• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
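The unimodality problem can be demonstrated in a few lines: for a one-to-many mapping, the least-squares-optimal prediction is the conditional mean of the targets, which may fall where no real data lies. A toy illustration:

```python
import numpy as np

# Two equally plausible outputs (0 and 1) for the same input.
targets = np.array([0.0, 0.0, 1.0, 1.0])
candidates = np.linspace(0, 1, 101)
# Pick the constant prediction minimizing mean squared error.
mse = [(np.mean((targets - c) ** 2), c) for c in candidates]
best = min(mse)[1]
print(best)  # 0.5 -- the mean, in the "valley" between the two modes
```

A mixture density output layer avoids this by predicting a full distribution instead of a single point.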
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN whose output layer produces w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) for a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / (exp(z1) + exp(z2))
w2(x) = exp(z2) / (exp(z1) + exp(z2))
μ1(x) = z3,  μ2(x) = z4
σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
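The 1-dim, 2-mix output layer above can be sketched directly in numpy; random weights stand in for trained parameters, but the three activation choices (softmax / linear / exp) are exactly those on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(4)             # last hidden layer (4 units, as in the slide)
W = rng.standard_normal((4, 6)) * 0.1  # produces z1..z6
z = h @ W

w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()  # mixture weights: softmax over z1, z2
mu = z[2:4]                                # means: linear (z3, z4)
sigma = np.exp(z[4:6])                     # std devs: exp keeps them positive

print(np.isclose(w.sum(), 1.0), bool((sigma > 0).all()))  # True True
```

The softmax guarantees valid mixture weights and the exponential guarantees positive variances, so every network output is a valid GMM.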
DMDN-based SPSS [27]
[Figure: DMDN pipeline: TEXT → text analysis → input feature extraction → duration prediction → per-frame mixture density outputs w_m(x_t), μ_m(x_t), σ_m(x_t) for t = 1…T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential
Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: recurrent connections unrolled in time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]
• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block with memory cell c_t; input gate (write), output gate (read), and forget gate (reset), each fed by x_t and h_{t−1}; sigmoid gates, tanh input/output nonlinearities, output h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
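One step of the cell in the figure can be sketched as below. Peephole connections and the projection layer used later in the talk's setup are omitted; sizes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def make(n_out):
    # Weights for [x_t, h_{t-1}] -> n_out, plus bias (random placeholders).
    return rng.standard_normal((n_in + n_hid, n_out)) * 0.1, np.zeros(n_out)

Wi, bi = make(n_hid)  # input gate  (write)
Wf, bf = make(n_hid)  # forget gate (reset)
Wo, bo = make(n_hid)  # output gate (read)
Wc, bc = make(n_hid)  # cell candidate

def lstm_step(x, h_prev, c_prev):
    v = np.concatenate([x, h_prev])
    i = sigmoid(v @ Wi + bi)
    f = sigmoid(v @ Wf + bf)
    o = sigmoid(v @ Wo + bo)
    c = f * c_prev + i * np.tanh(v @ Wc + bc)  # linear memory cell update
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(n_hid)
for t in range(5):                    # run a few frames through the cell
    h, c = lstm_step(rng.standard_normal(n_in), h, c)
print(h.shape, c.shape)  # (16,) (16,)
```

Because the cell state c is updated additively (gated by f and i), information can persist over many frames instead of decaying through repeated squashing.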
LSTM-based SPSS [33 34]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x_1 … x_T → LSTM → outputs y_1 … y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ Δ | DNN w/o Δ | LSTM w/ Δ | LSTM w/o Δ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | -6.2 | < 10⁻⁹
– | 28.4 | – | 33.6 | 38.0 | -1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech parameter generation algorithm [9]
Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0:

WᵀΣ_q⁻¹ W c = WᵀΣ_q⁻¹ μ_q
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
Speech parameter generation algorithm [9]
[Figure: band structure of WᵀΣ_q⁻¹W c = WᵀΣ_q⁻¹μ_q — the window matrix W stacks static rows (…, 1, 0, 0, …) and delta rows (…, −1, 1, 0, …) relating the static coefficients c_1 … c_T to the per-state means μ_q1 … μ_qT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
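The linear system above can be solved directly for a toy case. The sketch below uses a 1-dimensional trajectory, T = 4, first-order deltas only (c_t − c_{t−1}), and a diagonal Σ; real systems include second-order deltas and exploit the banded structure of WᵀΣ⁻¹W.

```python
import numpy as np

T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                 # static row: o_t includes c_t
    W[2 * t + 1, t] = 1.0             # delta row: c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Per-row means (static, delta) alternating over frames -- toy values.
mu = np.array([0.0, 0.0, 1.0, 0.1, 1.0, 0.0, 0.0, -1.0])
prec = np.eye(2 * T)                  # Σ⁻¹ (identity for simplicity)

A = W.T @ prec @ W                    # WᵀΣ⁻¹W (banded, SPD)
b = W.T @ prec @ mu                   # WᵀΣ⁻¹μ
c = np.linalg.solve(A, b)             # smoothest trajectory consistent with the deltas
print(c.shape)  # (4,)
```

The solution trades off matching the static means against matching the delta means, which is exactly what produces smooth generated trajectories.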
Generated speech parameter trajectory
[Figure: generated static and dynamic parameter trajectory c, shown against the per-state means and variances]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
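The source-filter reconstruction x(n) = h(n) ∗ e(n) is a plain convolution. In the sketch below a hand-made decaying resonance stands in for the filter that would be derived from generated spectral parameters (e.g., via an MLSA filter); the excitation is a short pulse train.

```python
import numpy as np

fs = 16000
e = np.zeros(400)
e[::80] = 1.0                         # 200 Hz pulse train at 16 kHz (voiced excitation)

n = np.arange(64)
# Toy impulse response: an exponentially decaying 800 Hz resonance.
h = (0.97 ** n) * np.cos(2 * np.pi * 800 * n / fs)

x = np.convolve(e, h)                 # synthesized waveform x(n) = h(n) * e(n)
print(x.shape)  # (463,)
```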
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN — the network maps x to w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights   → softmax activation function
Means     → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
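The activation assignments above can be sketched directly for the slide's 1-dim, 2-mix case (numpy; the function names and the example logits are mine, not from the slides):

```python
import numpy as np

def mdn_params(z):
    """Map 6 network outputs to the parameters of a 1-dim, 2-mix GMM.

    z[0:2] -> mixture weights via softmax (non-negative, sum to 1)
    z[2:4] -> means via linear (identity) activation
    z[4:6] -> standard deviations via exponential (always positive)
    """
    e = np.exp(z[0:2] - np.max(z[0:2]))  # numerically stable softmax
    w = e / e.sum()
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y|x): weighted sum of Gaussian densities."""
    g = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.dot(w, g))

# Hypothetical logits: equal weights, modes at -1 and +1, sigma = exp(-1)
w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, -1.0, -1.0]))
```

Unlike a linear output layer, the resulting density keeps both modes: it is higher at either mode than at their midpoint.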
DMDN-based SPSS [27]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup

• Differences:

  DNN           4–7 hidden layers, 1024 units/hidden layer;
                ReLU (hidden), linear (output)
  DMDN          4 hidden layers, 1024 units/hidden layer;
                ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization  AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per stimulus

• Up to 30 stimuli per subject

• Crowd-sourced

HMM          1 mix     3.537 ± 0.113
             2 mix     3.397 ± 0.115
DNN          4×1024    3.635 ± 0.127
             5×1024    3.681 ± 0.109
             6×1024    3.652 ± 0.108
             7×1024    3.637 ± 0.129
DMDN         1 mix     3.654 ± 0.117
(4×1024)     2 mix     3.796 ± 0.107
             4 mix     3.766 ± 0.113
             8 mix     3.805 ± 0.113
             16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled over time — inputs x(t−1), xt, x(t+1), outputs y(t−1), yt, y(t+1), with recurrent connections between hidden states]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → decays quickly over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell ct with an input gate (write), an output gate (read), and a forget gate (reset); each gate applies a sigmoid to xt and h(t−1), while the cell input and output use tanh]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
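The gating scheme above can be written out as a single forward step. This is a minimal numpy sketch with hypothetical small dimensions; peephole connections and the projection layer used later in the slides are omitted for brevity.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: the gates read [x_t, h_{t-1}]; the cell is linear memory.

    W: dict of (n_in + n_hid, n_hid) matrices and b: dict of (n_hid,) biases,
    keyed 'i' (input gate), 'f' (forget gate), 'o' (output gate), 'c' (cell).
    """
    z = np.concatenate([x, h_prev])
    i = sigm(z @ W['i'] + b['i'])   # write: how much new input enters the cell
    f = sigm(z @ W['f'] + b['f'])   # reset: how much old memory is kept
    o = sigm(z @ W['o'] + b['o'])   # read: how much memory is exposed
    c = f * c_prev + i * np.tanh(z @ W['c'] + b['c'])  # linear memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = {k: 0.1 * rng.standard_normal((n_in + n_hid, n_hid)) for k in 'ifoc'}
b = {k: np.zeros(n_hid) for k in 'ifoc'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                  # run a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the cell update is additive (f·c + i·tanh(·)) rather than a repeated squashing, gradients and stored information survive many more steps than in the basic RNN.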
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database          US English female speaker
Train / dev data  34,632 & 100 sentences
Sampling rate     16 kHz
Analysis window   25-ms width, 5-ms shift
Linguistic        DNN: 449 features
features          LSTM: 289 features
Acoustic          0–39 mel-cepstrum,
features          log F0, 5-band aperiodicity (∆, ∆²)
DNN               4 hidden layers, 1024 units/hidden layer;
                  ReLU (hidden), linear (output);
                  AdaDec [29] on GPU
LSTM              1 forward LSTM layer,
                  256 units, 128 projection;
                  asynchronous SGD on CPUs [35]
Postprocessing    Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference score (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Speech parameter generation algorithm [9]
W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q

[Figure: the band matrix W stacking the static and dynamic (delta) feature windows, the stacked state means μ_q1, μ_q2, …, μ_qT, and the solved static feature trajectory c1, c2, …, cT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
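The linear system above can be solved directly for a toy case. This numpy sketch assumes a 1-dim static stream, a single delta window ∆c_t = 0.5·(c_{t+1} − c_{t−1}), and diagonal covariances — a simplification of the full algorithm, with hypothetical mean values:

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c.

    mu, var: (2T,) means and variances, ordered
    (static_1, delta_1, ..., static_T, delta_T).
    """
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                          # static window row
        if t - 1 >= 0:
            W[2 * t + 1, t - 1] = -0.5             # delta window row
        if t + 1 < T:
            W[2 * t + 1, t + 1] = 0.5
    P = W.T * (1.0 / var)                          # W^T Sigma^-1 (diagonal)
    return np.linalg.solve(P @ W, P @ mu)

# Hypothetical 4-frame example: static means step from 0 to 1, delta means 0
mu = np.array([0, 0, 0, 0, 1, 0, 1, 0], dtype=float)
var = np.ones(8)
c = mlpg(mu, var)   # a smooth, monotone trajectory bridging the step
```

The delta constraints couple neighboring frames, so the solution smooths the abrupt step in the static means rather than copying it frame by frame.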
Generated speech parameter trajectory
[Figure: static and dynamic mean and variance sequences, and the generated smooth trajectory c]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training context-dependent HMMs & state duration models, using labels. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → generated spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
[Figure: pulse train (voiced) / white noise (unvoiced) → excitation e(n) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n); driven by the generated excitation parameters (log F0 with V/UV) and the generated spectral parameters (cepstrum, LSP)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
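The reconstruction above is just a convolution of an excitation with an impulse response. A toy numpy sketch, with a hypothetical F0 and a decaying exponential standing in for the MLSA/LPC-style spectral filter:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
N = 1600                         # 100 ms of signal

# Excitation e(n): pulse train at F0 in the voiced half, noise in the rest
f0 = 200.0
period = int(fs / f0)
e = np.zeros(N)
e[::period] = 1.0                                        # voiced: pulse train
e[N // 2:] = np.random.default_rng(2).normal(0, 0.1, N // 2)  # unvoiced: noise

# Linear time-invariant system h(n) (toy stand-in for the synthesis filter)
h = 0.95 ** np.arange(64)

x = np.convolve(e, h)[:N]        # x(n) = h(n) * e(n)
```

A real vocoder derives h(n) from the generated spectral parameters frame by frame; here a single fixed response keeps the source-filter idea visible.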
Synthesis filter
• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model on training speakers, followed by adaptation to target speakers]

• Train an average-voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: a new voice λ′ located among representative HMM sets λ1–λ4, with interpolation ratios I(λ′, λk)]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
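For Gaussian state-output distributions, interpolating HMM sets amounts to weighting their parameters. A minimal numpy sketch for the means only — a hypothetical two-voice example; the actual schemes [14–17] differ in detail and also handle covariances:

```python
import numpy as np

def interpolate_means(mus, ratios):
    """Mean of a new voice as an interpolation of representative HMM-set means.

    mus:    list of (D,) mean vectors, one per representative HMM set lambda_k
    ratios: interpolation ratios I(lambda', lambda_k), summing to 1
    """
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0)
    return sum(r * mu for r, mu in zip(ratios, np.asarray(mus, dtype=float)))

# Hypothetical 50/50 mix of two voices' spectral means
mu_a = np.array([1.0, 2.0])
mu_b = np.array([3.0, 6.0])
mu_new = interpolate_means([mu_a, mu_b], [0.5, 0.5])
```

Varying the ratios moves the synthetic voice continuously between the representatives, which is why no adaptation data is needed.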
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]

• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make the generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz — the formant structure is visibly flattened in the generated spectrum]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
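The global-variance idea above can be illustrated with a crude post-hoc sketch (numpy, hypothetical values): rescale an oversmoothed trajectory around its mean so its intra-utterance variance matches the variance observed in natural speech. Note the actual GV method optimizes a likelihood that includes a GV term rather than rescaling after the fact.

```python
import numpy as np

def match_global_variance(c, gv_target):
    """Rescale trajectory c so its per-utterance variance equals gv_target."""
    m, v = c.mean(), c.var()
    return m + np.sqrt(gv_target / v) * (c - m)

c = np.array([0.0, 0.1, 0.2, 0.1, 0.0])      # oversmoothed trajectory
c2 = match_global_variance(c, gv_target=0.02)  # restored dynamic range
```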
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types — far more than in ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: TEXT → text analysis → input feature extraction → per-frame input features for frames 1…T (binary & numeric features, duration feature, frame position feature) → DNN (input, hidden, and output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → duration prediction, parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
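The per-frame input assembly in the framework above can be sketched as follows. The tiny phone set, the numeric contexts, and all dimensions here are hypothetical stand-ins for the several hundred real inputs:

```python
import numpy as np

PHONES = ['sil', 'a', 'b']   # hypothetical tiny phone set

def frame_features(phone, numeric, duration, t):
    """Build the frame-level input vector for frame t of a phone.

    binary features : one-hot phone identity (stand-in for the categorical inputs)
    numeric features: e.g. position of the phone in the syllable, stress
    duration feature: predicted phone duration (in frames)
    frame position  : relative position of frame t within the phone
    """
    onehot = np.eye(len(PHONES))[PHONES.index(phone)]
    return np.concatenate([onehot, numeric, [duration, t / duration]])

# Three frames of phone 'a' with duration 3 and two numeric contexts:
X = np.stack([frame_features('a', np.array([0.5, 1.0]), 3, t)
              for t in range(3)])
```

Only the frame-position (and, at synthesis time, predicted-duration) entries change within a phone; the linguistic context is repeated for every frame, which is what makes the mapping frame-by-frame.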
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representational ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, and computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database          US English female speaker
Training / test   33,000 & 173 sentences
Sampling rate     16 kHz
Analysis window   25-ms width, 5-ms shift
Linguistic        11 categorical features,
features          25 numeric features
Acoustic          0–39 mel-cepstrum,
features          log F0, 5-band aperiodicity, ∆, ∆²
HMM               5-state left-to-right HSMM [21],
topology          MSD F0 [22], MDL [23]
DNN               1–5 layers, 256/512/1024/2048 units/layer;
architecture      sigmoid, continuous F0 [24]
Postprocessing    Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Generated speech parameter trajectory
Sta
ticD
ynam
ic
Variance Mean c
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
HMM-based speech synthesis [4]
Training part
Synthesis part
Training HMMs
Context-dependent HMMs amp state duration models
Labels
Spectralparameters
Excitationparameters
TEXT
Labels
SYNTHESIZEDSPEECH
Speech signal
Excitation
Parameter generationfrom HMMs
Excitationgeneration
SynthesisFilter
Text analysis
Spectralparameterextraction
SPEECHDATABASE Excitation
parameterextraction
Spectralparameters
Excitationparameters
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79
Waveform reconstruction
pulse train
white noise
synthesizedspeech
lineartime-invariant
systeme(n)
h(n) x(n) = h(n) lowast e(n)excitation
Generatedexcitation parameter(log F0 with VUV)
Generatedspectral parameter
(cepstrum LSP)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors behind quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)
Effective modeling is essential
HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over the acoustic space, with yes/no context questions at each node]
• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]

[Figure: feedforward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
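The frame-level mapping described here — linguistic features in, acoustic statistics out — amounts to a plain feedforward pass. A self-contained sketch with sigmoid hidden layers and a linear output layer, as in the experiments later in the deck; the layer sizes are toy values, not the experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector to acoustic output statistics.

    Sigmoid hidden layers, linear output layer; sizes are illustrative.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    return weights[-1] @ h + biases[-1]          # linear output layer

sizes = [10, 16, 16, 4]  # toy: linguistic dim -> hidden -> acoustic dim
Ws = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), Ws, bs)
```

A trained network of this shape replaces both the decision-tree clustering and the GMM state-output distributions of the HMM system.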
Framework

[Figure: DNN-based SPSS pipeline — TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature, for frames 1…T; duration prediction feeds the frame-level inputs) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Framework

Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup

Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]
Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed
[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5
Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in the (y1, y2) plane, with the NN prediction falling between the modes]
• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
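The conditional-mean behaviour can be shown in two lines: when one input maps to two equally likely targets, the MSE-optimal prediction is their mean, which matches neither actual sample:

```python
import numpy as np

# One input, two equally likely target outputs (two "ways of speaking").
targets = np.array([-1.0, 1.0])

# The MSE-optimal constant prediction is the conditional mean...
prediction = targets.mean()

# ...which lies between the modes and is far from both actual samples:
errors_to_modes = np.abs(targets - prediction)
```

This is the unimodality problem in miniature: the predicted point is a poor sample from either mode, motivating the mixture density output layer.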
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), which parameterize a 2-component Gaussian mixture over y]

The output-layer activations z_j = Σᵢ hᵢ wᵢⱼ are mapped to GMM parameters:
• Weights → softmax activation:
  w₁(x) = exp(z₁) / (exp(z₁) + exp(z₂)),  w₂(x) = exp(z₂) / (exp(z₁) + exp(z₂))
• Means → linear activation:  μ₁(x) = z₃,  μ₂(x) = z₄
• Variances → exponential activation:  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
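The 1-dim, 2-mix output layer above can be written directly from the activation choices on the slide (softmax for weights, linear for means, exponential for standard deviations); the activation vector here is an arbitrary example:

```python
import numpy as np

def mdn_output_layer(z):
    """Map 6 raw activations to a 1-dim, 2-mixture GMM (w, mu, sigma).

    Follows the slide's activation functions: softmax keeps the weights
    positive and summing to one; exp keeps the std devs positive.
    """
    w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()  # mixture weights (softmax)
    mu = z[2:4]                                # means (linear)
    sigma = np.exp(z[4:6])                     # std devs (exponential)
    return w, mu, sigma

w, mu, sigma = mdn_output_layer(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.5]))
```

Unlike a linear output layer, this head can place probability mass on both modes of a one-to-many mapping instead of averaging them.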
DMDN-based SPSS [27]

[Figure: deep MDN pipeline — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → deep MDN emitting mixture weights w, means μ, and variances σ for each frame → parameter generation → waveform synthesis → SPEECH]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential
Recurrent connections → recurrent NN (RNN) [31]
Basic RNN

[Figure: network with recurrent connections, unrolled over time — inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through the recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
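The decay problem can be demonstrated with the basic recurrence h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h): feed information only at the first step and watch the hidden state shrink. The weights here are toy values chosen to make the decay obvious:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One basic RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(1)
W_xh = rng.standard_normal((3, 2))
W_hh = 0.5 * np.eye(3)          # contractive recurrent weights (toy)
b_h = np.zeros(3)

h = np.zeros(3)
for t in range(20):
    x = rng.standard_normal(2) if t == 0 else np.zeros(2)  # info only at t=0
    h = rnn_step(x, h, W_xh, W_hh, b_h)
# Repeated passes through W_hh and tanh shrink the initial information
# toward zero — the long-range access problem motivating LSTMs.
```

With expansive recurrent weights the same loop explodes instead; either way, plain recurrence handles long-range context poorly.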
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block — input gate (write), output gate (read), and forget gate (reset), each a sigmoid of x_t and h_{t−1} with biases b_i, b_o, b_f, control a linear memory cell c_t; tanh squashing on the cell input (bias b_c) and cell output; the block emits h_t]
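One step of the cell sketched in the figure can be written out directly: three sigmoid gates guard a linear memory cell, so information persists unless the forget gate resets it. Shapes and the parameter layout (weights acting on the concatenation [x; h_prev]) are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: multiplicative gates around a linear memory cell."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["W_i"] @ z + p["b_i"])   # input gate: write
    f = sigmoid(p["W_f"] @ z + p["b_f"])   # forget gate: reset
    o = sigmoid(p["W_o"] @ z + p["b_o"])   # output gate: read
    g = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate cell input
    c = f * c_prev + i * g                 # linear cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
n_in, n_h = 2, 3  # toy sizes
p = {f"W_{k}": rng.standard_normal((n_h, n_in + n_h)) * 0.1 for k in "ifoc"}
p.update({f"b_{k}": np.zeros(n_h) for k in "ifoc"})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
```

Because the cell update c = f·c_prev + i·g is additive rather than squashed through repeated nonlinearities, gradients and stored information survive over far longer ranges than in the basic RNN.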
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Experimental setup

Database             US English female speaker
Train / dev data     34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
HMM-based speech synthesis [4]

[Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction → labels + parameters → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation + synthesis filter → SYNTHESIZED SPEECH]
Waveform reconstruction

[Figure: source-filter model — excitation e(n) (pulse train when voiced, white noise when unvoiced) drives a linear time-invariant system h(n), giving the synthesized speech x(n) = h(n) ∗ e(n); the excitation comes from the generated excitation parameters (log F0 with V/UV), the filter from the generated spectral parameters (cepstrum, LSP)]
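The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched in a few lines: a pulse-train excitation convolved with an impulse response. All numbers here are toy values, not a real vocoder configuration:

```python
import numpy as np

# Source-filter synthesis in miniature.
fs = 16000                   # sampling rate (Hz)
f0 = 200.0                   # assumed fundamental frequency (Hz)
period = int(fs / f0)        # samples per pitch period

e = np.zeros(fs // 10)       # 100 ms of excitation e(n)
e[::period] = 1.0            # voiced: pulse train at F0

h = 0.9 ** np.arange(50)     # toy impulse response h(n) of the LTI system
x = np.convolve(e, h)        # x(n) = h(n) * e(n)
```

For unvoiced segments the pulse train is replaced by white noise; the synthesis filters on the next slide are the practical realizations of h(n) for each spectral parameterization.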
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors behind quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Adaptation (mimicking voice) [13]

[Figure: adaptive training over training speakers yields an average-voice model; adaptation maps it to target speakers]
• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ obtained by interpolating representative HMM sets λ1–λ4 with interpolation ratios I(λ′, λ1)…I(λ′, λ4)]
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
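The interpolation idea can be sketched as a weighted combination of model parameters. A full system interpolates the HMM sets' distributions (means, covariances, weights); this simplified sketch combines only the Gaussian mean vectors of one state, with hypothetical voices and ratios:

```python
import numpy as np

def interpolate_models(means, ratios):
    """Mix voices by interpolating mean vectors of corresponding HMM states.

    Simplified sketch: real interpolation also combines covariances and
    mixture weights. Ratios are assumed to sum to 1.
    """
    ratios = np.asarray(ratios)
    assert abs(ratios.sum() - 1.0) < 1e-9
    return sum(r * m for r, m in zip(ratios, means))

# Hypothetical mean vectors of one state in two voices:
voice_a = np.array([1.0, 2.0])
voice_b = np.array([3.0, 6.0])
mixed = interpolate_models([voice_a, voice_b], [0.5, 0.5])  # halfway voice
```

Sliding the ratios between (1, 0) and (0, 1) morphs continuously between the two voices, which is what makes new voices possible without adaptation data.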
Vocoding issues

• Simple pulse/noise excitation:
  difficult to model mixes of V/UV sounds (e.g. voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]
• Phase:
  important but usually ignored
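The first issue is easy to see in code: the simple excitation makes a hard per-frame choice between pulses and noise, so a voiced fricative (periodic and noisy at once) cannot be represented. A toy sketch with assumed frame sizes:

```python
import numpy as np

def simple_excitation(f0_per_frame, frame_len=80, fs=16000, seed=0):
    """Build the simple pulse/noise excitation criticized above.

    f0_per_frame: one F0 value (Hz) per frame, 0 meaning unvoiced.
    Each frame is EITHER a pulse train OR white noise — never both,
    which is exactly the limitation for mixed V/UV sounds.
    """
    rng = np.random.default_rng(seed)
    out = []
    for f0 in f0_per_frame:
        if f0 > 0:                          # voiced: pulses at the pitch period
            frame = np.zeros(frame_len)
            frame[:: int(fs / f0)] = 1.0
        else:                               # unvoiced: white noise
            frame = rng.standard_normal(frame_len) * 0.1
        out.append(frame)
    return np.concatenate(out)

e = simple_excitation([0, 200.0, 200.0, 0])  # UV, V, V, UV frames
```

The schemes on the next slide (mixed excitation, STRAIGHT, HNM, …) all relax this hard either/or decision.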
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Waveform reconstruction

[Figure: source-filter model. A pulse train (voiced) or white noise (unvoiced) excitation e(n) is passed through a linear time-invariant system h(n) to produce synthesized speech x(n) = h(n) * e(n). The generated excitation parameters (log F0 with V/UV) drive the excitation; the generated spectral parameters (cepstrum, LSP) define the filter.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
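As a minimal example of the last entry (LPC → all-pole filter), here is a direct-form all-pole synthesis filter in numpy; the pole placement is an illustrative single resonance, not taken from the talk.

```python
import numpy as np

def all_pole_filter(e, a):
    """All-pole (LPC) synthesis: x[n] = e[n] - sum_k a[k] * x[n-k],
    i.e. H(z) = 1 / (1 + a_1 z^-1 + ... + a_p z^-p)."""
    p = len(a)
    x = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * x[n - k]
        x[n] = acc
    return x

# Illustrative single resonance near 1 kHz at fs = 16 kHz (conjugate pole pair)
fs, fc, r = 16000, 1000.0, 0.95
theta = 2 * np.pi * fc / fs
a = np.array([-2 * r * np.cos(theta), r * r])   # (1 - r e^{j t} z^-1)(1 - r e^{-j t} z^-1)
impulse = np.zeros(200)
impulse[0] = 1.0
x = all_pole_filter(impulse, a)                 # decaying impulse response
```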
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Adaptation (mimicking voice) [13]

[Figure: an average-voice model is built from the training speakers via adaptive training, then adapted to each target speaker]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new model λ′ interpolated among HMM sets λ1–λ4 with ratios I(λ′, λk); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
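The interpolation idea can be sketched as a weighted combination of model parameters. The numbers below are hypothetical per-stream Gaussian parameters, and the weighted variance is a simple approximation — the cited eigenvoice/CAT/regression methods are more elaborate.

```python
import numpy as np

# Hypothetical Gaussian parameters (2-dim stream) from four representative voices
means = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 2.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [0.5, 0.5]])
ratios = np.array([0.4, 0.3, 0.2, 0.1])     # I(lambda', lambda_k), sums to 1

mu_interp = ratios @ means                  # linear interpolation of means
var_interp = ratios @ variances             # weighted variances (simple approximation)
```

Varying the ratios moves λ′ continuously between the representative voices without any adaptation data.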
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues

• Simple pulse/noise excitation:
difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
harmonic effects often cause problems
[Figure: power (dB) vs. frequency (0–8 kHz) spectrum with harmonic ripple]
• Phase:
important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
statistics do not vary within an HMM state
• Conditional independence assumption:
state output probability depends only on the current state
• Weak duration modeling:
state duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
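The weak-duration point follows directly from the self-transition structure: with self-transition probability a, staying in a state for d frames has probability P(d) = a^(d−1)(1−a), a geometric distribution that is maximal at d = 1 and decays monotonically — unlike real phone durations. A quick numerical check:

```python
import numpy as np

a = 0.9                        # self-transition probability of one HMM state
d = np.arange(1, 51)           # durations 1..50 frames
p = a ** (d - 1) * (1 - a)     # P(d) = a^(d-1) (1 - a): geometric

# Mean duration is about 1 / (1 - a) = 10 frames (slightly less when truncated)
mean_duration = np.sum(d * p) / np.sum(p)
```

The distribution peaks at one single frame, which is why explicit duration models (hidden semi-Markov models) are preferred.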
Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled
[Figure: natural vs. generated spectra over 0–8 kHz; the generated one lacks fine detail]
• Why?
− Details of the spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP
• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis
• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training:
learn the relationship between linguistic & acoustic features
• Synthesis:
map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision trees (yes/no questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: feedforward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework

[Figure: DNN-based SPSS pipeline. TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame-position feature) produces input features for frames 1 … T; the DNN (input layer, hidden layers, output layer) maps them to statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction, parameter generation, and waveform synthesis then yield SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
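The core of this pipeline — a feedforward network regressing frame-level linguistic features onto acoustic features with an MSE loss — can be sketched in numpy. Dimensions and data below are toy stand-ins, not the configurations from the talk's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out, n_frames = 40, 64, 20, 32
X = rng.standard_normal((n_frames, d_in))     # linguistic features per frame (toy)
Y = rng.standard_normal((n_frames, d_out))    # acoustic features per frame (toy)

W1 = rng.standard_normal((d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, d_out)) * 0.1
b2 = np.zeros(d_out)

def forward(X):
    H = np.tanh(X @ W1 + b1)      # non-linear hidden layer
    return H, H @ W2 + b2         # linear output layer -> conditional mean of y

H, Y_hat = forward(X)
loss_before = np.mean((Y_hat - Y) ** 2)

# One MSE gradient-descent step (frame-by-frame mapping, as in DNN-based SPSS)
G = 2 * (Y_hat - Y) / (n_frames * d_out)      # dL/dY_hat
G_h = (G @ W2.T) * (1 - H ** 2)               # backprop through tanh
W2 -= 0.1 * (H.T @ G); b2 -= 0.1 * G.sum(0)
W1 -= 0.1 * (X.T @ G_h); b1 -= 0.1 * G_h.sum(0)

loss_after = np.mean((forward(X)[1] - Y) ** 2)
```

At synthesis time the predicted means (plus precomputed variances) feed the parameter generation algorithm.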
Advantages of NN-based acoustic modeling

• Integrated feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state, left-to-right HSMM [21]; MSD F0 [22]; MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer; sigmoid; continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)   | DNN (layers × units) | Neutral | p value  | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10^-6  | -9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10^-6  | -5.1
12.7 (1)  | 36.6 (4 × 1024)      | 50.7    | < 10^-6  | -11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
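For readers unfamiliar with the z values in such tables: one common analysis of paired preferences is a sign test that ignores the neutral responses. Whether the talk used exactly this test is an assumption, and the counts below are hypothetical, merely shaped like the 38.5% vs. 15.8% row.

```python
import math

def sign_test_z(wins_a, wins_b):
    """z statistic of a sign test on paired preferences, neutrals excluded.
    Under H0 (no preference), wins_a ~ Binomial(n, 0.5)."""
    n = wins_a + wins_b
    return (wins_a - wins_b) / math.sqrt(n)

z = sign_test_z(385, 158)   # hypothetical counts; |z| ~ 9.7, far beyond p = 10^-6
```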
Outline
Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples drawn from a bimodal distribution in the (y1, y2) plane; the NN prediction lies between the modes]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
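The unimodality problem is easy to demonstrate numerically: when the same input has targets from two modes, the MSE-optimal prediction is their mean — a value the data never actually takes.

```python
import numpy as np

# One-to-many data: for one input, targets come from two modes, +1 and -1
targets = np.array([1.0] * 50 + [-1.0] * 50)

# Scan constant predictions and find the MSE-optimal one
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]     # the mean, 0.0: between the modes
```

A mixture density output layer instead predicts both modes with their own weights and variances.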
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the output layer produces w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

• Weights → softmax activation function:
w1(x) = exp(z1) / (exp(z1) + exp(z2)), w2(x) = exp(z2) / (exp(z1) + exp(z2))
• Means → linear activation function:
µ1(x) = z3, µ2(x) = z4
• Variances → exponential activation function:
σ1(x) = exp(z5), σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
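The slide's 1-dim, 2-mix output layer and the resulting negative log-likelihood can be written directly from these formulas; the activation values below are illustrative.

```python
import numpy as np

def mdn_params(z):
    """Map the 6 output-layer activations of a 1-dim, 2-mix MDN to
    mixture weights, means, and standard deviations (as on the slide)."""
    w = np.exp(z[0:2]) / np.sum(np.exp(z[0:2]))   # softmax -> weights
    mu = z[2:4]                                   # linear -> means
    sigma = np.exp(z[4:6])                        # exponential -> std devs
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of scalar target y under the predicted GMM
    (the training loss of an MDN)."""
    w, mu, sigma = mdn_params(z)
    pdf = w / (np.sqrt(2 * np.pi) * sigma) * np.exp(-0.5 * ((y - mu) / sigma) ** 2)
    return -np.log(np.sum(pdf))

z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])  # equal weights, means -1 and +1, sigma 1
nll = mdn_nll(z, 1.0)                          # y sits on the second mode
```

Minimizing this loss lets the network keep both modes instead of collapsing to their mean.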
DMDN-based SPSS [27]

[Figure: DMDN pipeline. TEXT → text analysis → input feature extraction → duration prediction gives frame-level inputs x1 … xT; for each frame xt the DMDN outputs mixture parameters w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt) over y; parameter generation and waveform synthesis then yield SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS:

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: network with recurrent connections in the hidden layer, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
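The causality of a unidirectional RNN is visible in a minimal numpy forward pass: perturbing a future input cannot change any earlier output. Weights and data here are random toys.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, W_hy):
    """Unidirectional RNN: h_t depends on x_t and h_{t-1}, so each output
    y_t can use all previous context but none of the future."""
    T = X.shape[0]
    h = np.zeros(W_hh.shape[0])
    Y = []
    for t in range(T):
        h = np.tanh(X[t] @ W_xh + h @ W_hh)   # recurrent hidden update
        Y.append(h @ W_hy)
    return np.array(Y)

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4)) * 0.5
W_hh = rng.standard_normal((4, 4)) * 0.5
W_hy = rng.standard_normal((4, 2)) * 0.5
X = rng.standard_normal((10, 3))

Y = rnn_forward(X, W_xh, W_hh, W_hy)
X2 = X.copy()
X2[9] += 1.0                                  # perturb only the last frame
Y2 = rnn_forward(X2, W_xh, W_hh, W_hy)        # outputs before t=9 are unchanged
```

A bidirectional RNN removes this limitation by adding a second pass running backwards in time.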
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A tanh-transformed input (from x_t, h_{t−1}, bias b_c), scaled by the input gate i_t (sigmoid of x_t, h_{t−1}, b_i), is added to the memory cell c_t; the forget gate (sigmoid of x_t, h_{t−1}, b_f) scales the previous cell state; the output gate (sigmoid of x_t, h_{t−1}, b_o) scales tanh(c_t) to give h_t]

Input gate: write
Output gate: read
Forget gate: reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
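One step of the block above can be written out directly; this is a plain LSTM cell as described on the slide (no peepholes or projection), with toy random weights.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step: a linear memory cell c guarded by multiplicative gates.
    P holds weight matrices W_* acting on [x; h_prev] and biases b_*."""
    z = np.concatenate([x, h_prev])
    i = sigm(P["W_i"] @ z + P["b_i"])               # input gate: write
    f = sigm(P["W_f"] @ z + P["b_f"])               # forget gate: reset
    o = sigm(P["W_o"] @ z + P["b_o"])               # output gate: read
    c = f * c_prev + i * np.tanh(P["W_c"] @ z + P["b_c"])   # linear cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
P = {k: rng.standard_normal((n_cell, n_in + n_cell)) * 0.5
     for k in ["W_i", "W_f", "W_o", "W_c"]}
P.update({k: np.zeros(n_cell) for k in ["b_i", "b_f", "b_o", "b_c"]})

h, c = np.zeros(n_cell), np.zeros(n_cell)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, P)
```

Because the cell update is additive (f · c + i · input) rather than repeatedly squashed, gradients survive over many more time steps than in a basic RNN.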
LSTM-based SPSS [33, 34]

[Figure: LSTM pipeline. TEXT → text analysis → input feature extraction → duration prediction produces frame-level inputs x1, x2, …, xT; a recurrent (LSTM) network maps them to acoustic features y1, y2, …, yT; parameter generation and waveform synthesis then yield SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10^-10
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10^-6
15.8     | –         | 34.0      | –          | 50.2    | -6.2 | < 10^-9
28.4     | –         | –         | 33.6       | 38.0    | -1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Synthesis filter
bull Cepstrum rarr LMA filter
bull Generalized cepstrum rarr GLSA filter
bull Mel-cepstrum rarr MLSA filter
bull Mel-generalized cepstrum rarr MGLSA filter
bull LSP rarr LSP filter
bull PARCOR rarr all-pole lattice filter
bull LPC rarr all-pole filter
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation
minus Small footprint [10 11]minus Robustness [12]
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed
[Figure: 5-th mel-cepstrum trajectory over frames 0–500; natural speech vs. HMM (α=1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar numbers of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples vs. NN prediction in (y1, y2) space]
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output units z1–z6 parameterize w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]
Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function
Inputs of activation function (1-dim, 2-mix MDN):
z_j = Σ_{i=1}^{4} h_i w_{ij}
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3    μ2(x) = z4
σ1(x) = exp(z5)    σ2(x) = exp(z6)
NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
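The three activation functions above can be checked with a small numeric sketch (the helper `mdn_output` and its toy inputs are illustrative assumptions, not from the slides): it maps the six linear outputs z1…z6 of the 1-dim, 2-mix MDN to valid GMM parameters.

```python
import numpy as np

def mdn_output(z):
    """Map the six pre-activation outputs z1..z6 of a 1-dim, 2-mix MDN
    to GMM parameters: softmax -> weights, identity -> means,
    exp -> (positive) standard deviations."""
    z = np.asarray(z, dtype=float)
    logits = z[0:2] - z[0:2].max()            # numerically stable softmax
    w = np.exp(logits) / np.exp(logits).sum() # mixture weights, sum to 1
    mu = z[2:4]                               # linear activation -> means
    sigma = np.exp(z[4:6])                    # exponential -> positive std devs
    return w, mu, sigma

w, mu, sigma = mdn_output([0.5, -0.5, 1.2, -0.3, 0.0, -1.0])
```

By construction the weights form a valid distribution and the variances are strictly positive, which is exactly what lets the network's outputs be plugged into a GMM likelihood during training.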
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline: TEXT → text analysis → input feature extraction → duration prediction → DMDN predicting per-frame GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for x1, x2, …, xT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113 | 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127 | 5×1024 3.681 ± 0.109 | 6×1024 3.652 ± 0.108 | 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117 | 2 mix 3.796 ± 0.107 | 4 mix 3.766 ± 0.113 | 8 mix 3.805 ± 0.113 | 16 mix 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN/DMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding / succeeding contexts (e.g. ±2 phonemes / syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN with recurrent connections, unrolled over inputs xt−1, xt, xt+1 and outputs yt−1, yt, yt+1]
• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block: linear memory cell ct; input gate (write), forget gate (reset) and output gate (read) are sigmoids of (xt, ht−1) with biases bi, bf, bo; the cell input and the read-out producing ht pass through tanh]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
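The gating structure in the figure can be sketched as one forward step of a minimal LSTM cell (a toy sketch with illustrative sizes and a single stacked weight matrix `W`; peephole connections and other variants of [32] are omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One step of a minimal LSTM cell: gates are sigmoids of [x_t, h_prev];
    the memory cell is updated linearly, gated by input/forget gates;
    the output gate reads the cell out through tanh."""
    z = W @ np.concatenate([x_t, h_prev]) + b  # all four pre-activations at once
    n = len(h_prev)
    i = sigmoid(z[0 * n:1 * n])   # input gate  (write)
    f = sigmoid(z[1 * n:2 * n])   # forget gate (reset)
    o = sigmoid(z[2 * n:3 * n])   # output gate (read)
    g = np.tanh(z[3 * n:4 * n])   # candidate cell input
    c_t = f * c_prev + i * g      # linear memory cell update
    h_t = o * np.tanh(c_t)        # gated read-out
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_cell_step(rng.standard_normal(n_in), h, c, W, b)
```

The linear cell update `f * c_prev + i * g` is what lets information survive many steps without the exponential decay of a plain recurrent layer.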
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x1, x2, …, xT → y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Adaptation (mimicking voice) [13]
[Figure: adaptive training of an average-voice model on training speakers, then adaptation to target speakers]
• Train average-voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style → small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1–λ4 with ratios I(λ′, λ) yielding a new voice λ′]
λ: HMM set; I(λ′, λ): interpolation ratio
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Vocoding issues
• Simple pulse / noise excitation
Difficult to model mix of V/UV sounds (e.g. voiced fricatives)
[Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]
• Spectral envelope extraction
Harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz]
• Phase
Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
Statistics do not vary within an HMM state
• Conditional independence assumption
State output probability depends only on the current state
• Weak duration modeling
State duration probability decreases exponentially with time
None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
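The "decreases exponentially" claim follows from the geometric duration distribution implied by a state's self-transition probability; a tiny numeric check (toy value a = 0.9, not from the slides):

```python
# With self-transition probability a, an HMM state is occupied for exactly
# d frames with probability P(d) = a**(d-1) * (1 - a): a geometric
# distribution, i.e. exponentially decaying in d.
a = 0.9
p = [a ** (d - 1) * (1 - a) for d in range(1, 51)]
mean_duration = 1.0 / (1.0 - a)  # geometric mean duration: 10 frames here
```

An HSMM [21] replaces this implicit geometric distribution with an explicitly modeled duration distribution, which is why it appears in the experimental setups above.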
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled
[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]
• Why?
− Details of spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
− Mel-cepstrum
− LSP
• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis
• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation, eigenvoice, CAT, multiple regression
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary
Linguistic rarr acoustic mapping
• Training: learn relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision tree partitioning the acoustic space via yes/no questions]
• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
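The replacement of trees and GMMs by a single network can be sketched as one forward pass (a toy sketch: layer sizes, random weights, and ReLU hidden units are illustrative assumptions; the experiments in this deck use sigmoid or ReLU hidden layers with a linear output):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def dnn_forward(x, weights, biases):
    """Minimal DNN acoustic model: maps a linguistic feature vector x
    through ReLU hidden layers to a linear output layer, which predicts
    the (conditional mean of the) acoustic feature vector."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ h + b_out

# toy sizes: 10 linguistic features -> two hidden layers -> 5 acoustic features
sizes = [10, 16, 16, 5]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), weights, biases)
```

Because the output layer is linear and training minimizes MSE, such a network predicts a conditional mean per frame; that is exactly the limitation the mixture density output layer later addresses.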
Framework
[Figure: DNN-based SPSS framework: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, plus duration and frame-position features, for frames 1…T) → DNN (input, hidden, output layers) → statistics (mean & var) of the speech parameter vector sequence (spectral, excitation & V/UV features) → parameter generation → waveform synthesis → SPEECH; frame alignment comes from duration prediction]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Adaptation (mimicking voice) [13]
Average-voice model
AdaptiveTraining
Adaptation
Training speakers Target speakers
bull Train average voice model (AVM) from training speakers using SAT
bull Adapt AVM to target speakers
bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
• Training
Learn the relationship between linguistic & acoustic features
• Synthesis
Map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase & utterance-level features
− e.g., phone identity, POS & stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)
⇒ Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees (yes/no splits) partitioning the acoustic space]
• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
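The mapping above can be sketched as a plain feed-forward pass. This is an illustrative NumPy sketch with made-up layer sizes and random, untrained weights, not the actual system; the sigmoid hidden layers and linear output match the setup described on the later experimental slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dnn(sizes):
    """Random weights for a feed-forward net (illustrative only)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Map a linguistic feature vector x to acoustic features y.

    Hidden layers use a sigmoid; the output layer is linear, so the
    net predicts the conditional mean of y given x.
    """
    h = x
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        h = z if i == len(params) - 1 else 1.0 / (1.0 + np.exp(-z))
    return h

# Toy dimensions: 36 linguistic inputs -> 3 hidden layers -> 127 acoustic outputs
dnn = init_dnn([36, 512, 512, 512, 127])
x = rng.standard_normal(36)      # binary & numeric linguistic features
y = forward(dnn, x)              # acoustic feature vector for one frame
print(y.shape)                   # (127,)
```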
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric features, duration & frame-position features at frames 1…T) → DNN (input, hidden & output layers) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → duration prediction, parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? → No
• NN [19]
• RNN [20]
What's the difference?
• More layers, data & computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical, 25 numeric
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions & numeric contexts; silence frames removed)
[Figure: 5th mel-cepstrum trajectories over frames 0–500 — natural speech, HMM (α=1), DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with similar numbers of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    -11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples forming two modes in (y1, y2) space, with the NN prediction falling between them]
• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances
⇒ Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
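The conditional-mean behaviour of an MSE-trained network can be seen with a toy one-to-many example. The data below are hypothetical, purely for illustration: one input maps equally often to two distinct outputs, and the MSE-optimal prediction is their average, which matches neither mode.

```python
import numpy as np

# One input x maps to two distinct outputs (two "ways of speaking"):
# y = +1 or y = -1, each half the time.
targets = np.array([1.0, -1.0] * 500)

# The MSE-optimal constant prediction is the conditional mean ...
y_hat = targets.mean()          # 0.0

# ... which lies between the modes and is a sample nobody produced.
mse_at_mean = np.mean((targets - y_hat) ** 2)
mse_at_mode = np.mean((targets - 1.0) ** 2)
print(y_hat)        # 0.0
print(mse_at_mean)  # 1.0
print(mse_at_mode)  # 2.0 — predicting an actual mode has higher MSE
```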
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN — the output layer yields w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x), defining a mixture density over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation function: zj = Σi hi wij (sum over the 4 hidden units)

w1(x) = exp(z1) / Σm exp(zm)    w2(x) = exp(z2) / Σm exp(zm)
µ1(x) = z3                      µ2(x) = z4
σ1(x) = exp(z5)                 σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
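The activation functions above can be sketched for a 1-dim, 2-mix MDN output layer. This is a minimal NumPy sketch; the six raw activations z1–z6 below are illustrative values, not outputs of a trained network.

```python
import numpy as np

def mdn_output_layer(z):
    """Split the 6 raw activations of a 1-dim, 2-mix MDN output layer
    into GMM parameters, using the activations from the slide:
    softmax for weights, identity for means, exp for variances."""
    z = np.asarray(z, dtype=float)
    logits, means, log_sigmas = z[0:2], z[2:4], z[4:6]
    w = np.exp(logits - logits.max())
    w = w / w.sum()                      # w1(x), w2(x): sum to 1
    sigma = np.exp(log_sigmas)           # σ1(x), σ2(x): positive
    return w, means, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a 2-component Gaussian mixture."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.dot(w, comp))

# Illustrative raw activations z1..z6
w, mu, sigma = mdn_output_layer([0.2, -0.1, 1.5, -1.5, 0.0, 0.3])
print(w.sum())   # 1.0 — valid mixture weights
# The density is higher near the first component mean (1.5) than at 0:
print(mdn_density(1.5, w, mu, sigma) > mdn_density(0.0, w, mu, sigma))  # True
```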
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS framework — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM       1 mix     3.537 ± 0.113
          2 mix     3.397 ± 0.115
DNN       4×1024    3.635 ± 0.127
          5×1024    3.681 ± 0.109
          6×1024    3.652 ± 0.108
          7×1024    3.637 ± 0.129
DMDN      1 mix     3.654 ± 0.117
(4×1024)  2 mix     3.796 ± 0.107
          4 mix     3.766 ± 0.113
          8 mix     3.805 ± 0.113
          16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts
  (e.g., ±2 phonemes/syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
⇒ Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: unrolled RNN — inputs x(t−1), x(t), x(t+1) → hidden layer with recurrent connections → outputs y(t−1), y(t), y(t+1)]
• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block — memory cell c(t) updated linearly; input gate (write), output gate (read) & forget gate (reset), each a sigmoid of (x(t), h(t−1)); tanh non-linearities on the cell input & output]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
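One step of the gated cell update described above can be sketched as follows. This is a minimal NumPy sketch with random, untrained weights and made-up dimensions, showing only the standard LSTM recurrence (sigmoid gates, tanh nonlinearities, linear cell update).

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: multiplicative gates around a linear memory cell."""
    v = np.concatenate([x, h_prev])                      # [x_t, h_{t-1}]
    i = sigm(p["Wi"] @ v + p["bi"])                      # input gate  (write)
    f = sigm(p["Wf"] @ v + p["bf"])                      # forget gate (reset)
    o = sigm(p["Wo"] @ v + p["bo"])                      # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ v + p["bc"])  # linear cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
nx, nh = 8, 4   # toy input / hidden sizes
p = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(nh) for k in ("bi", "bf", "bo", "bc")})

h = c = np.zeros(nh)
for t in range(10):                 # run over a short input sequence
    h, c = lstm_step(rng.standard_normal(nx), h, c, p)
print(h.shape)                      # (4,)
```

Because the cell state c is updated additively (gated by f and i) rather than squashed every step, information can survive many timesteps — the "better memory" the slide refers to.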
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS framework — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Training / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN (w/ ∆)  DNN (w/o ∆)  LSTM (w/ ∆)  LSTM (w/o ∆)  Neutral    z      p
50.0        14.2         –            –             35.8      12.0   < 10⁻¹⁰
–           –            30.2         15.6          54.2       5.1   < 10⁻⁶
15.8        –            34.0         –             50.2      -6.2   < 10⁻⁹
28.4        –            –            33.6          38.0      -1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Adaptation (mimicking voice) [13]
[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]
• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79
Interpolation (mixing voice) [14 15 16 17]
[Figure: interpolation among HMM sets λ1–λ4 with ratios I(λ′, λ1)–I(λ′, λ4) yielding a new model λ′; λ: HMM set, I(λ′, λ): interpolation ratio]
• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
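A deliberately simplified view of the interpolation on this slide, blending only mean vectors with ratios that sum to one (real systems interpolate full HMM sets, including covariances and weights; the "voices" below are made-up 3-dim vectors):

```python
import numpy as np

def interpolate_means(means, ratios):
    """Blend the mean vectors of K representative models with
    interpolation ratios I(λ', λ_k) that sum to one."""
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0)
    return np.tensordot(ratios, np.asarray(means), axes=1)

# Four "voices", each summarized here by a toy 3-dim mean vector.
voices = [np.array([1.0, 0.0, 0.0]),
          np.array([0.0, 1.0, 0.0]),
          np.array([0.0, 0.0, 1.0]),
          np.array([1.0, 1.0, 1.0])]

# A new voice λ' obtained without any adaptation data:
new_voice = interpolate_means(voices, [0.4, 0.3, 0.2, 0.1])
print(new_voice)   # [0.5 0.4 0.3]
```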
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues
• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
[Figure: excitation e(n) switching between a pulse train (voiced) and white noise (unvoiced)]
• Spectral envelope extraction
Harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]
• Phase
Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
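The pulse/noise excitation scheme can be sketched as follows. This is an illustrative sketch only; the frame shift and F0 values are made up, and the point is the hard voiced/unvoiced switch that makes mixed sounds like voiced fricatives hard to model.

```python
import numpy as np

def pulse_noise_excitation(f0, voiced, fs=16000, frame_shift=80):
    """Classic vocoder excitation e(n): a pulse train at F0 for
    voiced frames, white noise for unvoiced ones."""
    rng = np.random.default_rng(0)
    e = np.zeros(len(f0) * frame_shift)
    next_pulse = 0.0
    for k, (hz, v) in enumerate(zip(f0, voiced)):
        start = k * frame_shift
        if not v:
            e[start:start + frame_shift] = rng.standard_normal(frame_shift)
            next_pulse = start + frame_shift   # restart pulse phase
            continue
        period = fs / hz                       # samples between glottal pulses
        while next_pulse < start + frame_shift:
            e[int(next_pulse)] = 1.0
            next_pulse += period
    return e

# 20 voiced frames at 200 Hz followed by 20 unvoiced frames
e = pulse_noise_excitation([200.0] * 20 + [0.0] * 20,
                           [True] * 20 + [False] * 20)
print(len(e))      # 3200 samples (40 frames × 80-sample shift)
```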
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
Statistics do not vary within an HMM state
• Conditional independence assumption
State output probability depends only on the current state
• Weak duration modeling
State duration probability decreases exponentially with time
None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → explicit duration model
− Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Interpolation (mixing voice) [14 15 16 17]
λ1
λ2
λ3
λ4
λprime
I(λprimeλ1)I(λprimeλ2)
I(λprimeλ3)I(λprimeλ4)
λ HMM setI(λprimeλ) Interpolation ratio
bull Interpolate representive HMM sets
bull Can obtain new voices wo adaptation data
bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switches between a pulse train (voiced) and white noise (unvoiced)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]
• Phase
  Important but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
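The simple pulse/noise excitation criticized above can be sketched in a few lines. This is a minimal illustration, not the talk's vocoder; the sample rate matches the experiments (16 kHz) but `f0_hz` and the frame length are illustrative values.

```python
import random

def excitation(f0_hz, fs=16000, n_samples=160, voiced=True):
    """Classic pulse/noise excitation: a periodic pulse train when voiced,
    white noise when unvoiced -- the simple scheme whose limits are noted above."""
    if not voiced:
        return [random.gauss(0.0, 1.0) for _ in range(n_samples)]  # white noise
    period = int(round(fs / f0_hz))  # samples between glottal pulses
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

# A mixed V/UV sound (e.g. a voiced fricative) fits neither branch,
# which is exactly the modeling difficulty listed on the slide.
e = excitation(200.0, voiced=True)  # one pulse every 80 samples at 16 kHz
```

Because each frame is forced to be either fully voiced or fully unvoiced, sounds that combine periodicity and noise are poorly represented, motivating the mixed-excitation vocoders on the next slide.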
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: spectrograms (0–8 kHz) of natural vs. generated speech]
• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    Adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees (yes/no questions) cluster states over the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feedforward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
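The mapping on this slide is an ordinary feedforward pass. The sketch below is a toy illustration with hypothetical layer sizes (5 linguistic inputs, two hidden layers, 3 acoustic outputs), not the architecture used in the experiments; it shows sigmoid hidden layers and a linear output layer, the combination evaluated later in the talk.

```python
import math
import random

def layer(x, weights, biases, act):
    """One fully connected layer: act(Wx + b)."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def dnn(x, params):
    """Sigmoid hidden layers, linear output: linguistic features -> acoustic features."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    for W, b in params[:-1]:
        x = layer(x, W, b, sigmoid)
    W, b = params[-1]
    return layer(x, W, b, lambda z: z)  # linear output layer

random.seed(0)
def init(n_in, n_out):
    """Random small weights and zero biases (toy initialization)."""
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

# Hypothetical sizes: 5 linguistic inputs -> 8 -> 8 -> 3 acoustic outputs
params = [init(5, 8), init(8, 8), init(8, 3)]
y = dnn([1.0, 0.0, 0.0, 0.5, 0.2], params)
```

In the actual systems the input vector mixes binary answers to context questions with numeric features, and the output collects the per-frame acoustic statistics shown on the next slide.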
Framework
[Figure: end-to-end flow -- TEXT → text analysis → input feature extraction (binary features, numeric features, duration feature, frame-position feature; duration prediction supplies the frame alignment) → neural network (input layer, hidden layers, output layer), mapping input features at frames 1…T to statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? — No
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512) trajectories]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: scatter of bimodal data samples in the (y1, y2) plane; the NN prediction falls between the two modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
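A tiny numeric check of the unimodality point above: with one-to-many data whose targets split evenly between −1 and +1, the MSE-optimal constant prediction is their mean, 0, which matches neither mode. The grid search stands in for training; the data values are invented for illustration.

```python
# One input, two equally likely target values: a one-to-many mapping.
targets = [-1.0, 1.0, -1.0, 1.0]

def mse(pred):
    """Mean squared error of a constant prediction against the targets."""
    return sum((t - pred) ** 2 for t in targets) / len(targets)

# Search constant predictions on a grid; the minimizer is the conditional mean.
best = min((k / 100.0 for k in range(-200, 201)), key=mse)
# best is 0.0: the "compromise" output lies between the two modes, and the
# residual variance (here 1.0) is left completely unmodeled.
```

This is exactly why the slide proposes a mixture density output layer: a two-component mixture can place mass at both −1 and +1 instead of averaging them.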
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Inputs of activation function: z_j = Σ_{i=1..4} h_i w_ij

Weights → softmax activation function:
  w1(x) = exp(z1) / (exp(z1) + exp(z2)),  w2(x) = exp(z2) / (exp(z1) + exp(z2))

Means → linear activation function:
  μ1(x) = z3,  μ2(x) = z4

Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
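The slide's activation choices can be written out directly. This sketch assumes the 1-dim, 2-mix case with the six network outputs z1…z6 ordered as above (weights, means, then log standard deviations); it computes only the output-layer transformation, not the network itself.

```python
import math

def mdn_params(z):
    """1-dim, 2-mix mixture density output layer:
    softmax -> mixture weights, linear -> means, exp -> standard deviations."""
    denom = math.exp(z[0]) + math.exp(z[1])
    w = [math.exp(z[0]) / denom, math.exp(z[1]) / denom]  # w1(x), w2(x)
    mu = [z[2], z[3]]                                     # mu1(x), mu2(x)
    sigma = [math.exp(z[4]), math.exp(z[5])]              # sigma1(x), sigma2(x)
    return w, mu, sigma

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
```

The softmax guarantees the weights sum to 1 and the exponential keeps every standard deviation strictly positive, so any raw network output yields a valid GMM.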
DMDN-based SPSS [27]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN, whose mixture density output layer emits per-frame GMM parameters w_k(x_t), μ_k(x_t), σ_k(x_t) for t = 1…T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture:  4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density output (1–16 mix)
Optimization:      AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: network with recurrent connections, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → hidden state → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
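The decay problem can be seen with a one-unit recurrence. This is an illustrative sketch with scalar weights chosen by hand (real layers use weight matrices): feed an impulse, then silence, and watch the hidden state shrink.

```python
import math

def rnn_step(x_t, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """Elman-style recurrence: h_t = tanh(w_x*x_t + w_h*h_{t-1} + b).
    Scalar weights for clarity; real layers use matrices."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# An impulse followed by silence: information looping through the
# recurrent connection decays geometrically at each step.
h = 0.0
trace = []
for x_t in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h = rnn_step(x_t, h)
    trace.append(h)
```

After five steps the initial input has almost vanished from the state, which is the "quickly decays over time" behavior the slide describes and the LSTM is designed to avoid.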
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block -- memory cell c_t with tanh input/output nonlinearities; input gate i_t (sigmoid) gates writes, forget gate (sigmoid) resets the cell, output gate (sigmoid) gates reads into h_t; each gate takes x_t, h_{t−1}, and a bias]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
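One step of the block above can be sketched as follows. Scalar weights are used for clarity (real cells use weight matrices, and variants add peephole connections); `p` maps each gate to a hypothetical (w_x, w_h, bias) triple.

```python
import math

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with the gates shown on the slide."""
    i = sigm(p['i'][0] * x + p['i'][1] * h_prev + p['i'][2])       # input gate: write
    f = sigm(p['f'][0] * x + p['f'][1] * h_prev + p['f'][2])       # forget gate: reset
    o = sigm(p['o'][0] * x + p['o'][1] * h_prev + p['o'][2])       # output gate: read
    g = math.tanh(p['g'][0] * x + p['g'][1] * h_prev + p['g'][2])  # cell input
    c = f * c_prev + i * g   # linear memory cell: additive update,
    h = o * math.tanh(c)     # so stored information decays far more slowly
    return h, c

p = {k: (1.0, 0.5, 0.0) for k in 'ifog'}
h, c = lstm_step(1.0, 0.0, 0.0, p)
```

The key difference from the plain RNN is the additive cell update c = f·c_prev + i·g: with the forget gate near 1 and the input gate near 0, the cell simply holds its value, giving the "better memory" the slide claims.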
LSTM-based SPSS [33 34]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x_1 … x_T → LSTM → acoustic features y_1 … y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN (w/ ∆)   DNN (w/o ∆)   LSTM (w/ ∆)   LSTM (w/o ∆)   Neutral   z      p
50.0         14.2          –             –              35.8      12.0   < 10⁻¹⁰
–            –             30.2          15.6           54.2      5.1    < 10⁻⁶
15.8         –             34.0          –              50.2      −6.2   < 10⁻⁹
28.4         –             –             33.6           38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Vocoding issues
bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)
Unvoiced Voiced
pulse train
white noise
e(n)
excitation
bull Spectral envelope extractionHarmonic effect often cause problem
0
40
80
0 2 4 6 8 [kHz]
Pow
er [d
B]
bull PhaseImportant but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: unrolled RNN — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
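The recurrence sketched above can be written as h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h), with a linear readout y_t = W_hy h_t + b_y. A minimal numpy sketch (weight shapes and dimensions are illustrative, not from the talk):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple unidirectional RNN over a sequence of input vectors.

    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)   # hidden state carries past context
    y_t = W_hy @ h_t + b_y                          # per-frame output
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        outputs.append(W_hy @ h + b_y)
    return np.stack(outputs)
```

Because h_t is squashed and re-mixed at every step, information about distant past inputs decays quickly, which is exactly the limitation the slide points to.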
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
[Figure: LSTM block — a linear memory cell c_t with tanh input/output squashing, controlled by sigmoid gates fed by x_t and h_{t−1}: input gate (write), forget gate (reset), output gate (read)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → linguistic features x_1, x_2, …, x_T → LSTM → acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Training / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Vocoding issues
• Simple pulse/noise excitation
  Difficult to model mixed voiced/unvoiced (V/UV) sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switches between a pulse train (voiced) and white noise (unvoiced)]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]
• Phase
  Important, but usually ignored
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
Better vocoding
• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
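The weak-duration point can be made concrete: with self-transition probability a, the implicit state-duration distribution of an HMM state is geometric, P(d) = a^(d−1)(1 − a), which is maximal at d = 1 and decays exponentially. A quick check with an illustrative value of a:

```python
# Implicit state-duration distribution of an HMM state with
# self-transition probability a (value illustrative):
#   P(d) = a**(d-1) * (1 - a)
# Short stays are always the most likely, unlike real phone durations,
# which tend to peak at some typical length.
a = 0.9

def duration_prob(d: int) -> float:
    return a ** (d - 1) * (1 - a)

probs = [duration_prob(d) for d in range(1, 201)]
```

Hidden semi-Markov models (next slide) replace this geometric distribution with an explicitly modeled one.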
Better acoustic modeling
• Piece-wise constant statistics → dynamical models
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical models
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration models
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: spectrograms (0–8 kHz) of natural vs. generated speech; the generated one lacks fine spectral detail]
• Why?
  − Details of spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approaches
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combining multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics
    Adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors behind quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees (yes/no splits) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
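The mapping above, from linguistic features x through hidden layers to acoustic features y, is a plain feed-forward pass; a minimal sketch (layer sizes and weights are illustrative; sigmoid hidden units as in the experiments later in the deck, linear output):

```python
import numpy as np

def dnn_predict(x, weights, biases):
    """Map a linguistic feature vector x to acoustic features y.

    Hidden layers use sigmoid activations; the output layer is linear,
    producing mean acoustic parameters for one frame.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ h + b_out                    # linear output layer
```

At synthesis time this is evaluated once per frame, which is what the framework figure on the next slide unrolls over frames 1…T.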
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features, plus duration and frame-position features, for frames 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
  → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representations
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectories over ~500 frames — natural speech vs. HMM (α=1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: scatter of bimodal data samples in the (y1, y2) plane; the NN prediction falls between the two modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
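The unimodality problem is easy to demonstrate: for one-to-many data, the MSE-optimal prediction is the conditional mean, which can land between the modes where almost no data lives. An illustrative check using the simplest MSE-optimal predictor, a constant equal to the target mean (values are made up for the demonstration):

```python
import numpy as np

# One-to-many data: for the same input, half the targets sit near +1
# and half near -1 (two equally valid "ways of speaking" the same text).
rng = np.random.default_rng(0)
targets = np.concatenate([rng.normal(+1.0, 0.05, 500),
                          rng.normal(-1.0, 0.05, 500)])

# The MSE-optimal constant prediction is the mean of the targets ...
prediction = targets.mean()
# ... which lands near 0: between the two modes, in a low-density region.
```

A mixture density output layer avoids this by predicting a full multi-modal distribution instead of a single mean.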
Mixture density network [26]
[Figure: 1-dim, 2-mixture MDN — the output layer produces w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), which define a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation function:
  z_j = Σ_{i=1}^{4} h_i w_{ij}

1-dim, 2-mix MDN:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)     w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
  μ_1(x) = z_3                                 μ_2(x) = z_4
  σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ the NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
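The 1-dim, 2-mixture example above maps six raw network outputs z_1…z_6 to valid GMM parameters via the three activation functions; a minimal sketch of that output layer:

```python
import numpy as np

def mdn_params(z):
    """Map raw outputs z (length 6) of a 1-dim, 2-mix MDN to GMM parameters.

    z[0:2] -> softmax     -> mixture weights (non-negative, sum to 1)
    z[2:4] -> linear      -> component means
    z[4:6] -> exponential -> component std devs (strictly positive)
    """
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma
```

The softmax and exponential mappings are what guarantee the outputs form a valid mixture density regardless of the raw z values.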
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → linguistic features x_1, x_2, …, x_T; at each frame t the DMDN outputs mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t) over y → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mixtures
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Better vocoding
bull Mixed excitation linear prediction (MELP)
bull STRAIGHT
bull Multi-band excitation
bull Harmonic + noise model (HNM)
bull Harmonic stochastic model
bull LF model
bull Glottal waveform
bull Residual codebook
bull ML excitation
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79
Limitations of HMMs for acoustic modeling
bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state
bull Conditional independence assumptionState output probability depends only on the current state
bull Weak duration modelingState duration probability decreases exponentially with time
None of them hold for real speech
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
bull Piece-wise constatnt statistics rarr Dynamical model
minus Trended HMMminus Polynomial segment modelminus Trajectory HMM
bull Conditional independence assumption rarr Graphical model
minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM
bull Weak duration modeling rarr Explicit duration model
minus Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
bull Speech parameter generation algorithm
minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled
Nat
ural
0 4 8Frequency (kHz)
Gen
erat
ed
0 4 8Frequency (kHz)
bull Why
minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through a hidden layer with recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs

→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
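The decay problem the slide describes is easy to see in a scalar Elman-style recurrence (toy weights chosen for illustration only):

```python
import math

def rnn_step(x_t, h_prev, w_xh=1.0, w_hh=0.5, b_h=0.0):
    """One Elman-style step: h_t = tanh(w_xh*x_t + w_hh*h_prev + b_h).
    Scalar weights for clarity; real models use weight matrices."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b_h)

def rnn_forward(xs):
    """Run the recurrence over a sequence and return all hidden states."""
    h, hs = 0.0, []
    for x_t in xs:
        h = rnn_step(x_t, h)
        hs.append(h)
    return hs

# The influence of the first input shrinks at every step --
# the decay that motivates the LSTM introduced on the next slide.
impulse = rnn_forward([1.0, 0.0, 0.0, 0.0])
```

With these toy weights, `impulse` shrinks monotonically (roughly 0.76, 0.36, 0.18, 0.09): information injected at t=1 has nearly vanished a few frames later.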
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — the memory cell c_t receives x_t and h_{t−1} through a tanh unit scaled by the input gate (write); the forget gate (reset) scales the previous cell value; the output gate (read) scales the tanh-squashed cell output h_t; all gates use sigmoid activations]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
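The gate equations the figure depicts can be written out directly (a scalar sketch with hand-set parameters; production LSTMs are vectorized and, as the later setup slide notes, add a projection layer):

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with input (i), forget (f) and output (o) gates
    around a linear memory cell c.  p holds scalar weights/biases."""
    i = sigm(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])       # write gate
    f = sigm(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])       # reset gate
    o = sigm(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])       # read gate
    g = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate
    c = f * c_prev + i * g          # linear cell update: the "better memory"
    h = o * math.tanh(c)            # gated, squashed output
    return h, c
```

With the forget gate saturated open and the input gate shut, the cell value persists almost unchanged across steps — exactly the long-range retention a plain RNN lacks.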
LSTM-based SPSS [33 34]
[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x1, x2, …, xT fed frame-by-frame to the LSTM → acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10^-10
–          –           30.2        15.6         54.2      5.1    < 10^-6
15.8       –           34.0        –            50.2      −6.2   < 10^-9
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Limitations of HMMs for acoustic modeling
• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: the state output probability depends only on the current state
• Weak duration modeling: the state duration probability decreases exponentially with time

None of these hold for real speech.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
Better acoustic modeling
• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79
Oversmoothing
• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of natural vs. generated speech — the generated spectra are visibly oversmoothed]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation
• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
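The global-variance idea can be illustrated with a toy postprocessing step (a simplified sketch only — not the likelihood-based GV method of the literature): rescale a generated trajectory so its intra-utterance variance matches a target variance estimated from natural speech.

```python
def variance(xs):
    """Population variance of a sequence."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def match_global_variance(traj, target_var):
    """Scale deviations around the utterance mean so the
    trajectory's variance equals target_var (toy GV compensation)."""
    m = sum(traj) / len(traj)
    v = variance(traj)
    if v == 0.0:
        return list(traj)
    s = (target_var / v) ** 0.5
    return [m + s * (x - m) for x in traj]
```

An oversmoothed trajectory stretched this way keeps its mean but recovers the dynamic range that generation flattened — the intuition behind GV compensation.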
Characteristics of SPSS
• Advantages
  − Flexibility to change voice characteristics (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic rarr acoustic mapping
• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential.
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
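The input side of this mapping can be sketched as follows (the feature inventory below is a hypothetical miniature, for illustration only): categorical contexts become one-hot binary features, while counts and positions stay numeric, producing the fixed-length vector x fed to the acoustic model.

```python
PHONES = ["sil", "a", "b", "k"]        # hypothetical tiny phone inventory
POS_TAGS = ["noun", "verb", "other"]   # hypothetical POS tag set

def linguistic_features(phone, pos, stressed, syl_pos, n_syls):
    """Encode one frame's linguistic context as a flat feature vector:
    one-hot for categorical contexts, raw floats for numeric ones."""
    x = [1.0 if p == phone else 0.0 for p in PHONES]     # phone identity
    x += [1.0 if t == pos else 0.0 for t in POS_TAGS]    # part of speech
    x.append(1.0 if stressed else 0.0)                   # lexical stress
    x += [float(syl_pos), float(n_syls)]                 # numeric positions
    return x
```

A real SPSS front end produces on the order of hundreds of such dimensions (449 for the DNN system described later in the talk), but the encoding principle is the same.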
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees (yes/no context questions) partition the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
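A minimal feed-forward sketch of this mapping (illustrative toy weights; the systems in the talk use 1–5 layers of 256–2048 units): sigmoid hidden layers with a linear output, matching the DNN setup described a few slides later.

```python
import math

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(x, W, b):
    """y = W x + b, with W given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def dnn_forward(x, layers):
    """layers: list of (W, b, activation) triples.
    'sigmoid' layers are hidden; 'linear' is the output layer."""
    h = x
    for W, b, act in layers:
        h = affine(h, W, b)
        if act == "sigmoid":
            h = sigmoid_vec(h)
    return h
```

In the DNN-based SPSS framework the input x would be the binary-plus-numeric linguistic feature vector and the output the acoustic parameter statistics for one frame.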
Framework
[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, plus duration and frame-position features, for each frame 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction determines the number of frames]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? → No

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, and computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical, 25 numeric
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions & numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum trajectories over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones having similar numbers of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10^-6    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10^-6    −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10^-6    −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples in the (y1, y2) plane form two clusters; the NN prediction falls between them]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
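The unimodality problem is easy to demonstrate numerically (a toy sketch): when the target y is −1 or +1 with equal probability, the constant prediction minimizing squared error is their mean, 0 — a value the data never takes, which is why an MSE-trained NN averages over modes.

```python
def mse(pred, samples):
    """Mean squared error of a single constant prediction."""
    return sum((pred - y) ** 2 for y in samples) / len(samples)

def best_constant_prediction(samples, candidates):
    """Return the candidate prediction minimizing MSE."""
    return min(candidates, key=lambda p: mse(p, samples))

samples = [-1.0, 1.0, -1.0, 1.0]               # bimodal one-to-many targets
cands = [c / 100.0 for c in range(-150, 151)]  # candidate constant outputs
best = best_constant_prediction(samples, cands)
# 'best' is the conditional mean 0.0: between the modes, never observed.
```

A mixture density output layer avoids this collapse by letting the network place probability mass on both modes instead of averaging them.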
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech; the generated spectrum is visibly smoother]

• Why?
− Details of spectral (formant) structure disappear
− Use of better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
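As a toy illustration of the "global variance" idea above: a minimal numpy sketch that measures the intra-utterance variance of one parameter trajectory and rescales an oversmoothed trajectory to match a natural-speech target. The trajectories are synthetic, and this plain rescaling is only a crude stand-in for actual GV-constrained generation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic trajectories of one speech parameter over an utterance.
natural = rng.standard_normal(500)
generated = 0.5 * natural + 0.05 * rng.standard_normal(500)  # oversmoothed

def gv(traj):
    """Global variance: intra-utterance variance of the trajectory."""
    return traj.var()

def match_gv(traj, target_gv):
    """Crude compensation: rescale around the mean to hit the target GV."""
    m = traj.mean()
    return m + (traj - m) * np.sqrt(target_gv / gv(traj))

restored = match_gv(generated, gv(natural))
```

The oversmoothed trajectory has a visibly reduced global variance; rescaling restores it exactly, at the cost of ignoring where in the utterance the detail was lost.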
Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics
  Adaptation, interpolation, eigenvoice, CAT, multiple regression
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no context questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
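The per-frame mapping on this slide can be sketched as a plain feed-forward pass. This is a minimal numpy illustration, not the talk's implementation: the layer sizes and the random weights (standing in for trained parameters) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Illustrative sizes: linguistic feature vector in, acoustic
# feature vector (e.g. mel-cepstrum, log F0, aperiodicity) out.
sizes = [425, 1024, 1024, 1024, 1024, 127]

# Random weights stand in for trained parameters.
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_acoustic_model(x):
    """Map one frame of linguistic features x to acoustic features."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)   # hidden layers: non-linear
    W, b = params[-1]
    return h @ W + b          # output layer: linear

x = rng.standard_normal(425)  # one frame of input features
y = dnn_acoustic_model(x)
```

At synthesis time this function is applied frame by frame over the whole utterance, and the outputs are fed to parameter generation.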
Framework

[Figure: DNN-based TTS pipeline. TEXT → text analysis → input feature extraction (binary & numeric features at frames 1…T, plus duration and frame-position features; duration prediction supplies the alignment) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → Integrates feature extraction into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? ... no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in the (y1, y2) plane with two modes; the NN prediction falls between them]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
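The unimodality problem above shows up even in a toy numpy experiment: with a one-to-many target (two equally likely modes, values made up for illustration), the squared-error-optimal prediction is the conditional mean, which lies in a region where no data actually sits.

```python
import numpy as np

rng = np.random.default_rng(1)

# One input value, two equally likely output "modes"
# (two equally valid ways of speaking the same text).
y = np.where(rng.random(10000) < 0.5, -1.0, 1.0)

# The least-squares-optimal constant prediction is the conditional mean...
y_hat = y.mean()

# ...which falls between the modes, in a low-density region,
# yet still achieves lower MSE than predicting either mode.
mse_mean = np.mean((y - y_hat) ** 2)
mse_mode = np.mean((y - 1.0) ** 2)
```

MSE training therefore pushes the network toward an output that is never observed, which is exactly what the mixture density output layer is meant to fix.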
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network's six outputs z1…z6 parameterize the mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of the density over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

For the 1-dim, 2-mix MDN:

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
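The output-layer mapping above translates directly into code. Below is a minimal numpy sketch of a 1-dim, 2-mix MDN head, assuming the same z1…z6 layout as the slide; the example z values are made up.

```python
import numpy as np

def mdn_params(z):
    """Turn the 6 raw outputs z of a 1-dim, 2-mix MDN into GMM parameters.

    z[0:2] -> mixture weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> standard deviations (exponential, so always positive).
    """
    e = np.exp(z[0:2] - z[0:2].max())   # stabilized softmax
    w = e / e.sum()
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of a scalar target y under the mixture
    (the training loss of an MDN)."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.2, -0.3, 1.0, -1.0, 0.0, 0.1])  # illustrative raw outputs
w, mu, sigma = mdn_params(z)
```

Training minimizes `mdn_nll` over all frames, so the network can place probability mass on several output modes instead of averaging them.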
DMDN-based SPSS [27]

[Figure: pipeline — TEXT → text analysis → input feature extraction → duration prediction → deep MDN maps each frame's input x_t (t = 1…T) to mixture parameters w1(x_t), w2(x_t), μ1(x_t), μ2(x_t), σ1(x_t), σ2(x_t) over y → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: network unrolled in time — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
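The unrolled computation above can be sketched in a few lines of numpy. Sizes and random weights are illustrative stand-ins for a trained model; note that y_t depends only on x_1…x_t, which is exactly why a bidirectional variant is needed to use future context.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 8, 16, 4

# Random weights stand in for trained parameters.
W_xh = rng.standard_normal((n_in, n_hid)) * 0.1
W_hh = rng.standard_normal((n_hid, n_hid)) * 0.1   # recurrent connections
W_hy = rng.standard_normal((n_hid, n_out)) * 0.1

def rnn_forward(xs):
    """Run a forward (unidirectional) RNN over the sequence xs."""
    h = np.zeros(n_hid)
    ys = []
    for x in xs:                        # y_t sees only x_1..x_t
        h = np.tanh(x @ W_xh + h @ W_hh)
        ys.append(h @ W_hy)
    return np.array(ys)

xs = rng.standard_normal((20, n_in))    # T = 20 frames
ys = rnn_forward(xs)
```

Because the hidden state is squashed and rewritten every step, information from early frames decays quickly, motivating the LSTM on the next slide.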
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a memory cell c_t with an input gate i_t (write), an output gate (read), and a forget gate (reset); each gate applies a sigmoid to x_t and h_{t−1} (biases b_i, b_f, b_o), while the cell input and output pass through tanh (bias b_c), yielding h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
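The gating described above can be written out directly. Below is a minimal numpy sketch of one LSTM step (no peepholes and no projection layer, unlike the 256-unit/128-projection setup used later in the talk); sizes and random weights are illustrative.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
n_in, n_cell = 8, 16

# One weight matrix per gate / cell input, acting on [x_t, h_{t-1}].
Wi, Wf, Wo, Wc = (rng.standard_normal((n_in + n_cell, n_cell)) * 0.1
                  for _ in range(4))
bi = bf = bo = bc = np.zeros(n_cell)

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: gates decide what to write, reset, and read."""
    v = np.concatenate([x, h_prev])
    i = sigm(v @ Wi + bi)                        # input gate  (write)
    f = sigm(v @ Wf + bf)                        # forget gate (reset)
    o = sigm(v @ Wo + bo)                        # output gate (read)
    c = f * c_prev + i * np.tanh(v @ Wc + bc)    # linear memory cell
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(n_cell)
for x in rng.standard_normal((5, n_in)):         # run a short sequence
    h, c = lstm_step(x, h, c)
```

The key design choice is the additive cell update `c = f * c_prev + ...`: the cell carries information forward linearly, so it decays only when the forget gate chooses to reset it.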
LSTM-based SPSS [33, 34]

[Figure: pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps the input sequence x1, x2, …, xT to acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database              US English female speaker
Train / dev set data  34,632 & 100 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   DNN: 449, LSTM: 289
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                   4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                  1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Oversmoothing compensation
bull Postfiltering
minus Mel-cepstrumminus LSP
bull Nonparametric approach
minus Conditional parameter generationminus Discrete HMM-based speech synthesis
bull Combine multiple-level statistics
minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
Characteristics of SPSS
bull Advantages
minus Flexibility to change voice characteristics
Adaptation Interpolation eigenvoice CAT multiple regression
minus Small footprintminus Robustness
bull Drawback
minus Quality
bull Major factors for quality degradation [3]
minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
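A minimal sketch of the unidirectional RNN above (NumPy, with illustrative dimensions and random weights, not the talk's actual model): the hidden state carries previous context forward, so each output frame depends on all earlier inputs but on no future ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: linguistic input -> hidden -> acoustic output
D_in, D_h, D_out = 8, 16, 4
W_xh = rng.normal(scale=0.1, size=(D_h, D_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))    # hidden -> hidden (recurrent)
W_hy = rng.normal(scale=0.1, size=(D_out, D_h))  # hidden -> output

def rnn_forward(x_seq):
    """y_t depends on x_1..x_t through the recurrent hidden state h_t."""
    h = np.zeros(D_h)
    ys = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h)  # recurrent connection
        ys.append(W_hy @ h)
    return np.stack(ys)

x_seq = rng.normal(size=(5, D_in))
y = rnn_forward(x_seq)
print(y.shape)  # (5, 4): one acoustic frame per input frame
```

Perturbing an early input changes later outputs, but perturbing the last input cannot change earlier ones; that one-sidedness is exactly what the bidirectional RNN fixes.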
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: an LSTM block; a tanh of (x_t, h_{t−1}) scaled by the input gate i_t is written into the memory cell c_t, the forget gate scales the previous cell contents, and the output gate scales a tanh of the cell to give h_t; each gate is a sigmoid of (x_t, h_{t−1}) with its own bias b_i, b_f, b_o]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
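The gate equations depicted above can be sketched as follows (NumPy, standard LSTM without peephole connections; dimensions and random weights are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
D_x, D_h = 4, 8
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate/cell input, acting on [x_t, h_{t-1}]
W = {g: rng.normal(scale=0.1, size=(D_h, D_x + D_h)) for g in "ifco"}
b = {g: np.zeros(D_h) for g in "ifco"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigm(W["i"] @ z + b["i"])                       # input gate: write
    f = sigm(W["f"] @ z + b["f"])                       # forget gate: reset
    o = sigm(W["o"] @ z + b["o"])                       # output gate: read
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])   # linear memory cell
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(3, D_x)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

The key design choice is the linear cell update `c = f * c_prev + …`: gradients flow through it without passing through a squashing non-linearity, so information decays only as fast as the forget gate allows.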
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x_1, x_2, …, x_T → LSTM → acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database:            US English female speaker
Train / dev data:    34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN:                 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing:      postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
  DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral      z         p
    50.0        14.2         –           –          35.8      12.0    < 10⁻¹⁰
     –           –          30.2        15.6        54.2       5.1    < 10⁻⁶
    15.8         –          34.0         –          50.2      −6.2    < 10⁻⁹
    28.4         –           –          33.6        38.0      −1.5     0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

→ Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision trees (yes/no questions) clustering the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
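A minimal sketch of that replacement (NumPy; layer sizes, activations, and random weights are illustrative, not the experiment's): instead of traversing decision trees to look up a GMM, a single feed-forward pass maps the frame's linguistic feature vector directly to acoustic feature statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: linguistic input x -> hidden h1..h3 -> acoustic output y
sizes = [30, 64, 64, 64, 10]
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def dnn(x):
    """Deterministic mapping x -> y; replaces tree traversal + GMM lookup."""
    h = x
    for W_l, b_l in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(0.0, W_l @ h + b_l)  # non-linear hidden layers
    return Ws[-1] @ h + bs[-1]              # linear output layer

x = rng.normal(size=30)  # frame-level linguistic feature vector
y = dnn(x)
print(y.shape)  # (10,)
```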
Framework

[Figure: TEXT → text analysis → input feature extraction → binary & numeric linguistic features, plus duration and frame-position features, at each frame 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame-level features]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database:             US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate:        16 kHz
Analysis window:      25-ms width, 5-ms shift
Linguistic features:  11 categorical features, 25 numeric features
Acoustic features:    0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology:         5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture:     1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing:       postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5th mel-cepstrum trajectories over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
  HMM (α)      DNN (layers × units)   Neutral   p value    z value
  15.8 (16)    38.5 (4 × 256)          45.7     < 10⁻⁶      −9.9
  16.1 (4)     27.2 (4 × 512)          56.8     < 10⁻⁶      −5.1
  12.7 (1)     36.6 (4 × 1024)         50.7     < 10⁻⁶     −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNN-based acoustic modeling

[Figure: data samples in (y_1, y_2) space forming two clusters, with the NN prediction falling between them]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

→ Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
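The unimodality problem can be demonstrated in a few lines (NumPy, synthetic data invented for illustration): when the same input maps to targets drawn from two modes, the MSE-optimal predictor is the conditional mean, which lies between the modes where almost no real data exists.

```python
import numpy as np

rng = np.random.default_rng(3)

# One-to-many data: for an identical input, the target comes from one of
# two modes (e.g. two ways of realising the same phoneme), near -1 and +1
n = 4000
x = np.ones((n, 1))                          # identical linguistic context
modes = rng.choice([-1.0, 1.0], size=n)      # the speaker picks a mode
y = modes + rng.normal(scale=0.05, size=n)   # samples cluster near -1 / +1

# Least squares (= minimising MSE) yields the mean of y as the prediction
w, *_ = np.linalg.lstsq(x, y, rcond=None)
print(w[0])  # near 0: between the modes, where almost no sample lies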
Mixture density network [26]

[Figure: MDN output layer producing w_1(x), w_2(x), μ_1(x), μ_2(x), σ_1(x), σ_2(x), which parameterize a two-component mixture density over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

1-dim, 2-mix MDN; inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                μ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
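The 1-dim, 2-mix output-layer mapping above can be sketched directly (NumPy; the activation values z are made up for illustration):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 output-layer activations z_1..z_6 of a 1-dim, 2-mix MDN
    to mixture weights, means, and standard deviations."""
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # softmax -> weights sum to 1
    mu = z[2:4]                              # linear -> means (unbounded)
    sigma = np.exp(z[4:6])                   # exponential -> std devs > 0
    return w, mu, sigma

z = np.array([0.5, -0.5, 1.2, -0.8, 0.0, -1.0])  # illustrative activations
w, mu, sigma = mdn_params(z)

# Density of the resulting 2-component GMM at an output value y
y_val = 1.0
pdf = np.sum(w * np.exp(-0.5 * ((y_val - mu) / sigma) ** 2)
             / (sigma * np.sqrt(2.0 * np.pi)))
print(w.sum(), pdf)
```

Each activation function enforces the corresponding parameter constraint: softmax keeps the weights a valid distribution, and the exponential keeps the variances strictly positive.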
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame linguistic features x_1, x_2, …, x_T → DMDN outputs mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t) at every frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture:  4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization:      AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS:
  HMM        1 mix     3.537 ± 0.113
             2 mix     3.397 ± 0.115
  DNN        4×1024    3.635 ± 0.127
             5×1024    3.681 ± 0.109
             6×1024    3.652 ± 0.108
             7×1024    3.637 ± 0.129
  DMDN       1 mix     3.654 ± 0.117
  (4×1024)   2 mix     3.796 ± 0.107
             4 mix     3.766 ± 0.113
             8 mix     3.805 ± 0.113
             16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
[Figure: decision trees of yes/no context questions partitioning the acoustic space]
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, plus duration and frame-position features, for frames 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
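The per-frame input vector in the framework figure can be assembled roughly as follows. This is a minimal sketch with a hypothetical feature layout — the function and feature names are illustrative, not the exact system:

```python
import numpy as np

def frame_input(binary_feats, numeric_feats, phone_dur, frame_idx):
    """Build one frame's DNN input: binary answers to linguistic context
    questions, numeric linguistic values, plus duration and within-phone
    frame-position features."""
    position = frame_idx / max(phone_dur - 1, 1)  # relative position in phone, 0..1
    return np.concatenate([
        np.asarray(binary_feats, dtype=float),    # e.g. "is the current phone a vowel?"
        np.asarray(numeric_feats, dtype=float),   # e.g. number of syllables in the word
        [float(phone_dur), position],             # duration & frame-position features
    ])

# Toy usage: 3 binary questions, 2 numeric features, frame 3 of a 10-frame phone.
x = frame_input([1, 0, 1], [2.0, 4.0], phone_dur=10, frame_idx=3)
print(x.shape)  # (7,)
```

One such vector is produced per frame 1…T, which is why duration prediction must run first: it fixes T and the frame-position values.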
Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup

Database             US English female speaker
Training/test data   33,000 & 173 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  11 categorical features, 25 numeric features
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology         5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture     1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing       Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient vs. frame (0–500) for natural speech, HMM (α=1), and DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM (α)     DNN (layers × units)   Neutral   p value    z value
  15.8 (16)   38.5 (4 × 256)          45.7     < 10⁻⁶      −9.9
  16.1 (4)    27.2 (4 × 512)          56.8     < 10⁻⁶      −5.1
  12.7 (1)    36.6 (4 × 1024)         50.7     < 10⁻⁶     −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples over (y1, y2); the NN prediction sits at the conditional mean, between the clusters]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
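The unimodality problem can be checked numerically (a sketch with made-up bimodal targets standing in for two "ways of speaking"): the MSE-optimal prediction is the conditional mean, which lies between the two modes and corresponds to neither.

```python
import numpy as np

# Hypothetical bimodal targets for a single input x: modes at -1 and +1.
targets = np.array([-1.0, -1.0, 1.0, 1.0])

# Scan candidate constant predictions and keep the MSE minimizer.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((targets[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best = candidates[mse.argmin()]
print(best)  # ~0.0: the conditional mean, far from both modes
```

A mixture density output layer avoids this by predicting the full conditional distribution instead of a single point estimate.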
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — network outputs w1(x1), w2(x1), μ1(x1), μ2(x1), σ1(x1), σ2(x1), defining a mixture density over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                μ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
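The 1-dim, 2-mix MDN output layer above maps directly to code. A minimal sketch, following the slide's activation functions (z1…z6 are the six output-layer pre-activations; function names are mine, not from the talk):

```python
import math

def mdn_params(z):
    """1-dim, 2-mix MDN output layer: weights via softmax,
    means via linear, standard deviations via exponential activation."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)   # mixture weights
    mu = (z3, z4)                                      # component means
    sigma = (math.exp(z5), math.exp(z6))               # component std devs
    return w, mu, sigma

def mdn_density(y, z):
    """p(y | x) as the resulting 2-component Gaussian mixture."""
    w, mu, sigma = mdn_params(z)
    return sum(
        wk * math.exp(-0.5 * ((y - mk) / sk) ** 2) / (sk * math.sqrt(2 * math.pi))
        for wk, mk, sk in zip(w, mu, sigma)
    )

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
print(w)      # (0.5, 0.5): equal logits give equal weights
print(sigma)  # (1.0, 1.0): exp(0) = 1
```

The exponential on the variance outputs guarantees positivity, which is why the slide pairs each activation with its parameter type.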
DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN emitting w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup

• Almost the same as the previous setup
• Differences:

  DNN architecture    4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
  DMDN architecture   4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM        1 mix     3.537 ± 0.113
             2 mix     3.397 ± 0.115
  DNN        4×1024    3.635 ± 0.127
             5×1024    3.681 ± 0.109
             6×1024    3.652 ± 0.108
             7×1024    3.637 ± 0.129
  DMDN       1 mix     3.654 ± 0.117
  (4×1024)   2 mix     3.796 ± 0.107
             4 mix     3.766 ± 0.113
             8 mix     3.805 ± 0.113
             16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: recurrent connections in the hidden layer; inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
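The basic RNN recurrence can be sketched in a few lines (a minimal sketch with made-up sizes and parameter names): the hidden state h_t summarizes everything seen so far, so each per-frame output can depend on leftward context.

```python
import numpy as np

def rnn_forward(xs, W, R, V, b, c):
    """Run a basic RNN over a sequence: h_t = tanh(W x_t + R h_{t-1} + b),
    y_t = V h_t + c. The recurrent matrix R carries context forward."""
    h = np.zeros(R.shape[0])
    ys = []
    for x_t in xs:
        h = np.tanh(W @ x_t + R @ h + b)   # recurrent connection
        ys.append(V @ h + c)               # per-frame output
    return np.array(ys)

# Toy usage: T=5 frames of 2-dim input, 4-dim hidden state, 3-dim output.
rng = np.random.default_rng(1)
xs = rng.standard_normal((5, 2))
W, R = rng.standard_normal((4, 2)), rng.standard_normal((4, 4))
V, b, c = rng.standard_normal((3, 4)), np.zeros(4), np.zeros(3)
ys = rnn_forward(xs, W, R, V, b, c)
print(ys.shape)  # (5, 3)
```

Because information must survive repeated passes through tanh(R · …), it decays or gets overwritten over long spans — the motivation for the LSTM on the next slide.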
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: LSTM block — memory cell c_t with tanh squashing, and sigmoid input, forget, and output gates, each fed by x_t and h_{t−1}, producing h_t]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
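One step of the LSTM block above can be sketched as follows (a minimal sketch of the standard formulation; the parameter-dictionary names W*, R*, b* are my assumptions, and peephole connections are omitted):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: the input gate writes, the forget gate resets the
    linear memory cell, and the output gate reads it out."""
    i = sigm(p["Wi"] @ x_t + p["Ri"] @ h_prev + p["bi"])     # input gate
    f = sigm(p["Wf"] @ x_t + p["Rf"] @ h_prev + p["bf"])     # forget gate
    g = np.tanh(p["Wc"] @ x_t + p["Rc"] @ h_prev + p["bc"])  # cell candidate
    c = f * c_prev + i * g                                   # linear memory cell
    o = sigm(p["Wo"] @ x_t + p["Ro"] @ h_prev + p["bo"])     # output gate
    h = o * np.tanh(c)                                       # block output
    return h, c

# Toy usage: 2-dim input, 3-dim state, small random parameters.
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((3, 2)) * 0.1 for k in ("Wi", "Wf", "Wc", "Wo")}
p.update({k: rng.standard_normal((3, 3)) * 0.1 for k in ("Ri", "Rf", "Rc", "Ro")})
p.update({k: np.zeros(3) for k in ("bi", "bf", "bc", "bo")})
h, c = lstm_step(np.ones(2), np.zeros(3), np.zeros(3), p)
print(h.shape, c.shape)  # (3,) (3,)
```

The additive cell update c = f·c_prev + i·g is what lets gradients and information persist over long spans, unlike the purely multiplicative recurrence of the basic RNN.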
LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database             US English female speaker
Train/dev set data   34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449; LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Linguistic rarr acoustic mapping
bull TrainingLearn relationship between linguistc amp acoustic features
bull SynthesisMap linguistic features to acoustic ones
bull Linguistic features used in SPSS
minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)
Effective modeling is essential
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
HMM-based acoustic modeling for SPSS [4]
yes noyes no
yes no
yes no yes no
Acoustic space
bull Decision tree-clustered HMM with GMM state-output distributions
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), which parameterize a GMM over y.]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs to the output-layer activations: z_j = Σ_{i=1}^{4} h_i w_ij

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    μ_1(x) = z_3    σ_1(x) = exp(z_5)
w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)    μ_2(x) = z_4    σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
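For the 1-dimensional, 2-mixture MDN above, the six raw outputs z_1…z_6 map to valid GMM parameters as follows. A minimal sketch; the function name and the NumPy implementation are illustrative, but the activations (softmax, linear, exponential) follow the slide.

```python
import numpy as np

def mdn_params(z):
    """Map 6 raw network outputs to a 2-component 1-D GMM."""
    z = np.asarray(z, dtype=float)
    logits = z[0:2]                 # z1, z2: mixture weights via softmax
    w = np.exp(logits - logits.max())
    w = w / w.sum()                 # weights are positive and sum to 1
    mu = z[2:4]                     # z3, z4: means, linear activation
    sigma = np.exp(z[4:6])          # z5, z6: exponential keeps stddevs > 0
    return w, mu, sigma

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
```

The exponential for σ and the softmax for w are what let an unconstrained linear layer produce a well-formed density, unlike the plain linear output layer of the DNN.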
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis / input feature extraction → duration prediction → DMDN predicts frame-level GMM parameters w_i(x_t), μ_i(x_t), σ_i(x_t) for t = 1…T → parameter generation → waveform synthesis → SPEECH.]
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer,
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer,
                    ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU
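For reference, the AdaGrad family that AdaDec belongs to scales each parameter's step by its accumulated squared gradients. This sketches plain AdaGrad only; the specific modification AdaDec [29] makes to the accumulator is not given on the slide.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: the effective per-parameter learning rate
    shrinks as squared gradients accumulate over training."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# toy example: a single parameter seeing a gradient of 2.0
theta = np.array([1.0])
accum = np.zeros(1)
theta, accum = adagrad_step(theta, np.array([2.0]), accum)
```

Parameters with consistently large gradients get their steps damped quickly, which helps when input features (binary and numeric) have very different scales.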
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM         1 mix     3.537 ± 0.113
            2 mix     3.397 ± 0.115
DNN         4×1024    3.635 ± 0.127
            5×1024    3.681 ± 0.109
            6×1024    3.652 ± 0.108
            7×1024    3.637 ± 0.129
DMDN        1 mix     3.654 ± 0.117
(4×1024)    2 mix     3.796 ± 0.107
            4 mix     3.766 ± 0.113
            8 mix     3.805 ± 0.113
            16 mix    3.791 ± 0.102
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Basic RNN
[Figure: RNN unrolled over time — input x_t feeds the hidden state through recurrent connections; output y_t depends on x_1…x_t.]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
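The unidirectional recurrence can be sketched in a few lines (a toy NumPy version with illustrative weight names, not the trained system): h_t is computed from x_t and h_{t-1}, so each output can only see past context, which is exactly the limitation that motivates the bidirectional variant.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Basic unidirectional RNN: y_t depends only on x_1..x_t."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                              # one step per frame
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # recurrent hidden update
        ys.append(W_hy @ h + b_y)
    return np.stack(ys)

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 3))              # 5 frames, 3 input features
W_xh, W_hh = rng.standard_normal((4, 3)), rng.standard_normal((4, 4))
W_hy, b_h, b_y = rng.standard_normal((2, 4)), np.zeros(4), np.zeros(2)
ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
```

Changing a later frame leaves earlier outputs untouched, confirming the strictly causal information flow.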
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with an input gate (write), output gate (read), and forget gate (reset); each gate applies a sigmoid to x_t and h_{t-1} (with biases b_i, b_f, b_o), while the cell input and output pass through tanh.]
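One step of the block diagram above can be sketched as follows. This is a common textbook LSTM cell without peephole connections; the parameter names (Wi, Ri, bi, …) are mine, and the exact variant used in [32] may differ in such details.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: gates multiplicatively control a linear memory cell."""
    i = sigm(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])     # input gate: write
    f = sigm(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])     # forget gate: reset
    o = sigm(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])     # output gate: read
    g = np.tanh(p["Wc"] @ x + p["Rc"] @ h_prev + p["bc"])  # candidate cell input
    c = f * c_prev + i * g        # linear cell update: memory can persist
    h = o * np.tanh(c)            # gated read-out
    return h, c

n_in, n_h = 2, 3
p = {k: np.zeros((n_h, n_in)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros((n_h, n_h)) for k in ("Ri", "Rf", "Ro", "Rc")})
p.update({k: np.zeros(n_h) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(np.zeros(n_in), np.zeros(n_h), np.ones(n_h), p)
```

Because c_t is updated additively (f·c_{t-1} + i·g) rather than squashed through a nonlinearity each step, information decays far more slowly than in the basic RNN.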
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS pipeline — TEXT → text analysis / input feature extraction → duration prediction → inputs x_1…x_T → LSTM → outputs y_1…y_T → parameter generation → waveform synthesis → SPEECH.]
Experimental setup
Database           US English female speaker
Train / dev data   34,632 & 100 sentences
Sampling rate      16 kHz
Analysis window    25-ms width, 5-ms shift
Linguistic         DNN: 449
features           LSTM: 289
Acoustic           0–39 mel-cepstrum,
features           log F0, 5-band aperiodicity (∆, ∆²)
DNN                4 hidden layers, 1024 units/hidden layer,
                   ReLU (hidden), linear (output),
                   AdaDec [29] on GPU
LSTM               1 forward LSTM layer,
                   256 units, 128 projection,
                   asynchronous SGD on CPUs [35]
Postprocessing     Postfiltering in cepstrum domain [25]
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN               LSTM              Stats
w/ ∆    w/o ∆     w/ ∆    w/o ∆     Neutral   z       p
50.0    14.2      –       –         35.8      12.0    < 10^-10
–       –         30.2    15.6      54.2      5.1     < 10^-6
15.8    –         34.0    –         50.2      -6.2    < 10^-9
28.4    –         –       33.6      38.0      -1.5    0.138
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
HMM-based acoustic modeling for SPSS [4]
[Figure: binary decision trees of yes/no context questions partition the acoustic space into clustered states.]

• Decision tree-clustered HMM with GMM state-output distributions
DNN-based acoustic modeling for SPSS [18]
[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y.]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
Framework
[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature) → input features at frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame positions.]
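The frame-level mapping at the heart of the framework can be sketched as a plain feed-forward pass (illustrative shapes and names only; the experiments below use sigmoid or ReLU hidden units and far larger layers):

```python
import numpy as np

def dnn_acoustic_model(x_frame, layers):
    """Map one frame's linguistic input features (binary & numeric, plus
    duration/frame-position features) to acoustic statistics.
    `layers` is a list of (W, b) pairs; hidden layers are non-linear,
    the output layer is linear (targets: spectral, excitation, V/UV)."""
    h = x_frame
    for W, b in layers[:-1]:
        h = np.tanh(W @ h + b)   # non-linear hidden layers
    W, b = layers[-1]
    return W @ h + b             # linear output layer

x = np.ones(6)                   # hypothetical frame-level input vector
layers = [(np.zeros((8, 6)), np.zeros(8)),
          (np.zeros((4, 8)), np.arange(4.0))]  # toy weights for illustration
y = dnn_acoustic_model(x, layers)
```

The same function is applied independently at every frame t = 1…T, which is precisely the frame-by-frame limitation revisited in the RNN section.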
Advantages of NN-based acoustic modeling
• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform
Framework
Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Experimental setup
Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic            11 categorical features,
features              25 numeric features
Acoustic              0–39 mel-cepstrum,
features              log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21],
                      MSD F0 [22], MDL [23]
DNN                   1–5 layers, 256/512/1024/2048 units/layer,
architecture          sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
DNN-based acoustic modeling for SPSS [18]
Acoustic
features y
Linguistic
features x
h1
h2
h3
bull DNN represents conditional distribution of y given x
bull DNN replaces decision trees and GMMs
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Framework
Input layer Output layerHidden layers
TEXT
SPEECH Parametergeneration
Waveformsynthesis
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
feat
ures
at f
ram
e 1
Inpu
t fea
ture
s in
clud
ing
bina
ry amp
num
eric
fe
atur
es a
t fra
me T
Textanalysis
Input featureextraction
Statistics (m
ean amp var) of speech param
eter vector sequence
Binaryfeatures
NumericfeaturesDuration
featureFrame position
feature
Spectralfeatures
ExcitationfeaturesVUVfeature
Durationprediction
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No
• NN [19]
• RNN [20]

What's the difference?
• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer,
                      sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: trajectories of the 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512).]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar # of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

    HMM (α)      DNN (layers × units)    Neutral    p value     z value
    15.8 (16)    38.5 (4 × 256)          45.7       < 10⁻⁶      -9.9
    16.1 (4)     27.2 (4 × 512)          56.8       < 10⁻⁶      -5.1
    12.7 (1)     36.6 (4 × 1,024)        50.7       < 10⁻⁶      -11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
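The slides report a z value and p value per comparison without defining the test. One common choice for paired preference tests is a two-proportion z statistic over the preference fractions; the sketch below is that assumption, not necessarily the authors' computation, and `preference_z` is a hypothetical helper:

```python
import math

def preference_z(p_a: float, p_b: float, n: int) -> float:
    """z statistic for a paired preference test: under the null
    hypothesis of no preference, the difference of the two vote
    fractions has variance (p_a + p_b) / n."""
    return (p_a - p_b) / math.sqrt((p_a + p_b) / n)

# Checking the first table row (38.5% vs 15.8% over
# 173 sentences x 5 subjects = 865 votes): |z| comes out near 9,
# i.e. far beyond the p < 10^-6 threshold.
z = preference_z(0.385, 0.158, 865)
```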
Outline
Background
  – HMM-based statistical parametric speech synthesis (SPSS)
  – Flexibility
  – Improvements
Statistical parametric speech synthesis with neural networks
  – Deep neural network (DNN)-based SPSS
  – Deep mixture density network (DMDN)-based SPSS
  – Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples in the (y1, y2) plane form two clusters; the NN prediction falls between them.]

• Unimodality
  – Humans can speak in different ways → one-to-many mapping
  – NN trained by MSE loss → approximates the conditional mean
• Lack of variance
  – DNN-based SPSS uses variances computed from all training data
  – The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
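The unimodality limitation can be seen numerically: minimizing MSE over a one-to-many mapping drives the prediction toward the conditional mean, which lies between the modes and resembles no real sample. A minimal NumPy sketch on synthetic two-mode data (not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-to-many data: for the same input, the target is drawn from
# one of two modes, y = +1 or y = -1 (plus small noise).
n = 1000
y = rng.choice([-1.0, 1.0], size=n) + 0.01 * rng.standard_normal(n)

# Any constant predictor trained with MSE loss converges to the
# conditional mean -- here simply the sample mean, near 0:
# between the modes, unlike any actual sample.
y_hat = y.mean()
```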
Mixture density network [26]
[Figure: 1-dim, 2-mix MDN; the last hidden layer (4 units) feeds six output units z1…z6, which parameterize w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x).]

Weights   → softmax activation function
Means     → linear activation function
Variances → exponential activation function

Inputs of activation function:  z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)     w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                  μ2(x) = z4
σ1(x) = exp(z5)                             σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
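The slide's activation choices translate directly into code. Below is a sketch of the 1-dim, 2-mix MDN output layer, mapping the six linear outputs z1…z6 to mixture parameters; `mdn_output` is an illustrative helper name:

```python
import numpy as np

def mdn_output(z: np.ndarray):
    """Map a 6-unit linear output z to the parameters of a
    1-dimensional, 2-mixture MDN, as on the slide:
    weights -> softmax, means -> identity, std devs -> exp."""
    zw, zmu, zsigma = z[:2], z[2:4], z[4:6]
    w = np.exp(zw) / np.exp(zw).sum()   # w1, w2: sum to 1
    mu = zmu                            # mu1 = z3, mu2 = z4
    sigma = np.exp(zsigma)              # sigma1, sigma2: always > 0
    return w, mu, sigma

w, mu, sigma = mdn_output(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.5]))
# w == [0.5, 0.5], mu == [-1, 1], sigma == [1, exp(0.5)]
```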
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS. TEXT → text analysis → input feature extraction → duration prediction gives frame-level inputs x1, x2, …, xT; at each frame t the DMDN outputs mixture parameters w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) → parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer,
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer,
                    ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

    HMM         1 mix      3.537 ± 0.113
                2 mix      3.397 ± 0.115
    DNN         4×1024     3.635 ± 0.127
                5×1024     3.681 ± 0.109
                6×1024     3.652 ± 0.108
                7×1024     3.637 ± 0.129
    DMDN        1 mix      3.654 ± 0.117
    (4×1024)    2 mix      3.796 ± 0.107
                4 mix      3.766 ± 0.113
                8 mix      3.805 ± 0.113
                16 mix     3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
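Assuming the ± values in the table are 95% confidence intervals on the mean (the slides do not say), they can be computed from raw ratings with a normal approximation. The ratings below are made up for illustration:

```python
import math

def mos_with_ci(ratings, z95: float = 1.96):
    """Mean opinion score with a normal-approximation 95% CI
    (sample std dev over sqrt(n), scaled by the 1.96 z quantile)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = z95 * math.sqrt(var / n)
    return mean, half

# Hypothetical 1-5 ratings; a real test aggregates thousands.
mean, half = mos_with_ci([4, 4, 3, 5, 4, 3, 4, 5, 2, 4])
```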
Outline
Background
  – HMM-based statistical parametric speech synthesis (SPSS)
  – Flexibility
  – Improvements
Statistical parametric speech synthesis with neural networks
  – Deep neural network (DNN)-based SPSS
  – Deep mixture density network (DMDN)-based SPSS
  – Recurrent neural network (RNN)-based SPSS
Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
  – A fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes, syllable stress) is used as input
  – Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  – Each frame is mapped independently
  – Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: a basic RNN unrolled in time; inputs x_{t-1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t-1}, y_t, y_{t+1}.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  – Information in hidden layers loops through recurrent connections
    → quickly decays over time
  – Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
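The recurrence that distinguishes an RNN from the frame-by-frame DNN fits in a few lines. `rnn_step` below is an illustrative forward-only cell with random weights, so each h_t depends on all earlier frames but on no later ones (hence the bidirectional variant on the slide):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step of a basic RNN: the hidden state carries context
    forward through the recurrent connection Wh @ h_prev."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(0)
Wx = rng.normal(size=(8, 4))   # input-to-hidden weights
Wh = rng.normal(size=(8, 8))   # recurrent (hidden-to-hidden) weights
b = np.zeros(8)

h = np.zeros(8)
for t in range(5):             # unroll over a short input sequence
    h = rnn_step(rng.normal(size=4), h, Wx, Wh, b)
```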
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. The memory cell c_t is written through an input gate, read through an output gate producing h_t, and reset through a forget gate; each gate takes x_t and h_{t-1} (with biases b_i, b_o, b_f, b_c) through sigmoid units, with tanh on the cell input and output.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
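A minimal sketch of the gated update described above, with one weight matrix over [x_t; h_{t-1}] producing the four gate pre-activations; this is a common but not the only parameterization (peephole connections, used in some LSTM variants, are omitted):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: input (write), forget (reset) and output (read)
    gates around a linear memory cell, as in the block diagram."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)                 # gate pre-activations
    c = sigm(f) * c_prev + sigm(i) * np.tanh(g) # cell: gated linear update
    h = sigm(o) * np.tanh(c)                    # output: gated read
    return h, c

rng = np.random.default_rng(0)
nx, nh = 4, 8
W = rng.normal(size=(4 * nh, nx + nh))
b = np.zeros(4 * nh)

h, c = np.zeros(nh), np.zeros(nh)
for t in range(5):
    h, c = lstm_step(rng.normal(size=nx), h, c, W, b)
```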
LSTM-based SPSS [33 34]
[Figure: LSTM-based SPSS. TEXT → text analysis → input feature extraction → duration prediction gives frame-level inputs x1, x2, …, xT; an LSTM maps them to outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH.]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database              US English female speaker
Train / dev set data  34,632 & 100 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   DNN: 449, LSTM: 289
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                   4 hidden layers, 1024 units/hidden layer,
                      ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                  1 forward LSTM layer, 256 units, 128 projection,
                      asynchronous SGD on CPUs [35]
Postprocessing        Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Advantages of NN-based acoustic modeling
bull Integrating feature extraction
minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling
bull Distributed representation
minus Can be exponentially more efficient than fragmentedrepresentation
minus Better representation ability with fewer parameters
bull Layered hierarchical structure in speech production
minus concept rarr linguistic rarr articulatory rarr waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

       DNN                LSTM
w/ ∆   w/o ∆      w/ ∆    w/o ∆     Neutral    z       p
50.0   14.2       –       –         35.8       12.0    < 10⁻¹⁰
–      –          30.2    15.6      54.2       5.1     < 10⁻⁶
15.8   –          34.0    –         50.2       −6.2    < 10⁻⁹
28.4   –          –       33.6      38.0       −1.5    0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Advantages of NN-based acoustic modeling
• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than fragmented representations
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79
Framework
Is this new? No:
• NN [19]
• RNN [20]
What's the difference?
• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer;
                      sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512) trajectories]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)    Neutral    p value     z value
15.8 (16)    38.5 (4 × 256)          45.7       < 10⁻⁶      −9.9
16.1 (4)     27.2 (4 × 512)          56.8       < 10⁻⁶      −5.1
12.7 (1)     36.6 (4 × 1,024)        50.7       < 10⁻⁶      −11.5
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNN-based acoustic modeling
[Figure: data samples scattered in (y1, y2) space, with the NN prediction collapsing onto the conditional mean between the clusters]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances
Linear output layer → mixture density output layer [26]
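The unimodality problem is easy to see numerically: for a bimodal target distribution, the MSE-optimal prediction is the mean, which falls between the modes and matches neither. A self-contained illustration with synthetic data (the specific modes are hypothetical):

```python
import numpy as np

# Hypothetical bimodal targets for one input: two equally likely "ways of speaking".
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 0.1, 500), rng.normal(2.0, 0.1, 500)])

# The constant prediction minimizing MSE is the sample mean ...
y_hat = y.mean()

# ... which lies between the two modes, matching neither.
assert abs(y_hat) < 0.5
assert min(abs(y_hat - (-2.0)), abs(y_hat - 2.0)) > 1.5
```

A mixture density output layer avoids this by predicting the full conditional distribution instead of a single point.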
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
[Figure: mixture density network — the output layer emits mixture weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x) defining a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation function:
z_j = Σ_{i=1}^{4} h_i w_{ij}

1-dim, 2-mix MDN:
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
µ1(x) = z3                                µ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
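The output-layer mapping above can be written out directly. A 1-dimensional sketch — `mdn_outputs` and `mdn_nll` are illustrative names, with the negative log-likelihood loss conventionally used to train MDNs [26]:

```python
import numpy as np

def mdn_outputs(z, n_mix=2):
    """Map 3*n_mix pre-activations z to GMM parameters (1-dim case):
    weights via softmax, means via identity, std devs via exp (positivity)."""
    zw, zmu, zsig = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:3 * n_mix]
    w = np.exp(zw - zw.max())
    w /= w.sum()                 # softmax -> mixture weights summing to 1
    return w, zmu, np.exp(zsig)  # linear means, exponential std devs

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of scalar y under the mixture (training loss)."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())
```

Minimizing `mdn_nll` over the training data makes the network predict a full conditional distribution per frame, rather than a single conditional mean.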
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
[Figure: DMDN-based SPSS pipeline — TEXT → text analysis / input feature extraction → duration prediction → frame inputs x_1, x_2, …, x_T → deep MDN predicting per-frame GMM parameters w_m(x_t), µ_m(x_t), σ_m(x_t) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer;
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer;
                    ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix      3.537 ± 0.113
               2 mix      3.397 ± 0.115
DNN            4×1024     3.635 ± 0.127
               5×1024     3.681 ± 0.109
               6×1024     3.652 ± 0.108
               7×1024     3.637 ± 0.129
DMDN (4×1024)  1 mix      3.654 ± 0.117
               2 mix      3.796 ± 0.107
               4 mix      3.766 ± 0.113
               8 mix      3.805 ± 0.113
               16 mix     3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes / syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential
Recurrent connections → Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Framework
Is this new no
bull NN [19]
bull RNN [20]
Whatrsquos the difference
bull More layers data computational resources
bull Better learning algorithm
bull Statistical parametric speech synthesis techniques
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Experimental setup
Database US English female speaker
Training test data 33000 amp 173 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic 11 categorical featuresfeatures 25 numeric features
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2
HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]
DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79
Example of speech parameter trajectories
wo grouping questions numeric contexts silence frames removed
-1
0
1
0 100 200 300 400 500
5-t
h M
el-ce
pstr
um
Frame
Natural speechHMM (α=1)DNN (4x512)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network maps x to GMM parameters w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), defining a density over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs to the activation functions (1-dim, 2-mix MDN):

z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                 μ_2(x) = z_4
σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
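The mixture density output layer above can be sketched in numpy (the layout of the linear outputs z — weight logits, then means, then log-variance-like units — follows the slide's 1-dim, 2-mix example; this is an illustration, not the actual training code):

```python
import numpy as np

def mdn_output_layer(z, n_mix):
    """Map the 3*n_mix linear outputs z of the last hidden layer to GMM parameters."""
    zw, zmu, zsig = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())      # softmax -> mixture weights summing to 1
    w /= w.sum()
    mu = zmu                       # linear -> means (unconstrained)
    sigma = np.exp(zsig)           # exponential -> strictly positive std. deviations
    return w, mu, sigma

def mdn_neg_log_likelihood(y, w, mu, sigma):
    """Negative log-likelihood of a scalar target y under the predicted GMM,
    which is the loss an MDN is trained with instead of MSE."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

# z1..z2 -> weights, z3..z4 -> means, z5..z6 -> variances, as on the slide.
w, mu, sigma = mdn_output_layer(np.array([0.1, -0.3, 1.0, 2.0, 0.0, 0.5]), n_mix=2)
nll = mdn_neg_log_likelihood(1.5, w, mu, sigma)
```

Because the variances come through an exponential, they stay positive without any constraint on the network's linear outputs, which is the point of the activation choices listed above.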
DMDN-based SPSS [27]

[Figure: pipeline — TEXT → text analysis / input feature extraction → x1, x2, …, xT → duration prediction → DMDN outputs per-frame GMM parameters w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
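One way to reduce the per-frame GMM predictions before parameter generation is to keep, for each frame, the statistics of the component with the largest predicted weight. This is a hedged, simplified sketch (the function name and the reduction itself are illustrative; the actual generation step in [27] runs MLPG over the selected means and variances):

```python
import numpy as np

def most_probable_component_track(weights, means, sigmas):
    """For each frame t, keep the mean/variance of the mixture component
    with the largest predicted weight (shapes: (T, M) each)."""
    idx = weights.argmax(axis=1)                 # (T,) best component per frame
    t = np.arange(weights.shape[0])
    return means[t, idx], sigmas[t, idx] ** 2    # per-frame mean and variance

# Toy example: T = 4 frames, M = 2 mixture components.
w = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
mu = np.arange(8, dtype=float).reshape(4, 2)     # [[0,1],[2,3],[4,5],[6,7]]
mu_t, var_t = most_probable_component_track(w, mu, np.ones((4, 2)))
# Frames pick components 0, 1, 0, 1 -> means 0, 3, 4, 7
```

The resulting per-frame mean/variance sequence has the same form as a single-Gaussian DNN output, so the standard parameter generation algorithm applies unchanged.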
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Mean opinion scores:

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements
• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS
• Summary
Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN

[Figure: network with recurrent connections, unrolled over time — inputs xt−1, xt, xt+1 map to outputs yt−1, yt, yt+1 through a shared hidden state]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in the hidden layers loops through the recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
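The unrolled network in the figure can be sketched as a single recurrence step (shapes and parameter names are illustrative; the hidden state h is what loops through the recurrent connection and decays over time):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    """One step of a vanilla unidirectional RNN: the hidden state carries
    previous context forward through the recurrent weights W_hh."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y           # linear output, as in regression-style SPSS
    return h_t, y_t

# Toy dimensions: 3-dim input, 8-dim hidden state, 2-dim output.
rng = np.random.default_rng(1)
W_xh = rng.standard_normal((8, 3))
W_hh = rng.standard_normal((8, 8)) * 0.1   # small recurrent weights
b_h, b_y = np.zeros(8), np.zeros(2)
W_hy = rng.standard_normal((2, 8))
h, y_t = rnn_step(rng.standard_normal(3), np.zeros(8), W_xh, W_hh, b_h, W_hy, b_y)
```

Because only h_prev links consecutive frames, each output depends on all previous inputs, unlike the frame-independent DNN mapping criticized on the previous slide.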
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell ct with an input gate (write), an output gate (read), and a forget gate (reset); sigmoid gates and tanh nonlinearities driven by xt and ht−1]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
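The write/read/reset gating described above can be written as one numpy step (the stacked parameter layout is an assumption for illustration; this omits the projection layer and peephole connections used in some LSTM variants):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the input/forget/cell/output parameters
    as four blocks; multiplicative gates surround a linear memory cell c."""
    n = h_prev.size
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])      # input gate  (write)
    f = sigmoid(z[1 * n:2 * n])      # forget gate (reset)
    g = np.tanh(z[2 * n:3 * n])      # candidate cell update
    o = sigmoid(z[3 * n:4 * n])      # output gate (read)
    c_t = f * c_prev + i * g         # linear memory cell: additive, not overwritten
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions: 3-dim input, 4-dim cell/hidden state.
rng = np.random.default_rng(2)
n, d = 4, 3
W = rng.standard_normal((4 * n, d))
U = rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(d), np.zeros(n), np.zeros(n), W, U, b)
```

The key difference from the basic RNN is the additive cell update c_t = f·c_prev + i·g: information persists until the forget gate resets it, instead of decaying through repeated tanh squashing.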
LSTM-based SPSS [33, 34]

[Figure: pipeline — TEXT → text analysis / input feature extraction → x1, x2, …, xT → duration prediction → LSTM maps each xt to yt → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup

Database:            US English female speaker
Train / dev set:     34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:      postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Subjective evaluations
Compared HMM-based systems with DNN-based ones withsimilar of parameters
bull Paired comparison test
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
HMM DNN(α) (layers times units) Neutral p value z value
158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
[Figure: TTS pipeline: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT; for each frame t the DMDN outputs w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt), defining a GMM over y → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
• 5-point-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix: 3.537 ± 0.113   2 mix: 3.397 ± 0.115
DNN            4×1024: 3.635 ± 0.127  5×1024: 3.681 ± 0.109  6×1024: 3.652 ± 0.108  7×1024: 3.637 ± 0.129
DMDN (4×1024)  1 mix: 3.654 ± 0.117   2 mix: 3.796 ± 0.107   4 mix: 3.766 ± 0.113   8 mix: 3.805 ± 0.113   16 mix: 3.791 ± 0.102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline

Background
- HMM-based statistical parametric speech synthesis (SPSS)
- Flexibility
- Improvements

Statistical parametric speech synthesis with neural networks
- Deep neural network (DNN)-based SPSS
- Deep mixture density network (DMDN)-based SPSS
- Recurrent neural network (RNN)-based SPSS

Summary
Limitations of DNNDMDN-based acoustic modeling
• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

⇒ Recurrent connections → recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
[Figure: RNN unrolled in time: inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in the hidden layers loops through the recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
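The recurrence on this slide can be sketched in a few lines of numpy (hypothetical names; a tanh hidden layer is assumed): each hidden state depends on the current input and, through the recurrent weights, on the whole input history so far.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t depends on the current input x_t AND, via the recurrent
    # weights W_hh, on the history summarised in h_prev.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))          # input -> hidden weights
W_hh = rng.normal(size=(4, 4))          # recurrent hidden -> hidden weights
b_h = np.zeros(4)

h = np.zeros(4)                          # initial hidden state
for x_t in rng.normal(size=(5, 3)):      # unroll over a 5-frame sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
assert h.shape == (4,)                   # tanh keeps each unit in (-1, 1)
```

Note the forward-only loop: h at frame t never sees x at frames after t, which is exactly the limitation the bidirectional RNN addresses.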
Long short-term memory (LSTM) [32]
• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block: linear memory cell c_t with tanh input and output squashing; the input gate i_t controls writing, the forget gate controls resetting, and the output gate controls reading; each gate is a sigmoid of x_t, h_{t−1}, and a bias]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
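The write/read/reset gating in the block diagram can be sketched as a minimal numpy forward step (hypothetical names; peephole connections and the projection layer used later in the talk are omitted):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [x_t, h_prev] to the stacked pre-activations of the four parts:
    # input gate i, forget gate f, cell proposal g, output gate o.
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, g, o = np.split(z, 4)
    c_t = sigm(f) * c_prev + sigm(i) * np.tanh(g)  # gated reset + gated write
    h_t = sigm(o) * np.tanh(c_t)                   # gated read
    return h_t, c_t

n_in, n_hid = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):   # run over a 6-frame sequence
    h, c = lstm_step(x_t, h, c, W, b)
assert h.shape == (n_hid,) and c.shape == (n_hid,)
```

The additive update of c_t (rather than repeated multiplication through a squashing nonlinearity) is what lets information survive many time steps.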
LSTM-based SPSS [33 34]
[Figure: TTS pipeline: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT fed to an LSTM, which emits outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database: US English female speaker
Training / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128-unit projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in the cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):
DNN w/ ∆: 50.0   DNN w/o ∆: 14.2                                        Neutral: 35.8   z = 12.0   p < 10⁻¹⁰
                                  LSTM w/ ∆: 30.2   LSTM w/o ∆: 15.6    Neutral: 54.2   z = 5.1    p < 10⁻⁶
DNN w/ ∆: 15.8                    LSTM w/ ∆: 34.0                       Neutral: 50.2   z = −6.2   p < 10⁻⁹
DNN w/ ∆: 28.4                                      LSTM w/o ∆: 33.6    Neutral: 38.0   z = −1.5   p = 0.138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Limitations of DNN-based acoustic modeling
y1
y2
Data samples
NN prediction
bull Unimodality
minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean
bull Lack of variance
minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances
Linear output layer rarr Mixture density output layer [26]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Mixture density network [26]
micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)w2(x1)
y
Weights rarr Softmax activation function
Means rarr Linear activation function
Variances rarr Exponential activation function
sumzj =
4
i=1
hiwij
w1(x) =exp(z1)sum2
m=1 exp(zm)
micro1(x) = z3
σ1(x) = exp(z5)
Inputs of activation function
1-dim 2-mix MDN
w2(x) =exp(z2)sum2
m=1 exp(zm)
micro1(x) = z4
σ2(x) = exp(z6)
NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
DMDN-based SPSS [27]
micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)
micro1(x1) micro2(x1)
σ2(x1)σ1(x1)
w1(x1)
w2(x1)
y
x1
micro1(x2 ) micro2(x2 )
σ2(x2 )σ1(x2 )
w1(x2 ) w2(x2 )
y
x2
micro1(xT ) micro2(xT )
σ1(xT )σ2(xT )
w2(xT )
w1(xT )
y
xT
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]
bull RNN architecture designed to have better memory
bull Uses linear memory cells surrounded by multiplicative gate units
ct
bi xt h tminus
it
tanh
sigm
tanh
bc
xt
h tminus
Input gate
Forget gate
Memory cell
bo xt h tminus
ht
bf xt h tminus
sigm
sigmBlock
Output gate Input gate Write
Output gate Read
Forget gate Reset
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
LSTM-based SPSS [33 34]
x1 x2
TEX
TTe
xtan
alys
isIn
put f
eatu
reex
tract
ion
Dur
atio
npr
edic
tion
SP
EE
CH
Param
etergeneration
Waveform
synthesis
xT
y1 y2 yT
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79
Experimental setup
Database US English female speaker
Train dev set data 34632 amp 100 sentences
Sampling rate 16 kHz
Analysis window 25-ms width 5-ms shift
Linguistic DNN 449features LSTM 289
Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)
4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)
AdaDec [29] on GPU
1 forward LSTM layerLSTM 256 units 128 projection
Asynchronous SGD on CPUs [35]
Postprocessing Postfiltering in cepstrum domain [25]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79
Subjective evaluations
bull Paired comparison test
bull 100 test sentences 5 ratings per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p
500 142 ndash ndash 358 120 lt 10minus10
ndash ndash 302 156 542 51 lt 10minus6
158 ndash 340 ndash 502 -62 lt 10minus9
284 ndash ndash 336 380 -15 0138
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
Samples
bull DNN (wo dynamic features)
bull DNN (w dynamic features)
bull LSTM (wo dynamic features)
bull LSTM (w dynamic features)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79
Experimental setup
bull Almost the same as the previous setup
bull Differences
DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)
DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)
1ndash16 mix
Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79
Subjective evaluation
bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)
bull 173 test sentences 5 subjects per pair
bull Up to 30 pairs per subject
bull Crowd-sourced
1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115
4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109
6times1024 3652 plusmn 01087times1024 3637 plusmn 0129
1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107
(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79
Outline
BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements
Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS
SummarySummary
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Limitations of DNNDMDN-based acoustic modeling
bull Fixed time span for input features
minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs
minus Difficult to incorporate long time span contextual effect
bull Frame-by-frame mapping
minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential
Recurrent connections rarr Recurrent NN (RNN) [31]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
Basic RNN
Output y
Input x
Recurrentconnections
xt
yt yt+1ytminus1
xt+1xtminus1
bull Only able to use previous contextsrarr bidirectional RNN [31]
bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time
minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: gates writes to the memory cell
  − Output gate: gates reads from the memory cell
  − Forget gate: resets the memory cell

[Figure: an LSTM block — memory cell c_t with input gate i_t, forget gate, and output gate, each a sigmoid of x_t, h_{t−1}, and a bias (b_i, b_f, b_o); tanh nonlinearities on the cell input (bias b_c) and on the cell output, producing h_t.]

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 66 of 79
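The gate structure in the diagram corresponds to the standard LSTM update from [32]; a minimal numpy sketch of one time step (random weights, peephole connections omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W maps the concatenated [x_t, h_prev] to the four gate pre-activations.
    """
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input / forget / output gates
    g = np.tanh(g)                                 # candidate cell input
    c_t = f * c_prev + i * g                       # write gated by i, reset by f
    h_t = o * np.tanh(c_t)                         # read gated by o
    return h_t, c_t

D_x, D_h = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4 * D_h, D_x + D_h))
b = np.zeros(4 * D_h)
h, c = np.zeros(D_h), np.zeros(D_h)
h, c = lstm_step(rng.normal(size=D_x), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The linear path c_t = f·c_{t−1} + i·g is what lets information survive many steps: with f near 1 and i near 0, the cell state passes through unchanged instead of being overwritten.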
LSTM-based SPSS [33, 34]

[Figure: pipeline — TEXT → text analysis → input feature extraction (x_1, x_2, …, x_T) → duration prediction → parameter generation (y_1, y_2, …, y_T) → waveform synthesis → SPEECH.]

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 67 of 79
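The boxed stages in the figure can be read as the following chain of functions. Every stage here is a stand-in stub with made-up logic, just to show how data flows from linguistic units to frame-rate acoustic parameters; in the real system the parameter-generation stage is the LSTM itself.

```python
import numpy as np

def text_analysis(text):
    """TEXT -> linguistic units (toy: one unit per character)."""
    return list(text.lower())

def input_features(units):
    """Linguistic units -> numeric input vectors x_1..x_N (toy encoding)."""
    return np.array([[ord(u) % 32, 1.0] for u in units])

def duration_prediction(x):
    """Assign each linguistic unit a duration in frames (toy: 3 frames each)."""
    return np.full(len(x), 3, dtype=int)

def parameter_generation(x, durations):
    """Upsample inputs to frame rate and map them to acoustic params y_1..y_T.

    A fixed linear map stands in for the trained network here.
    """
    frames = np.repeat(x, durations, axis=0)
    A = np.array([[0.1, 0.0], [0.0, 0.1], [0.05, 0.05]])  # stand-in weights
    return frames @ A.T

def waveform_synthesis(y):
    """Acoustic params -> waveform via a vocoder (toy: one sample per frame)."""
    return y.sum(axis=1)

x = input_features(text_analysis("hello"))
y = parameter_generation(x, duration_prediction(x))
wav = waveform_synthesis(y)
print(x.shape, y.shape, wav.shape)  # (5, 2) (15, 3) (15,)
```

Note how duration prediction decouples the linguistic time axis (N units) from the acoustic one (T frames), which is why x and y in the figure share the index T only after upsampling.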
Experimental setup

Database:            US English, female speaker
Training / dev data: 34,632 / 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: 449 (DNN), 289 (LSTM)
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN:                 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:      postfiltering in cepstrum domain [25]

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 68 of 79
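The (Δ, Δ²) in the acoustic feature list are dynamic features appended to each static parameter track. A common way to compute them uses 3-point difference windows, sketched below (the actual regression windows used in the experiments are not stated on the slide and may be wider):

```python
import numpy as np

def append_deltas(c):
    """Append delta and delta-delta features to a static parameter track.

    Uses the simple 3-point windows delta_t = 0.5*(c[t+1] - c[t-1]) and
    delta2_t = c[t-1] - 2*c[t] + c[t+1], with edge frames replicated.
    """
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    delta2 = padded[:-2] - 2.0 * padded[1:-1] + padded[2:]
    return np.hstack([c, delta, delta2])

c = np.linspace(0.0, 1.0, 6).reshape(-1, 1)   # a toy 6-frame, 1-dim track
o = append_deltas(c)
print(o.shape)  # (6, 3): static, delta, delta-delta per frame
```

For a linear ramp like this toy track, the interior Δ² values come out zero, as second differences should.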
Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
  DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral      z         p
    50.0       14.2          –           –          35.8      12.0   < 10⁻¹⁰
     –          –           30.2        15.6        54.2       5.1   < 10⁻⁶
    15.8        –           34.0         –          50.2      −6.2   < 10⁻⁹
    28.4        –            –          33.6        38.0      −1.5     0.138

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 69 of 79
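One standard way to analyse such a paired comparison is a sign test on the non-neutral votes: discard the "neutral" ratings and test whether the remaining preferences split 50/50. The slide does not state which test produced its z values, so the sketch below is an illustration of the method, not a reproduction of the reported numbers.

```python
import math

def preference_z(p_a, p_b, n):
    """z statistic for a sign test on paired preferences.

    p_a, p_b: preference fractions for the two systems; neutral votes are
    dropped. Under H0 the n_ab decisive votes split 50/50, so the count
    for system A is approximately N(n_ab/2, n_ab/4).
    """
    n_ab = round(n * (p_a + p_b))   # decisive (non-neutral) votes
    k_a = round(n * p_a)            # votes preferring system A
    return (k_a - n_ab / 2) / math.sqrt(n_ab / 4)

# First row of the table: 50.0% vs 14.2% with 500 total ratings
# (100 sentences x 5 ratings per pair).
z = preference_z(0.500, 0.142, 500)
print(z > 0)  # positive z: system A (DNN w/ delta) preferred
```

A large positive z favours the first system, a large negative z the second, matching the sign convention in the table's last two rows.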
Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 70 of 79
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Model            Configuration   MOS
HMM              1 mix           3.537 ± 0.113
                 2 mix           3.397 ± 0.115
DNN              4×1024          3.635 ± 0.127
                 5×1024          3.681 ± 0.109
                 6×1024          3.652 ± 0.108
                 7×1024          3.637 ± 0.129
DMDN (4×1024)    1 mix           3.654 ± 0.117
                 2 mix           3.796 ± 0.107
                 4 mix           3.766 ± 0.113
                 8 mix           3.805 ± 0.113
                 16 mix          3.791 ± 0.102

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 62 of 79
Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary
Summary

Statistical parametric speech synthesis
• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 72 of 79
References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.
[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 75 of 79
References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).
[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.
[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.
[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 76 of 79
References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.
[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 77 of 79
References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.
[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 78 of 79
References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts
[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.
[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen · Statistical Parametric Speech Synthesis · June 9th, 2014 · 79 of 79
- Background
- HMM-based statistical parametric speech synthesis (SPSS)
- Flexibility
- Improvements
- Statistical parametric speech synthesis with neural networks
- Deep neural network (DNN)-based SPSS
- Deep mixture density network (DMDN)-based SPSS
- Recurrent neural network (RNN)-based SPSS
- Summary
- Summary
- lstm_samplespdf
- Background
- HMM-based statistical parametric speech synthesis (SPSS)
- Flexibility
- Improvements
- Statistical parametric speech synthesis with neural networks
- Deep neural network (DNN)-based SPSS
- Deep mixture density network (DMDN)-based SPSS
- Recurrent neural network (RNN)-based SPSS
- Summary
- Summary
Summary
Statistical parametric speech synthesis
bull Vocoding + acoustic model
bull HMM-based SPSS
minus Flexible (eg adaptation interpolation)minus Improvements
Vocoding Acoustic modeling Oversmoothing compensation
bull NN-based SPSS
minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79
References I
[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990
[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996
[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009
[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999
[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970
[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983
[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995
[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79
References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).
[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.
[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.
[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.
References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.
[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.
References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.
[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts
[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.
[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.