Vocoding approaches for statistical parametric speech synthesis Ranniery Maia Toshiba Research Europe Limited Cambridge Research Laboratory Speech Synthesis Seminar Series CUED, University of Cambridge, UK March 2nd, 2011

Vocoding approaches for statistical parametric speech synthesismi.eng.cam.ac.uk › foswiki › pub › Main › SeminarsSpeech › vo... · 2011-03-04 · Vocoding approaches for


Vocoding approaches for statistical parametric speech synthesis

Ranniery Maia

Toshiba Research Europe Limited, Cambridge Research Laboratory

Speech Synthesis Seminar Series, CUED, University of Cambridge, UK

March 2nd, 2011


Topics of this presentation

1. Existing methods to generate the speech waveform in statistical parametric speech synthesis

2. An idea for closing the gap between acoustic modeling and waveform generation


Notation and acronyms in this presentation

- Notation
  - $x(n)$: a discrete-time signal
  - $X(z)$: $x(n)$ in the z-transform domain
  - $X(e^{j\omega})$: discrete-time Fourier transform of $x(n)$ (frequency-domain representation of $x(n)$)
  - $|X(e^{j\omega})|$: magnitude response of $x(n)$
  - $\angle X(e^{j\omega})$: phase response of $x(n)$
  - $|X(e^{j\omega})|^2$: power spectrum of $x(n)$
  - $\mathbf{x}$: a vector
  - $\mathbf{X}$: a matrix

- Acronyms
  - OLA: OverLap and Add
  - MELP: Mixed Excitation Linear Prediction
  - STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum
  - FFT: Fast Fourier Transform
  - IFFT: Inverse Fast Fourier Transform
  - LF: Liljencrants-Fant model
  - LP: Linear Prediction
  - PCA: Principal Component Analysis
  - LSP: Line Spectral Pairs


Contents

Introduction

Vocoding methods for statistical parametric speech synthesis
- Fully parametric excitation methods
- Methods that attempt to mimic the LP residual
- Methods that work on source and vocal tract modeling

Joint acoustic modeling and waveform generation for statistical parametric speech synthesis

Conclusion


Statistical parametric speech synthesis

- Speech synthesis methods
  1. Rule-based
     1.1 Parametric
     1.2 Unit concatenation
  2. Corpus-based
     2.1 Unit selection and concatenation
     2.2 Statistical parametric
     2.3 Hybrid


Statistical parametric speech synthesis

1. Advantages
   - Several voices, small data, small footprint, language portability, etc.
2. Unnatural synthesized speech
   2.1 Parametric model of speech production
   2.2 Parameters of the model are averaged

- How to alleviate this unnaturalness?
  1. Statistical modeling
  2. Choice of the speech production model
  3. Choice of the parameters to represent such a model
  4. Way of synthesizing speech with these parameters


Statistical parametric speech synthesis

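- Training time

  [Block diagram: speech waveform → parameter extraction → speech parameters; speech parameters and labels → acoustic model training → acoustic model parameters]

  $$\mathbf{s} = [s(0) \cdots s(N-1)]^\top, \qquad \mathbf{c} = [\mathbf{c}_0^\top \cdots \mathbf{c}_{T-1}^\top]^\top$$

  $$\hat{\lambda}_c = \arg\max_{\lambda_c} p(\mathbf{c} \mid \ell, \lambda_c)$$

- Synthesis time

  [Block diagram: labels and acoustic model parameters → parameter generation → speech parameters → waveform generation → speech waveform]

  $$\hat{\mathbf{c}} = \arg\max_{\mathbf{c}} p(\mathbf{c} \mid \ell, \hat{\lambda}_c), \qquad \mathbf{s} = [s(0) \cdots s(N-1)]^\top$$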


Waveform generation part

1. Choice of the speech production mechanism
   - Simple
     - Speech synthesis filter
     - Excitation
   - Complete
     - Vocal tract, glottal and lip radiation filters
     - Excitation
2. Appropriate parameters for the chosen speech mechanism
   - Good quantization/compression properties
3. Given the speech model and corresponding parameters, design the best way to synthesize the speech signal according to some criteria


Digital speech models

- The complete model [Deller, Jr. et al., 2000]

  [Block diagram with the components: pulse train generator (controlled by the pitch period), glottal filter G(z), gain; uncorrelated noise generator, gain; voiced/unvoiced (V/U) switch; voice volume velocity; vocal tract filter H(z); lip radiation filter L(z); speech signal]

- The simplified model, assumed for LP analysis

  [Block diagram with the components: pulse train generator (controlled by the pitch period), white noise generator, V/U switch, gain, all-pole filter, speech signal]


Standard vocoder for statistical parametric synthesis

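[Block diagram: the labels representing the text to be synthesized drive the trained acoustic models, which generate F0 and the spectral parameters; F0 sets the pitch period P0 of the pulse train t(n); a V/U switch selects t(n) or the white noise w(n) as the excitation e(n); e(n) passes through the synthesis filter, set by the spectral parameters, to give the speech]

- Very simple
  - Analysis: F0 extraction
  - Synthesis: pulse/white noise switch
- Poor speech quality!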
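
The pulse/white noise switch can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation behind the slides: the function name `simple_excitation`, the frame length, the sampling rate, the pulse scaling, and the convention that F0 = 0 marks an unvoiced frame are all hypothetical choices.

```python
import math
import random


def simple_excitation(f0_per_frame, frame_len=80, fs=16000, seed=0):
    """Build e(n): pulse train t(n) in voiced frames, white noise w(n) otherwise.

    Assumption: f0_per_frame holds one F0 value in Hz per frame, 0 = unvoiced.
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0  # sample index at which the next pulse is due
    for frame_idx, f0 in enumerate(f0_per_frame):
        start = frame_idx * frame_len
        for n in range(start, start + frame_len):
            if f0 > 0:
                period = max(1, round(fs / f0))  # pitch period P0 in samples
                if n >= next_pulse:
                    # Amplitude sqrt(P0) keeps the average voiced power near one.
                    excitation.append(math.sqrt(period))
                    next_pulse = n + period
                else:
                    excitation.append(0.0)
            else:
                excitation.append(rng.gauss(0.0, 1.0))  # unit-variance noise
                next_pulse = n + 1  # restart the pulse train after the noise
    return excitation


# Two voiced frames at 200 Hz followed by one unvoiced frame.
e = simple_excitation([200.0, 200.0, 0.0])
```

The hard switch per frame is what the later mixed-excitation methods replace with frequency-dependent blending of the two sources.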


Improved vocoding methods for statistical parametric synthesis

1. Methods that focus solely on the excitation signal
   1.1 Fully parametric excitation models
   1.2 Methods that attempt to mimic the LP residual
2. Methods that focus on source and vocal tract modeling


MELP mixed excitation [Yoshimura et al., 2001]

MELP excitation building part

[Block diagram with the components: pulse train t(n) with position jittering (jitter); Fourier magnitudes; filters H_v(z) and H_p(z) giving the voiced excitation v(n); white noise w(n) and filter H_n(z) giving the unvoiced excitation u(n); bandpass voicing strengths controlling H_p(z) and H_n(z); mixed excitation e(n)]

- Period jitter is derived from the voicing strengths for aperiodic frames
- Fourier magnitudes simulate the glottal filter
- Filters $H_p(z)$ and $H_n(z)$ control the amount of pulse and noise in the final excitation $e(n)$


Pulse and noise shaping filters

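- Filters $H_p(z)$ and $H_n(z)$ switch between noise and pulse excitation according to each band

  $$H_p(z) = \sum_{j=0}^{J-1} \sum_{m=0}^{M} \tilde{\beta}_j\, h_j(m)\, z^{-m}, \qquad H_n(z) = \sum_{j=0}^{J-1} \sum_{m=0}^{M} \left(1 - \tilde{\beta}_j\right) h_j(m)\, z^{-m}$$

  $$\tilde{\beta}_j = \begin{cases} 1 & \text{if } \beta_j \ge 0.5 \\ 0 & \text{if } \beta_j < 0.5 \end{cases}$$

  - $\tilde{\beta}_j$: quantized (hard-decision) version of the voicing strength $\beta_j$
  - $h_j(m)$: bandpass filter coefficients for the $j$-th band
- Bandpass voicing strength for the $j$-th band is obtained from a normalized correlation coefficient

  $$\beta_j = f(r_t), \qquad r_t = \frac{\sum_{n=0}^{N-1} s(n)\, s(n+t)}{\sqrt{\left[\sum_{n=0}^{N-1} s^2(n)\right] \left[\sum_{n=0}^{N-1} s^2(n+t)\right]}}$$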
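
As a rough illustration of the normalized correlation coefficient and the 0.5 threshold, the sketch below evaluates r_t on a synthetic periodic signal. The function names and the test signal are hypothetical, and because the slide does not specify the mapping f(·), the sketch simply uses r_t itself as the voicing strength; that is an assumption, not the MELP definition.

```python
import math


def normalized_correlation(s, t, N):
    """r_t for lag t over N samples; s must hold at least N + t samples."""
    num = sum(s[n] * s[n + t] for n in range(N))
    den = math.sqrt(sum(s[n] ** 2 for n in range(N))
                    * sum(s[n + t] ** 2 for n in range(N)))
    return num / den if den > 0.0 else 0.0


def hard_voicing_decision(beta):
    """Quantize a voicing strength: 1 selects the pulse, 0 selects the noise."""
    return 1 if beta >= 0.5 else 0


# Hypothetical test signal: a sinusoid with a period of exactly 40 samples.
period = 40
s = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
r_at_period = normalized_correlation(s, period, 120)     # lag of one period
r_at_half = normalized_correlation(s, period // 2, 120)  # lag of half a period
```

At a lag of one full period the signal lines up with itself and the band would be treated as voiced; at half a period it is anti-correlated and the decision falls to noise.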


Application to statistical parametric synthesis

Additional parameters for acoustic modeling
1. Bandpass voicing strengths: 5
2. Fourier magnitudes: 10


STRAIGHT excitation [Zen et al., 2007a]

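STRAIGHT vocoder: excitation construction ⇒ no phase manipulation case

[Block diagram, frequency domain: the pulse spectrum T(e^{jω}), with magnitude |T(e^{jω})| = 1 and zero phase, is multiplied by |H_p(e^{jω})| to give the voiced excitation V(e^{jω}); the noise spectrum W(e^{jω}), with magnitude |W(e^{jω})| = 1 and random phase, is multiplied by |H_n(e^{jω})| to give the unvoiced excitation U(e^{jω}); |H_p| and |H_n| are derived from the aperiodicity parameters; the mixed excitation E(e^{jω}) is converted to e(n) by IFFT]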


STRAIGHT vocoder for statistical parametric synthesis

- Aperiodicity parameters are extracted and averaged over specified frequency sub-bands
  - Band-aperiodicity parameters (BAP)
- At synthesis time the generated BAP are converted into aperiodicity
- Speech is synthesized in the frequency domain
- Achieves very good quality
- Additional parameters for acoustic modeling
  - BAP: usually 5 coefficients


Pulse and noise weighting filters

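- Filters $H_p(e^{j\omega})$ and $H_n(e^{j\omega})$ shape the pulse and noise inputs, just like in MELP
- Frequency responses are obtained from the aperiodicity parameters $a(\omega)$

  $$\left|H_p(e^{j\omega})\right| = \sqrt{1 - a(\omega)}, \qquad \angle H_p(e^{j\omega}) = 0, \qquad 0 \le \omega \le \pi$$

  $$\left|H_n(e^{j\omega})\right| = \sqrt{a(\omega)}, \qquad \angle H_n(e^{j\omega}) = 0, \qquad 0 \le \omega \le \pi$$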
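
A small numerical sketch of these two weighting curves follows. The function name `mixing_weights` and the sample aperiodicity values are hypothetical; the only thing taken from the slide is the pair of square-root formulas.

```python
import math


def mixing_weights(aperiodicity):
    """Return (|H_p|, |H_n|) per frequency bin from a(omega) in [0, 1]."""
    hp = [math.sqrt(1.0 - a) for a in aperiodicity]  # pulse weight
    hn = [math.sqrt(a) for a in aperiodicity]        # noise weight
    return hp, hn


# Hypothetical aperiodicity values rising with frequency.
a = [0.0, 0.1, 0.5, 0.9, 1.0]
hp, hn = mixing_weights(a)
power = [p * p + q * q for p, q in zip(hp, hn)]  # |H_p|^2 + |H_n|^2 per bin
```

Because the squared weights sum to one in every bin, moving energy between the pulse and the noise does not change the total excitation power at that frequency, assuming unit-magnitude inputs.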


Band aperiodicity parameters

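- Aperiodicity at frequency $\omega$

  $$a(\omega) = \frac{\int w_{ERB}(\lambda;\omega)\, \left|S(e^{j\lambda})\right|^2\, \Upsilon\!\left(\frac{\left|S_L(e^{j\lambda})\right|^2}{\left|S_U(e^{j\lambda})\right|^2}\right) d\lambda}{\int w_{ERB}(\lambda;\omega)\, \left|S(e^{j\lambda})\right|^2 d\lambda}$$

  - $|S(e^{j\omega})|$: speech spectral envelope
  - $|S_U(e^{j\omega})|$: envelope constructed by connecting the peaks of $|S(e^{j\omega})|$
  - $|S_L(e^{j\omega})|$: envelope constructed by connecting the valleys of $|S(e^{j\omega})|$
  - $w_{ERB}(\lambda;\omega)$: auditory filter to smooth $|S(e^{j\omega})|$
  - $\Upsilon(\cdot)$: look-up table operation

- Band-aperiodicity

  $$b_j = \frac{1}{\Omega_j} \int_{\Omega_j} a(\omega)\, d\omega$$

  - $\Omega_j$: $j$-th frequency band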
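
On a sampled frequency axis the band-aperiodicity integral reduces to a mean over the bins that fall in each band. The sketch below assumes uniformly spaced bins and half-open bands [lo, hi); the function name, the synthetic aperiodicity curve, and the band edges are illustrative choices only.

```python
def band_aperiodicity(a, freqs_hz, band_edges_hz):
    """Average a(omega) over each band [lo, hi); a and freqs_hz are aligned."""
    bands = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        values = [ak for ak, f in zip(a, freqs_hz) if lo <= f < hi]
        bands.append(sum(values) / len(values) if values else 0.0)
    return bands


# Hypothetical aperiodicity a = f / 8000 sampled every 100 Hz up to 7900 Hz.
freqs = [100.0 * k for k in range(80)]
a = [f / 8000.0 for f in freqs]
edges = [0, 1000, 2000, 4000, 6000, 8000]  # five bands
b = band_aperiodicity(a, freqs, edges)
```

Five numbers per frame are far easier to model statistically than a full aperiodicity curve, at the cost of a staircase approximation when the values are expanded again at synthesis time.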


Aperiodicity and band aperiodicity: examples

- 5 bands: 0-1 kHz, 1-2 kHz, 2-4 kHz, 4-6 kHz, 6-8 kHz

  [Plot: aperiodicity and band-aperiodicity, log magnitude versus frequency (0 to 8000 Hz)]

- 24 Bark critical bands

  [Plot: aperiodicity and band-aperiodicity, log magnitude versus frequency (0 to 8000 Hz)]


State-dependent mixed excitation [Maia et al., 2007]

[Block diagram: the labels select the trained HMMs, which give the HMM state sequence $s_0, s_1, \ldots, s_{Q-1}$ and the state durations; each state $q$ carries its own filter pair $H_v^{(q)}(z)$, $H_u^{(q)}(z)$ and generates the F0 values $F0_0^{(q)}, \ldots, F0_{T-1}^{(q)}$ and the spectral parameters $c_0^{(q)}, \ldots, c_{T-1}^{(q)}$; the generated F0 drives the pulse train generator, whose output t(n) is filtered by $H_v(z)$ to give the voiced excitation v(n); white noise w(n) is filtered by $H_u(z)$ to give the unvoiced excitation u(n); the mixed excitation e(n) passes through the synthesis filter H(z), set by the generated spectral parameters, to give the synthesized speech]

Filters:

$$H_v(z) = \sum_{m=-M/2}^{M/2} h(m)\, z^{-m}, \qquad H_u(z) = \frac{1}{\sum_{l=0}^{L} g(l)\, z^{-l}}$$


State-dependent mixed excitation: training

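[Block diagram: the pulse train t(n), defined by its pulse positions and amplitudes, is filtered by $H_v(z)$ to give the voiced excitation v(n); v(n) is subtracted from the residual e(n) (target signal) to give the unvoiced excitation u(n); u(n) is filtered by $G(z) = 1/H_u(z)$ to give the white noise w(n) (error signal)]

- Filter coefficients

  $$\mathbf{h} = \left[h\!\left(-\tfrac{M}{2}\right) \cdots h\!\left(\tfrac{M}{2}\right)\right]^\top, \qquad \mathbf{g} = \left[g(0) \cdots g(L)\right]^\top$$

- and pulse positions and amplitudes

  $$\{p_0, \ldots, p_{J-1}\}, \qquad \{a_0, \ldots, a_{J-1}\}$$

- are optimized so that

  $$\{\hat{\mathbf{h}}, \hat{\mathbf{g}}, \hat{t}(n)\} = \arg\max_{\mathbf{g}, \mathbf{h}, t(n)} P\left(e(n) \mid \mathbf{g}, \mathbf{h}, t(n)\right)$$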


State-dependent mixed excitation: synthesis

[Block diagram with the components: pulse train generator driven by F0, giving t(n); state-dependent filter $H_v^{(q)}(z)$ giving v(n); gain α giving αv(n); white noise w(n); state-dependent filter $H_u^{(q)}(z)$; high-pass filter $H_{HP}(z)$; time modulation ρ(n) giving u(n); state sequence $s_q = \{s_0, \ldots, s_{Q-1}\}$ selecting the filters; excitation e(n)]

- Noise component is colored through
  1. High-pass filtering ($F_c$ = 2 kHz)
  2. Time modulation with a pitch-synchronous triangular window: $\rho(n)$
- $H_v(z)$ is normalized in energy
- Gain $\alpha$ adjusts the energy of the voiced component so that the power of the excitation signal $e(n)$ becomes one


Voiced filter effect

[Figure, six panels. Top row, amplitude versus time (ms): pulse train, voiced filter, voiced excitation. Bottom row, magnitude (dB) versus frequency (0 to 8000 Hz): pulse train, voiced filter, voiced excitation.]


Unvoiced filter effect

[Figure, six panels. Top row, amplitude versus time (ms): white noise, unvoiced filter, unvoiced excitation. Bottom row, magnitude (dB) versus frequency (0 to 8000 Hz): white noise, unvoiced filter, unvoiced excitation.]


Deterministic plus stochastic residual modeling [Drugman et al., 2009]

- Assumed model of the LP residual $e(n)$

  $$e(n) = e_d(n) + e_s(n)$$

  - $e_d(n)$: deterministic part
  - $e_s(n)$: stochastic part
- Maximum voiced frequency $F_m$
  - Boundary between deterministic and stochastic components
  - Set to 4 kHz


Deterministic modeling: eigenresidual calculation

[Block diagram with the stages: speech data; spectral parameters extraction; F0 extraction; inverse filtering giving the residual; GCI detection; windowing giving pitch-period residual segments; resampling and energy normalization; PCA giving the eigenresiduals]

- Normalized frequency $F_0^*$ for resampling the residual segments

  $$F_0^* \le \frac{F_{Nyquist}}{F_m}\, F_{0,\min}$$

- PCA: eigenresiduals explain about 80% of the total dispersion


Stochastic modeling

- Stochastic component model

  $$e_s(n) = \rho(n) \left[h_u(n) * w(n)\right]$$

  - $\rho(n)$: pitch-synchronous modulation window
  - $h_u(n)$: AR filter impulse response
  - $w(n)$: white noise
- Unvoiced filter $h_u(n)$
  - Fixed
  - Auto-regressive (all-pole)
  - Coefficients obtained through LP analysis
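
The stochastic component model can be sketched as white noise passed through an all-pole recursion and then multiplied by a window repeated every pitch period. Everything specific below is an assumption made for illustration: the function names, the first-order AR coefficient, the constant pitch period, and the placement of the window peak at mid-period.

```python
import random


def ar_filter(x, a):
    """All-pole filter: y(n) = x(n) - sum_k a[k-1] * y(n-k), k = 1..len(a)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def triangular_window(period):
    """Triangular rho(n) over one pitch period, rising from 0 to a peak of 1."""
    half = period / 2.0
    return [1.0 - abs(n - half) / half for n in range(period)]


def stochastic_component(num_periods, period, ar_coeffs, seed=0):
    """e_s(n) = rho(n) * (h_u * w)(n) for a constant pitch period."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(num_periods * period)]  # w(n)
    colored = ar_filter(w, ar_coeffs)                               # h_u * w
    rho = triangular_window(period)
    return [rho[n % period] * colored[n] for n in range(len(colored))]


impulse_response = ar_filter([1.0, 0.0, 0.0, 0.0], [-0.5])
es = stochastic_component(num_periods=3, period=80, ar_coeffs=[-0.5])
```

The window gives the noise a pitch-synchronous energy envelope, which plain filtered noise lacks.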


Application to statistical parametric synthesis

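- Additional parameters for acoustic modeling
  - PCA weights: 15
- Use of eigenresiduals of higher rank makes no difference
  ⟹ Optionally, the stream of PCA weights can be removed
- Synthesis part

  [Block diagram with the components: PCA weights and eigenresiduals; linear combination; resampling (using F0) giving the deterministic component $r_d(n)$; white noise w(n); filter $H_u(z)$; time modulation $\rho(n)$ giving the stochastic component $r_s(n)$; OLA; excitation e(n)]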


Waveform interpolation [Sung et al., 2010]

- Waveform interpolation (WI)
  - Each cycle of the excitation signal is represented by a characteristic waveform (CW)

    $$e(n) = \sum_{k=0}^{P/2} \left[A_k \cos\left(\frac{2\pi k n}{P}\right) + B_k \sin\left(\frac{2\pi k n}{P}\right)\right]$$

  - $A_k$, $B_k$: discrete-time Fourier series coefficients
  - $P$: pitch period
- CW extracted from the LP residual at a fixed rate
- Information to reconstruct the excitation signal
  - Pitch period: $P$
  - CW coefficients: $A_k$, $B_k$
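
Reconstructing one excitation cycle from the CW coefficients is a direct evaluation of the Fourier series above. The sketch below is a minimal illustration; the function name and the toy coefficient values are hypothetical.

```python
import math


def characteristic_waveform(A, B, P):
    """One pitch cycle e(n), n = 0..P-1, from Fourier series coefficients.

    A[k] and B[k] are the cosine and sine coefficients for k = 0..P//2.
    """
    return [
        sum(A[k] * math.cos(2.0 * math.pi * k * n / P)
            + B[k] * math.sin(2.0 * math.pi * k * n / P)
            for k in range(len(A)))
        for n in range(P)
    ]


P = 8
A = [0.0, 1.0, 0.0, 0.0, 0.0]  # only the fundamental cosine term is active
B = [0.0, 0.0, 0.0, 0.0, 0.0]
cycle = characteristic_waveform(A, B, P)
```

Since the number of coefficients grows with P, a fixed-size representation is needed before the coefficients can be modeled statistically.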


Application to statistical parametric synthesis

- Analysis is similar to eigenresidual calculation [Drugman et al., 2009]

  [Block diagram with the stages: speech data; WI analysis giving the CW coefficients; PCA; basis vector selection; basis vector coefficient computation giving the basis vector coefficients]

- Additional parameters for acoustic modeling
  - Coefficients of the basis vectors: 8
- Synthesis

  [Block diagram with the stages: basis vector coefficients; CW reconstruction; WI synthesis (using the pitch); speech]


Glottal inverse filtering [Raitio et al., 2008]

- Uses Iterative Adaptive Inverse Filtering (IAIF) [Alku, 1992]

  [Block diagram with the components: speech signal; F0 extraction; energy extraction; glottal inverse filtering (IAIF) giving the vocal tract spectrum V(z) and the estimated voice source signal g(n); harmonic-to-noise ratio (HNR) extraction; LP analysis giving the voice source spectrum G(z); features]

- Features for acoustic modeling
  1. F0
  2. Energy
  3. HNR in 4 bands: 0-2 kHz, 2-4 kHz, 4-6 kHz, 6-8 kHz
  4. Voice source spectrum ⟹ glottal flow ⟹ 10 LSPs
  5. Vocal tract spectrum: 30 LSPs


At synthesis time

[Block diagram with the components: library pulse; interpolation (using F0); gain (using the energy); white noise; gain; V/U switch; noise addition (using the HNR); spectral matching (using the glottal source spectrum) through G(z); lip radiation filter L(z); excitation]

- Library pulse extracted from the speech data through glottal inverse filtering
- Noise is separately added to each band of the voiced excitation according to HNR
- G(z) implements the spectral shape of the glottal pulse
- L(z) is a fixed lip radiation filter


Glottal spectrum separation (GSS) [Cabral et al., 2008]

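- Speech production model

  $$S(e^{j\omega}) = D(e^{j\omega})\, G(e^{j\omega})\, V(e^{j\omega})\, R(e^{j\omega})$$

  - $D(e^{j\omega})$: pulse train
  - $G(e^{j\omega})$: glottal pulse
  - $V(e^{j\omega})$: vocal tract
  - $R(e^{j\omega})$: lip radiation
- Simplified speech production model

  $$S(e^{j\omega}) = D(e^{j\omega})\, V(e^{j\omega})$$

- What GSS does
  1. Estimate a model of the glottal flow derivative ⟹ $E(e^{j\omega})$
  2. Remove its effect from the speech spectral envelope ⟹ $H(e^{j\omega})$

     $$V(e^{j\omega}) = \frac{H(e^{j\omega})}{E(e^{j\omega})}$$

  3. Re-synthesize speech

     $$S(e^{j\omega}) = D(e^{j\omega})\, E(e^{j\omega})\, \frac{H(e^{j\omega})}{E(e^{j\omega})} = D(e^{j\omega})\, E(e^{j\omega})\, V(e^{j\omega})$$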
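
Steps 2 and 3 amount to a per-bin division followed by a per-bin multiplication of spectra. The sketch below shows only that arithmetic on toy 4-bin spectra; the function names and all numeric values are hypothetical, the pulse train D is left out, and a practical system would also need to guard against bins where the source spectrum is close to zero.

```python
def separate_vocal_tract(H, E):
    """V = H / E per frequency bin (spectra of equal length, E nonzero)."""
    return [h / e for h, e in zip(H, E)]


def resynthesize_envelope(E_new, V):
    """Envelope obtained by re-inserting a (possibly modified) source E_new."""
    return [e * v for e, v in zip(E_new, V)]


# Hypothetical 4-bin complex spectra.
H = [4.0 + 0.0j, 2.0 + 2.0j, 1.0 - 1.0j, 0.5 + 0.0j]   # spectral envelope
E = [2.0 + 0.0j, 1.0 + 1.0j, 0.5 + 0.0j, 0.25 + 0.0j]  # glottal source model
V = separate_vocal_tract(H, E)
H_back = resynthesize_envelope(E, V)  # unmodified source: recovers H
```

With an unmodified source the round trip returns the original envelope; the point of the separation is that the source model can be changed before it is multiplied back in.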


Glottal flow model utilized in GSS

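- LF model is used to represent the glottal pulse

  [Plot: one period of the LF-model waveform versus time, with the instants $t_0$, $t_p$, $t_e$, $t_a$, $t_c$ and the period $T_0$ marked on the time axis, the interval $T_a$, and the amplitude $E_e$]

- Parameters
  - $t_c$: instant of complete closure
  - $t_p$: instant of maximum flow
  - $t_e$: instant of maximum excitation
  - $T_a = t_a - t_e$
  - $T_0$: fundamental period
  - $E_e$: amplitude of maximum excitation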


Application to statistical parametric synthesis

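[Block diagram with the components: speech; inverse filtering giving the glottal flow derivative; LF-model parameterization giving the LF-model parameters and $e_{LF}(n)$; FFT giving $E_{LF}(e^{j\omega})$; STRAIGHT giving the speech spectral envelope $H(e^{j\omega})$; spectral separation giving $V(e^{j\omega})$ and the spectral parameters; features]

$$V(e^{j\omega}) = \frac{H(e^{j\omega})}{E_{LF}(e^{j\omega})}$$

- Acoustic modeling
  1. Spectral parameters: mel-cepstral coefficients
  2. Band aperiodicity parameters
  3. 5 LF-model parameters: $t_e$, $t_p$, $T_a$, $E_e$, $T_0$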


Synthesis part

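[Block diagram: the LF-model generator, driven by the LF-model parameters, gives $e_{LF}(n)$, whose FFT $E_{LF}(e^{j\omega})$ is multiplied by $|H_p(e^{j\omega})|$; the white noise w(n) gives $W(e^{j\omega})$ by FFT, which is multiplied by $|H_n(e^{j\omega})|$; both weights are derived from the aperiodicity parameters; the result is converted by IFFT to the mixed excitation]

- STRAIGHT vocoder is utilized
- Original delta pulse is replaced by the pulse created by the generated LF-model parameters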


Vocal tract and LF-model plus noise separation [Lanchantin et al., 2010]

- Assumed speech model

  $$S(e^{j\omega}) = \left[D(e^{j\omega})\, G(e^{j\omega}) + W(e^{j\omega})\right] V(e^{j\omega})\, R(e^{j\omega})$$

  - $D(e^{j\omega})$: pulse train
  - $G(e^{j\omega})$: glottal pulse
  - $W(e^{j\omega})$: white noise
  - $V(e^{j\omega})$: vocal tract
  - $R(e^{j\omega})$: lip radiation
- Parameterization
  1. Fundamental frequency ⇒ $D(e^{j\omega})$
  2. LF model parameter ⇒ $G(e^{j\omega})$
  3. Noise power ⇒ $W(e^{j\omega})$
  4. Mel-cepstral coefficients ⇒ $V(e^{j\omega})$


Application to statistical parametric synthesis

1. Estimate a single parameter for a simplified LF model: $R_d$
2. Determine a maximum voiced frequency $\omega_{VU}$
3. Estimate power of the noise $W(e^{j\omega})$: $\sigma_g^2$
4. Estimate vocal tract parameters

   $$V(e^{j\omega}) = \begin{cases} \tau_o\!\left(\dfrac{S(e^{j\omega})}{R(e^{j\omega})\, G(e^{j\omega})}\right) \gamma^{-1}, & \omega < \omega_{VU} \\[2ex] C_o\!\left(\dfrac{S(e^{j\omega})}{R(e^{j\omega})\, G(e^{j\omega_{VU}})}\right) \Gamma^{-1}, & \omega \ge \omega_{VU} \end{cases}$$

   - $\tau_o$: cepstral analysis by fitting the harmonic peaks
   - $C_o$: power cepstrum
   - $\Gamma$, $\gamma$: normalization terms

- Acoustic modeling
  1. Simplified LF-model parameter: $R_d$
  2. Standard deviation of the noise component: $\sigma_g$
  3. Cepstral coefficients that represent $V(e^{j\omega})$
  4. F0


Synthesis time

- Excitation construction for voiced segments

  [Block diagram with the components: glottal pulse generator (driven by $R_d$) giving g(n); FFT giving $G(e^{j\omega})$; white noise w(n); windowing and time modulation; FFT giving $W(e^{j\omega})$; multiplication by the high-pass response $|H_{HP}(e^{j\omega})|$ (set by $\omega_{VU}$); excitation $E(e^{j\omega})$]

- Excitation construction for unvoiced segments

  [Block diagram: white noise w(n) → windowing → FFT → excitation $E(e^{j\omega})$]

- Synthesized speech signal

  $$S(e^{j\omega}) = E(e^{j\omega})\, V(e^{j\omega})\, R(e^{j\omega})$$

Vocoding methods: summary and examples

Method      Description
Simple      Pulse train/white noise simple switch
MELP        MELP mixed excitation
STRAIGHT    STRAIGHT mixed excitation
SDF         State-dependent filtering mixed excitation
DSM         Deterministic plus stochastic model of the residual
WI          Waveform interpolation applied to statistical parametric synthesis
GlottHMM    Glottal inverse filtering
GSS         Glottal spectrum separation
SVLN        Separation of vocal tract and LF-model plus noise

[The Sample column of the original slide held audio examples, which are not part of this transcript.]


Joint vocoding-acoustic modeling

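- The goal of any speech synthesizer is to reproduce the speech waveform
- Parameters of a joint acoustic-excitation model are estimated by maximizing the probability of the speech waveform

  [Block diagram: speech waveform $\mathbf{s}$ → joint acoustic-excitation model training (spectral parameter $\mathbf{c}$ as hidden variable) → $\lambda$]

  $$\mathbf{s} = [s(0) \cdots s(N-1)]^\top, \qquad \mathbf{c} = [\mathbf{c}_0^\top \cdots \mathbf{c}_{T-1}^\top]^\top$$

  $$\hat{\lambda} = \arg\max_{\lambda} p(\mathbf{s} \mid \ell, \lambda) = \arg\max_{\lambda} \int p(\mathbf{s} \mid \mathbf{c}, \lambda)\, p(\mathbf{c} \mid \ell, \lambda)\, d\mathbf{c}$$

- $\lambda = \{\lambda_c, \lambda_e\}$: acoustic-excitation model parameters
  - $\lambda_c$: acoustic model part
  - $\lambda_e$: excitation model part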


Another viewpoint: waveform-level modeling

- Comparison with typical modeling for parametric synthesis
  - Typical: state sequence is the hidden variable, spectrum is the observation (graphical model with nodes q and c)
  - New model: state sequence and spectrum are hidden variables, speech is the observation (graphical model with nodes q, c and s)
- Similar concepts
  - [Toda and Tokuda, 2008]: factor analyzed trajectory HMM for spectral estimation
  - [Wu and Tokuda, 2009]: closed-loop training for HMM-based synthesis


A close look into the probabilities involved

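- Typical statistical modeling for parametric synthesis

  $$\hat{\lambda}_c = \arg\max_{\lambda_c} \sum_{\forall q} p(\mathbf{c} \mid q, \lambda_c)\, p(q \mid \ell, \lambda_c)$$

- Augmented statistical modeling

  $$\hat{\lambda} = \arg\max_{\lambda} \sum_{\forall q} \int p(\mathbf{s} \mid \mathbf{c}, q, \lambda)\, p(\mathbf{c} \mid q, \lambda)\, p(q \mid \ell, \lambda)\, d\mathbf{c}$$

  - $p(\mathbf{s} \mid \mathbf{c}, q, \lambda)$: speech generative model (speech production from spectrum)
  - $p(\mathbf{c} \mid q, \lambda)\, p(q \mid \ell, \lambda)$: can be modeled by existing machines, e.g. HMM, HSMM, trajectory HMM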


One possible speech generative model

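[Block diagram: the pulse train t(n), with positions $p_0, \ldots, p_{Z-1}$ and amplitudes $a_0, \ldots, a_{Z-1}$, is filtered by $H_v(z)$ to give the voiced excitation v(n); Gaussian white noise w(n) (zero mean, variance one) is filtered by $H_u(z)$ to give the unvoiced excitation u(n); the excitation e(n) is filtered by $H_c(z)$ to give the speech s(n)]

Probability of the speech signal

$$p(\mathbf{s} \mid \mathbf{H}_c, q, \lambda_e) = |\mathbf{H}_c|^{-1}\, \mathcal{N}\!\left(\mathbf{H}_c^{-1} \mathbf{s};\ \mathbf{H}_{v,q}\, \mathbf{t},\ \boldsymbol{\Phi}_q\right), \qquad \boldsymbol{\Phi}_q = \left(\mathbf{G}_q^\top \mathbf{G}_q\right)^{-1}$$

$\lambda_e = \{\mathbf{H}_v, \mathbf{G}, \mathbf{t}\}$: excitation model parameters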


Vocal tract filter impulse response and spectral parameters: relationship

- We need $p(\mathbf{s} \mid \mathbf{c}, q, \lambda)$, not $p(\mathbf{s} \mid \mathbf{H}_c, q, \lambda)$
  - Mapping between $\mathbf{H}_c$ and $\mathbf{c}$ is necessary!
- Two possibilities
  1. Relationship between $\mathbf{H}_c$ and $\mathbf{c}$ represented as a Gaussian process
  2. Relationship between $\mathbf{H}_c$ and $\mathbf{c}$ is deterministic
     - Cepstral coefficients!


Acoustic modeling

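I Trajectory HMM [Zen et al., 2007b] for acoustic modeling

$$p(c \mid \ell, \lambda_c) = \sum_{q} \underbrace{p(c \mid q, \lambda_c)}_{\mathcal{N}(c;\, \bar{c}_q,\, P_q)} \; \underbrace{p(q \mid \ell, \lambda_c)}_{\pi_{q_0} \prod_{t=0}^{T-1} \alpha_{q_t q_{t+1}}}$$

$$\bar{c}_q = P_q r_q$$

$$R_q = P_q^{-1} = W^\top \Sigma_q^{-1} W$$

$$r_q = W^\top \Sigma_q^{-1} \mu_q$$

W: appends dynamic features to c

I Why: modeling of $p(c \mid \ell, \lambda_c)$ instead of $p(Wc \mid \ell, \lambda_c)$ (conventional HMM)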
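The relations above can be checked numerically on a toy case. The sketch below builds a window matrix W for static plus delta features, forms R_q = WᵀΣ_q⁻¹W and r_q = WᵀΣ_q⁻¹μ_q, and solves R_q c̄_q = r_q for the trajectory mean. The one-dimensional static feature, the delta window 0.5·(c[t+1] − c[t−1]) truncated at the utterance edges, the diagonal Σ_q, and all function names are assumptions for illustration.

```python
def build_window_matrix(T):
    """W (2T x T): per frame, one static row and one delta row."""
    W = []
    for t in range(T):
        static = [0.0] * T
        static[t] = 1.0
        delta = [0.0] * T
        if t > 0:
            delta[t - 1] = -0.5
        if t < T - 1:
            delta[t + 1] = 0.5
        W.append(static)
        W.append(delta)
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def trajectory_mean(W, mu, var):
    """Return (c_bar, R, r) with R = W^T S^-1 W, r = W^T S^-1 mu, S = diag(var)."""
    rows, T = len(W), len(W[0])
    R = [[sum(W[i][a] * W[i][b] / var[i] for i in range(rows))
          for b in range(T)] for a in range(T)]
    r = [sum(W[i][a] * mu[i] / var[i] for i in range(rows)) for a in range(T)]
    return solve(R, r), R, r


# Toy check (hypothetical values): if mu = W c_true, then c_bar recovers c_true.
T = 4
W = build_window_matrix(T)
c_true = [1.0, 2.0, 4.0, 3.0]
mu = [sum(W[i][t] * c_true[t] for t in range(T)) for i in range(2 * T)]
var = [1.0, 0.1] * T  # static and delta variances per frame
c_bar, R, r = trajectory_mean(W, mu, var)
print(c_bar)
```

When the state means are exactly consistent with some static trajectory, that trajectory is returned unchanged. In general the static and delta means disagree, and c̄_q is the variance-weighted compromise between them.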


Joint acoustic-excitation model

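Final joint model

$$p(s \mid \ell, \lambda) = \sum_{q} \int \underbrace{p(s \mid c, q, \lambda_e)}_{\text{excitation model}} \; \underbrace{p(c \mid q, \lambda_c)\, p(q \mid \ell, \lambda_c)}_{\text{acoustic model}} \, dc$$

Excitation model: $p(s \mid c, q, \lambda_e) = |H_c|^{-1}\, \mathcal{N}\left(H_c^{-1} s;\; H_{v,q}\, t,\; \Phi_q\right)$

Acoustic model: $p(c \mid q, \lambda_c)\, p(q \mid \ell, \lambda_c) = \mathcal{N}(c;\, \bar{c}_q,\, P_q)\; \pi_{q_0} \prod_{t=0}^{T-1} \alpha_{q_t q_{t+1}}$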


Training procedure

[Figure: flowchart of the training procedure, with two stages. Initialization: speech, initial cepstral analysis, initial cepstrum vector c, trajectory HMM training (R_q, r_q), excitation model training (H_v(z), G(z), t(n)), state sequence q. Recursion: estimation of the best cepstral coefficients, cepstrum vector c, trajectory HMM training (R_q, r_q), excitation model training (H_v(z), G(z), t(n)), state sequence q.]


Contents

Introduction

Vocoding methods for statistical parametric speech synthesis
  Fully parametric excitation methods
  Methods that attempt to mimic the LP residual
  Methods that work on source and vocal tract modeling

Joint acoustic modeling and waveform generation for statistical parametric speech synthesis

Conclusion


Conclusions

I Quality improvement of statistical parametric synthesizers through better waveform generation methods

I Existing approaches that use source-filter modeling can be classified into
  I Methods that attempt to improve the excitation signal solely
  I Methods that focus on the speech production model as a whole

I Naturalness degradation of statistical parametric synthesizers has basically two causes
  I Acoustic modeling that produces averaged parameter trajectories
  I Use of parametric speech production models

I Methods which can integrate both acoustic modeling and speech production


Acknowledgments

Many thanks to
I João Cabral, from University College Dublin, Ireland
I Tuomo Raitio, from Helsinki University of Technology, Finland
I Thomas Drugman, from Faculté Polytechnique de Mons, Belgium
I Pierre Lanchantin, from IRCAM, France
I June Sig Sung, from Seoul National University, South Korea
for providing samples of their systems.


References I

Alku, P. (1992). Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2–3):109–118.

Cabral, J., Renals, S., Richmond, K., and Yamagishi, J. (2008). Glottal spectral separation for parametric speech synthesis. In Proc. of Interspeech, pages 1829–1832.

Deller, Jr., J. R., Hansen, J. H. L., and Proakis, J. G. (2000). Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue, New York.

Drugman, T., Wilfart, G., and Dutoit, T. (2009). A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In Proc. of Interspeech, pages 1779–1782.

Lanchantin, P., Degottex, G., and Rodet, X. (2010). An HMM-based speech synthesis system using a new glottal source and vocal-tract separation method. In Proc. of ICASSP, pages 4630–4633.

Maia, R., Toda, T., Zen, H., Nankaku, Y., and Tokuda, K. (2007). An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. of the 6th ISCA Workshop on Speech Synthesis, pages 131–136.

Raitio, T., Suni, A., Pulakka, H., Vainio, M., and Alku, P. (2008). HMM-based Finnish text-to-speech system using glottal inverse filtering. In Proc. of Interspeech, pages 1881–1884.


References II

Sung, J., Kyung, D., Oh, H., and Kim, N. (2010). Excitation modeling based on waveform interpolation for HMM-based speech synthesis. In Proc. of Interspeech, pages 813–816.

Toda, T. and Tokuda, K. (2008). Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. of ICASSP, pages 3925–3928.

Wu, Y. J. and Tokuda, K. (2009). Minimum generation error training by using original spectrum as reference for log spectral distortion measure. In Proc. of ICASSP, pages 4013–4016.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (2001). Mixed-excitation for HMM-based speech synthesis. In Proc. of EUROSPEECH.

Zen, H., Toda, T., Nakamura, M., and Tokuda, K. (2007a). Details of the Nitech HMM-based speech synthesis for Blizzard Challenge 2005. IEICE Trans. on Inf. and Systems, E90-D(1):325–333.

Zen, H., Tokuda, K., and Kitamura, T. (2007b). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequence. Computer Speech and Language, 21(1):153–173.
