Page 1

E6820 SAPR - Speech models - Dan Ellis 2001-02-27 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 6: Speech modeling and synthesis

Modeling speech signals

Spectral and cepstral models

Linear Predictive models (LPC)

Other signal models

Speech synthesis

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/courses/e6820-2001-01/


Page 2

The speech signal

• Elements of the speech signal:

- spectral resonances (formants, moving)
- periodic excitation (voicing, pitched) + pitch contour
- noise excitation (fricatives, unvoiced, no pitch)
- transients (stop-release bursts)
- amplitude modulation (nasals, approximants)
- timing!

[Figure: waveform and spectrogram of the phrase 'watch thin as a dime' with aligned phone labels]

Page 3

The source-filter model

• Notional separation of:
- source: excitation, fine t-f structure
- filter: resonance, broad spectral structure

• More a modeling approach than a model

[Block diagram: a voiced/unvoiced switch selects between a glottal pulse train (controlled by pitch) and frication noise as the source; the vocal tract resonances (controlled by formants) plus the radiation characteristic form the filter, producing speech]

Page 4

Signal modeling

• Signal models are a kind of representation:
- to make some aspect explicit
- for efficiency
- for flexibility

• Nature of the model depends on the goal:
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

• But commonalities emerge:
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- the modification domain will usually reflect 'independent' perceptual attributes
- all are getting at the abstract information in the signal

Page 5

Different influences for signal models

• Receiver:
- see how the signal is treated by listeners
→ cochlea-style filterbank models

• Transmitter (source):
- the physical apparatus can generate only a limited range of signals...
→ LPC models of vocal tract resonances

• Making particular aspects explicit:
- compact, separable resonance correlates → cepstrum
- modeling prominent features of the NB spectrogram → sinusoid models
- addressing unnaturalness in synthesis → H+N model

Page 6

Applications of (speech) signal models

• Classification / matching
Goal: highlight important information
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

• Coding / transmission / storage
Goal: represent just enough information
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

• Modification / synthesis
Goal: change certain parts independently
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)

Page 7

Outline

Modeling speech signals

Spectral and cepstral models

- Auditorily-inspired spectra
- The cepstrum
- Feature correlation

Linear predictive models (LPC)

Other models

Speech synthesis


Page 8

Spectral and cepstral models

• Spectrogram seems like a good representation

- long history
- satisfying in use
- experts can 'read' the speech

• What is the information?

- intensity in time-frequency cells; typically 5 ms x 200 Hz x 50 dB

Discarded information:
- phase
- fine-scale timing

• The starting point for other representations


Page 9

The filterbank interpretation of the short-time Fourier transform (STFT)

• Can regard spectrogram rows as coming from separate bandpass filters:

• Mathematically:

X[k, n_0] = Σ_n x[n] · w[n − n_0] · exp( −j 2πk (n − n_0) / N )
          = Σ_n x[n] · h_k[n_0 − n]

where h_k[n] = w[−n] · exp( j 2πk n / N )

- so each spectrogram row is the output of a bandpass filter h_k whose frequency response is the window response shifted to the bin center: H_k(e^jω) = W(e^j(ω − 2πk/N)), centered at ω = 2πk/N

[Figure: sound passed through a bank of bandpass filters, one per spectrogram row]
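The equivalence above is easy to verify numerically; the signal, window, bin, and frame position below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                      # DFT size / window length
x = rng.standard_normal(256)
w = np.hanning(N)           # analysis window w[n], defined on 0..N-1
k, n0 = 5, 100              # DFT bin and frame position

# Direct STFT definition: X[k, n0] = sum_n x[n] w[n - n0] exp(-j 2pi k (n - n0) / N)
n = np.arange(n0, n0 + N)   # the only samples where w[n - n0] is nonzero
X_direct = np.sum(x[n] * w[n - n0] * np.exp(-2j * np.pi * k * (n - n0) / N))

# Filterbank view: X[k, n0] = sum_n x[n] h_k[n0 - n], h_k[n] = w[-n] exp(j 2pi k n / N)
def h_k(t):
    # nonzero only where w[-t] is defined, i.e. 0 <= -t < N
    return w[-t] * np.exp(2j * np.pi * k * t / N) if 0 <= -t < N else 0.0

X_filterbank = sum(x[m] * h_k(n0 - m) for m in range(len(x)))
```

Both sums cover exactly the samples under the shifted window, so the two values agree to rounding error.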

Page 10

Spectral models: Which bandpass filters?

• Constant bandwidth? (analog / FFT)

• But: cochlea physiology & critical bandwidths
→ use actual bandpass filters, as in ear models, choosing bandwidths by e.g. critical-band estimates

• Auditory frequency scales:
- constant Q (center frequency / bandwidth), mel, Bark...

Page 11

Gammatone filterbank

• Given the bandwidths, which filter shapes?
- match the inferred temporal integration window
- match the inferred spectral shape (sharp high-frequency slope)
- keep it simple (since it's only approximate)

→ Gammatone filters:

h[n] = n^(N−1) · exp(−bn) · cos(ω_i n)

- 2N poles, 2 zeros, low complexity
- a reasonable linear match to the cochlea

[Figure: gammatone impulse response (time); magnitude response, 0 to −50 dB over 50 Hz-5 kHz (log frequency axis); z-plane plot of the repeated pole pairs]
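The gammatone impulse response h[n] = n^(N−1)·exp(−bn)·cos(ω_i n) can be sampled directly. In this sketch the ERB bandwidth formula (Glasberg & Moore) and the conventional 1.019 bandwidth factor are assumptions, not given on the slide:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.05):
    """Sampled gammatone impulse response h[n] = n^(N-1) exp(-b n) cos(w_i n)."""
    n = np.arange(int(dur * fs), dtype=float)
    erb = 24.7 * (4.37 * fc / 1000 + 1)   # ERB bandwidth in Hz (Glasberg & Moore)
    b = 2 * np.pi * 1.019 * erb / fs      # per-sample decay rate
    wi = 2 * np.pi * fc / fs              # center frequency in rad/sample
    h = n ** (order - 1) * np.exp(-b * n) * np.cos(wi * n)
    return h / np.max(np.abs(h))          # peak-normalize

h = gammatone_ir(fc=1000.0, fs=16000.0)   # one (1 kHz) channel of a filterbank
```

Evaluating `gammatone_ir` at center frequencies spaced on an auditory scale gives a filterbank like the cochlea model compared on the next slide.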

Page 12

Constant-BW vs. cochlea model

• Magnitude smoothed over 5-20 ms time window

• Spectrograms and frequency responses:

[Figure: FFT-based WB spectrogram (N=128; linear frequency axis, 0-8 kHz) vs. Q=4, 4-pole 2-zero cochlea model downsampled @ 64 (log frequency axis, 100 Hz-5 kHz); below, the effective FFT filterbank and gammatone filterbank gain responses (0 to −50 dB, 0-8 kHz, linear axis)]

Page 13

Limitations of spectral models

• Not much data thrown away:
- just fine phase/time structure (smoothing)
- little actual 'modeling'
- still a large representation!

• Little separation of features:
- e.g. formants and pitch

• Highly correlated features:
- modifications affect multiple parameters

• But, quite easy to reconstruct:
- iterative reconstruction of lost phase

Page 14

The cepstrum

• Original motivation: assume a source-filter model

• Define 'homomorphic deconvolution':
- source-filter convolution: g[n] * h[n]
- FT → product: G(e^jω) · H(e^jω)
- log → sum: log G(e^jω) + log H(e^jω)
- IFT → separates fine structure from broad: c_g[n] + c_h[n]
= deconvolution

• Definition (real cepstrum):

c_n = IDFT( log |DFT( x[n] )| )

[Figure: excitation source (fine time structure) driving a resonance filter (broad spectral structure)]
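The definition translates into a few lines. The toy signal below, a 100 Hz pulse train convolved with a single 700 Hz resonance, is an illustrative construction; its cepstrum shows the expected split into a low-quefrency envelope and a pitch pulse at the 80-sample period:

```python
import numpy as np

def real_cepstrum(x):
    """c_n = IDFT( log |DFT(x[n])| ) -- the real cepstrum."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real

# Toy source-filter signal: 100 Hz pulse train (period 80 samples at 8 kHz)
# convolved with one decaying resonance (all values are illustrative).
fs = 8000
n = np.arange(512)
source = (n % 80 == 0).astype(float)
resonance = np.exp(-0.02 * n) * np.cos(2 * np.pi * 700 / fs * n)
x = np.convolve(source, resonance)[:512]

c = real_cepstrum(x)
# Low quefrencies hold the resonance shape; the 'pitch pulse'
# appears at quefrency 80, the excitation period.
```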

Page 15

Stages in cepstral deconvolution

• Original waveform has excitation fine structure convolved with resonances

• DFT shows harmonics modulated by resonances

• Log DFT is sum of harmonic ‘comb’ and resonant bumps

• IDFT separates out resonant bumps (low quefrency) and regular, fine structure (‘pitch pulse’)

• Selecting low-n cepstrum separates resonance information (deconvolution / ‘liftering’)

[Figure: the stages for one frame: waveform and minimum-phase impulse response (samples); abs(DFT) and its liftered version (freq / Hz); log(abs(DFT)) and its liftered version (dB); real cepstrum with the lifter window and the pitch pulse marked (quefrency)]
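The low-quefrency selection ('liftering') step is a sketch like the following; the 30-bin quefrency cutoff is an arbitrary illustrative choice:

```python
import numpy as np

def lifter_envelope(x, n_keep=30):
    """Keep only low-quefrency cepstral bins -> smooth log-spectral envelope."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)
    c = np.fft.ifft(log_mag).real            # real cepstrum
    lifter = np.zeros_like(c)
    lifter[:n_keep] = 1.0                    # low quefrencies...
    lifter[-(n_keep - 1):] = 1.0             # ...and their symmetric mirror
    return np.fft.fft(c * lifter).real       # back to the log-magnitude domain

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                # any test signal
env = lifter_envelope(x)                     # smoothed log-magnitude spectrum
```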

Page 16

Properties of the cepstrum

• Separates source (fine structure) & filter (broad structure):
- smooth the log magnitude spectrum to get the resonances

• Smoothing the spectrum is filtering along frequency:
- i.e. a convolution applied in the Fourier domain
→ multiplication in the IFT domain ('liftering')

• Periodicity in time → harmonics in the spectrum
→ 'pitch pulse' in the high-n cepstrum

• Low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

c_n = ∫ log |X(e^jω)| · ( cos nω + j sin nω ) dω

(for a real signal, log |X(e^jω)| is even, so the sine term vanishes and c_n is a cosine transform)

[Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectral envelope]

Page 17

Aside: Correlation of elements

• The cepstrum is popular in speech recognition:
- feature vector elements are largely decorrelated
- c_0 'normalizes out' the average log energy

• Decorrelated pdfs fit diagonal-covariance Gaussians:
- modeling simple correlations is a waste of parameters

• DCT is close to PCA for spectra?

[Figure: auditory spectrum and cepstral coefficient features over ~150 frames; covariance matrices of each; an example joint distribution of coefficients (10, 15)]

Page 18

Outline

Modeling speech signals

Spectral and cepstral models

Linear Predictive models (LPC)
- The LPC model
- Interpretation & application
- Formant tracking

Other models

Speech synthesis


Page 19

Linear predictive modeling (LPC)

• LPC is a very successful speech model:
- it is mathematically efficient (IIR filters)
- it is remarkably successful for voice (fits the source-filter distinction)
- it has a satisfying physical interpretation (resonances)

• Basic math: model the output as a linear function of previous outputs:

s[n] = ( Σ_{k=1..p} a_k · s[n−k] ) + e[n]

... hence "linear prediction" (pth order)

- e[n] is the excitation (input), a/k/a the prediction error

S(z)/E(z) = 1 / ( 1 − Σ_{k=1..p} a_k · z^−k ) = 1 / A(z)

... all-pole modeling, 'autoregression' (AR) model
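The prediction equation doubles as a synthesis recursion: choose an A(z), drive it with an excitation, and iterate. All numbers below (pole frequencies and radii, pitch, sample rate) are hypothetical choices just to exercise the model:

```python
import numpy as np

# All-pole (AR) synthesis: s[n] = sum_{k=1..p} a_k s[n-k] + e[n]
# Hypothetical resonances near 700 and 1200 Hz at fs = 8 kHz.
fs = 8000
poles = []
for f, r in [(700, 0.97), (1200, 0.95)]:
    w = 2 * np.pi * f / fs
    poles += [r * np.exp(1j * w), r * np.exp(-1j * w)]  # conjugate pole pairs

A = np.real(np.poly(poles))   # A(z) coefficients: [1, -a_1, ..., -a_p]
a = -A[1:]                    # predictor coefficients a_k
p = len(a)

e = np.zeros(400)
e[::80] = 1.0                 # 100 Hz pulse-train excitation
s = np.zeros_like(e)
for n in range(len(s)):
    s[n] = e[n] + sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n >= k)
```

Because all poles lie inside the unit circle, the recursion is stable and each pulse excites a decaying two-resonance response.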

Page 20

Vocal tract motivation for LPC

• Direct expression of the source-filter model:

s[n] = ( Σ_{k=1..p} a_k · s[n−k] ) + e[n]

[Block diagram: pulse/noise excitation e[n] → vocal tract filter H(z) = 1/A(z) → speech s[n]; z-plane pole plot and the corresponding |H(e^jω)| resonance curve]

• Acoustic tube models suggest an all-pole model for the vocal tract

• Relatively slowly changing:
- update A(z) every 10-20 ms

• Not perfect: nasals introduce zeros

Page 21

Estimating LPC parameters

• Minimize the short-time squared prediction error:

E = Σ_n e²[n] = Σ_n ( s[n] − Σ_{k=1..p} a_k · s[n−k] )²

Differentiate w.r.t. a_k and set to zero:

Σ_n 2 · ( s[n] − Σ_{j=1..p} a_j · s[n−j] ) · ( −s[n−k] ) = 0

⇒ Σ_n s[n] · s[n−k] = Σ_j a_j · Σ_n s[n−j] · s[n−k]

i.e. φ(0, k) = Σ_j a_j · φ(j, k)

where φ(j, k) = Σ_n s[n−j] · s[n−k] are correlation coefficients

• p linear equations to solve for all the a_j s...

Page 22

Evaluating parameters

• Linear equations:

φ(0, k) = Σ_{j=1..p} a_j · φ(j, k)

• If s[n] is assumed zero outside some window:

φ(j, k) = Σ_n s[n−j] · s[n−k] = r(|j − k|)

Hence the equations become:

[ r(1) ]   [ r(0)    r(1)    …   r(p−1) ]   [ a_1 ]
[ r(2) ] = [ r(1)    r(0)    …   r(p−2) ] · [ a_2 ]
[  …   ]   [  …       …      …     …    ]   [  …  ]
[ r(p) ]   [ r(p−1)  r(p−2)  …   r(0)   ]   [ a_p ]

• Toeplitz matrix (equal antidiagonals)
→ can use the Durbin recursion to solve

• (Solving the full φ(j, k) system requires e.g. Cholesky decomposition)
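A minimal sketch of the Durbin recursion for this Toeplitz system, checked against a direct solve of the normal equations (the test signal and model order are arbitrary):

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve phi(0,k) = sum_j a_j phi(j,k) with phi(j,k) = r(|j-k|).

    Returns the predictor coefficients a_1..a_p of
    s[n] = sum_k a_k s[n-k] + e[n].
    """
    A = np.zeros(p + 1)          # A(z) coefficients, A[0] = 1
    A[0] = 1.0
    E = r[0]                     # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] + np.dot(A[1:i], r[i-1:0:-1])
        k = -acc / E             # reflection coefficient
        A[1:i+1] = A[1:i+1] + k * A[i-1::-1]
        E *= 1.0 - k * k
    return -A[1:]                # a_k = -A[k]

# Check against a direct solution of the Toeplitz normal equations
rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
p = 4
r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
a = levinson_durbin(r, p)
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
a_direct = np.linalg.solve(R, r[1:])
```

The recursion costs O(p²) instead of the O(p³) of a general solve, and the reflection coefficients k it produces are exactly the lattice-form parameters mentioned later.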

Page 23

LPC illustration

• Actual poles:

[Figure: windowed original waveform and LPC residual (time / samples); original spectrum, LPC spectrum, and residual spectrum (dB vs. freq / Hz, 0-7 kHz); the fitted pole positions on the z-plane]

Page 24

Interpreting LPC

• Picking out resonances:
- if the signal really was a source driving all-pole resonances, LPC should find those resonances

• Least-squares fit to the spectrum:
- minimizing Σ e²[n] in the time domain is the same as minimizing the integrated |E(e^jω)|² (by Parseval)
→ close fit to spectral peaks; valleys don't matter

• Removing smooth variation in the spectrum:
- 1/A(z) is a low-order approximation to S(z) (since S(z)/E(z) = 1/A(z))
- hence the residual E(z) = A(z)·S(z) is a 'flat' version of S

• Signal whitening:
- white noise (independent x[n]s) has a flat spectrum
→ whitening removes temporal correlation

Page 25

Alternative LPC representations

• Many alternative p-dimensional representations:
- coefficients {a_i}
- roots {λ_i}:  Π_i ( 1 − λ_i · z^−1 ) = 1 − Σ_i a_i · z^−i
- line spectrum frequencies...
- reflection coefficients {k_i} from the lattice form
- tube model log area ratios:  g_i = log( (1 − k_i) / (1 + k_i) )

• Choice depends on:
- mathematical convenience/complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics

Page 26

LPC Applications

• Analysis-synthesis (coding, transmission):

S(z) = E(z) / A(z)

- hence can reconstruct by filtering e[n] with the {a_i}s
- the whitened, decorrelated, minimized e[n]s are easy to quantize
- ... or can model e[n] e.g. as a simple pulse train

• Recognition/classification:
- the LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

• Modification:
- separating source and filter supports cross-synthesis
- the pole / resonance model supports 'warping' (e.g. male → female)

Page 27

Aside: Formant tracking

• Formants carry (most of?) the linguistic information

• Why not classify them directly for speech recognition?
- e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit

• But: recognition needs to work in all circumstances
- formants can be obscure or undefined
→ need more graceful, robust parameters

[Figure: spectrograms, 0-4 kHz: the original utterance (mpgr1_sx419) and noise-excited LPC resynthesis with the pole frequencies overlaid]

Page 28

Outline

Modeling speech signals

Spectral and cepstral models

Linear predictive models (LPC)

Other models
- Sinewave modeling
- Harmonic+Noise model (HNM)

Speech synthesis


Page 29

Other models:Sinusoid modeling

• Early signal models required low complexity:
- e.g. LPC

• Advances in hardware open new possibilities...

• The NB spectrogram suggests a harmonics model:
- the 'important' info in the 2-D surface is the set of tracks?
- harmonic tracks have ~smooth properties
- straightforward resynthesis

[Figure: narrowband spectrogram, 0-4 kHz over 1.5 s, showing the harmonic tracks]

Page 30

Sine wave models

• Model sound as a sum of AM/FM sinusoids:

s[n] = Σ_{k=1..N[n]} A_k[n] · cos( n·ω_k[n] + φ_k[n] )

- A_k, ω_k, φ_k piecewise linear or constant
- can enforce harmonicity: ω_k = k·ω_0

• Extract parameters directly from STFT frames:
- find local maxima of |S[k,n]| along frequency
- track birth/death & correspondence of tracks

[Figure: extracted sinusoid tracks plotted in time-frequency, with magnitude]
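The sum-of-sinusoids equation renders directly into code. For a time-varying frequency track, the phase argument n·ω_k[n] is implemented as the running integral (cumulative sum) of the frequency; the pitch glide and 1/k amplitude envelopes below are hypothetical tracks:

```python
import numpy as np

fs = 8000
n_samp = int(0.5 * fs)
f0 = np.linspace(120.0, 140.0, n_samp)   # hypothetical pitch track, Hz
phi0 = 2 * np.pi * np.cumsum(f0) / fs    # running phase of the fundamental

s = np.zeros(n_samp)
for k in range(1, 11):                   # harmonic tracks: w_k = k * w_0
    A_k = 1.0 / k                        # simple 1/k amplitude envelope
    s += A_k * np.cos(k * phi0)
s /= np.max(np.abs(s))                   # normalize
```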

Page 31

Finding sinusoid peaks

• Look for local maxima along each DFT frame:
- i.e. |S[k−1,n]| < |S[k,n]| > |S[k+1,n]|

• Want the exact frequency of the implied sinusoid:
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bins = 15.6 Hz
- interpolate at peaks via a quadratic fit?
- may also need interpolated, unwrapped phase

[Figure: quadratic fit through 3 spectral samples around a peak gives the interpolated frequency and magnitude]

• Or, use the differential of phase along time (phase vocoder):

ω = ( a·ḃ − b·ȧ ) / ( a² + b² ), where S[k,n] = a + jb
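The quadratic-fit refinement can be sketched as follows, here on the log-magnitude (dB) spectrum, a common choice that reduces interpolation bias; the off-grid test frequency of 20.3 bins is arbitrary:

```python
import numpy as np

def quad_interp_peak(mag, k):
    """Fit a parabola through bins (k-1, k, k+1); return refined bin and height."""
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    delta = 0.5 * (a - c) / (a - 2 * b + c)   # vertex offset, |delta| <= 0.5
    height = b - 0.25 * (a - c) * delta
    return k + delta, height

N = 256
n = np.arange(N)
x = np.cos(2 * np.pi * 20.3 * n / N)          # sinusoid between DFT bins
mag_db = 20 * np.log10(np.abs(np.fft.rfft(x * np.hanning(N))) + 1e-12)
k = int(np.argmax(mag_db))                    # coarse peak bin
k_frac, _ = quad_interp_peak(mag_db, k)       # refined, close to 20.3 bins
```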

Page 32

Sinewave modeling applications

• Modification (interpolation) & synthesis:
- connecting arbitrary ω & φ across a frame requires cubic phase interpolation (because ω = dφ/dt)

• Types of modification:
- time & frequency scale modification, with or without changing the formant envelope
- concatenation / smoothing at boundaries
- phase realignment (for crest reduction)

• Non-harmonic signals? OK-ish

[Figure: sinewave-model spectrogram of the example utterance, 0-4 kHz over 1.5 s]

Page 33

Harmonics + noise model

• Motivation to modify the sinusoid model:
- problems with analysis of real (noisy) signals
- problems with synthesis quality (esp. noise)
- perceptual suspicions

• Model:

s[n] = Σ_{k=1..N[n]} A_k[n] · cos( n·k·ω_0[n] )  +  e[n] · ( h_n[n] ⊗ b[n] )
            (harmonics)                                (noise)

- the sinusoids are forced to be harmonic
- the remainder is filtered & time-shaped noise

• 'Break frequency' F_m[n] between H and N:

[Figure: spectrum (dB vs. freq / Hz) split at the harmonicity limit F_m[n]: harmonics below, noise above]

Page 34

HNM analysis and synthesis

• Dynamically adjust F_m[n] based on a 'harmonic test':

[Figure: spectrogram, 0-4 kHz over 1.5 s, with the F_m[n] harmonicity-limit track overlaid]

• Noise has envelopes in time, e[n], and frequency, H_n[k]:
- reconstruct bursts / synchronize to pitch pulses

[Figure: noise frequency envelope H_n[k] (dB vs. freq / Hz) and time envelope e[n] over ~30 ms]

Page 35

Outline

Modeling speech signals

Spectral and cepstral models

Linear predictive models (LPC)

Other models

Speech synthesis
- Phone concatenation
- Diphone synthesis


Page 36

Speech synthesis

• One thing you can do with models

• Easier than recognition?
- listeners do the work
- ... but listeners are very critical

• Overview of synthesis:
- normalization disambiguates the text (abbreviations)
- phonetic realization from a pronouncing dictionary
- prosodic synthesis by rule (timing, pitch contour)
- ... all of which controls waveform generation

[Block diagram: text → text normalization → phoneme generation → prosody generation → synthesis algorithm → speech]

Page 37

Source-filter synthesis

• The flexibility of the source-filter model is ideal for speech synthesis

• Excitation source issues:
- voiced / unvoiced / mixture ([th] etc.)
- pitch cycle of voiced segments
- glottal pulse shape → voice quality?

[Block diagram: voiced/unvoiced decision, pitch info, and phoneme info (e.g. 'th ax k ae t') drive a glottal pulse source and a noise source into the vocal tract filter, producing speech]

Page 38

Vocal tract modeling

• Simplest idea: store a single VT model for each phoneme
- but: discontinuities are very unnatural

• Improve by smoothing between templates
- the trick is finding the right domain

[Figure: time-frequency sketches for 'th ax k ae t': abrupt per-phoneme templates vs. smoothed transitions]

Page 39

Cepstrum-based synthesis

• The low-n cepstrum is a compact model of the target spectrum:

c_n = IDFT( log |DFT( x[n] )| )

• Can invert to get an actual VT impulse-response waveform:

h[n] = IDFT( exp( DFT( c_n ) ) )

• All-zero (FIR) VT response
→ can pre-convolve with glottal pulses
- cross-fading between templates is then OK

[Figure: glottal pulse inventory; pitch pulse times (from the pitch contour) trigger pre-convolved templates for 'ee', 'ae', 'ah']
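The two formulas above can be checked numerically. With the real (magnitude-only) cepstrum, the inverted h[n] comes out zero-phase rather than minimum-phase, and its magnitude spectrum equals the exponentiated liftered log spectrum; the 30-bin lifter and the noise test frame are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                 # stand-in analysis frame

# c_n = IDFT( log |DFT(x[n])| ), kept only at low n (plus the mirrored bins)
c = np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real
c_lift = np.zeros_like(c)
c_lift[:30] = c[:30]
c_lift[-29:] = c[-29:]

# h[n] = IDFT( exp( DFT(c_n) ) ): invert the liftered cepstrum to a waveform
h = np.fft.ifft(np.exp(np.fft.fft(c_lift))).real
```

Because the liftered real cepstrum is symmetric, its DFT is real, so exp() of it is a pure (smoothed) magnitude spectrum and h is its zero-phase impulse response.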

Page 40

LPC-based synthesis

• Very compact representation of target spectra:
- 3 or 4 pole pairs per template

• Low-order IIR filter → very efficient synthesis

• How to interpolate?
- cannot just interpolate the a_i of a running filter
- but: the lattice filter form has better-behaved interpolation

• What to use for excitation?
- residual from the original analysis
- reconstructed periodic pulse train
- parameterized residual resynthesis

[Figure: direct-form all-pole synthesis filter (taps a_1, a_2, a_3) and the equivalent analysis/synthesis lattice filters (reflection coefficients k_0 ... k_{p−1})]

Page 41

Diphone synthesis

• Problems in phone-concatenation synthesis:
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception

→ store transitions instead of just phonemes:
- ~40 phones → 800 diphones
- or even more context if you have a larger database

• How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized, multiband

[Figure: phone labels vs. diphone segments spanning the phone-to-phone transitions]

Page 42

HNM synthesis

• High-quality resynthesis of real diphone units, plus a parametric representation for modifications:
- pitch and timing modifications
- removal of discontinuities at boundaries

• Synthesis procedure:
- linguistic processing gives phones, pitch, timing
- database search gives the best-matching units
- use HNM to fine-tune pitch & timing
- cross-fade the A_k and ω_0 parameters at boundaries

• Careful preparation of the database is key:
- sine models allow phase alignment of all units
- a larger database improves unit match

[Figure: time-frequency sketch of units cross-faded at boundaries]

Page 43

Generating prosody

• The real factor limiting speech synthesis?

• Waveform synthesizers have inputs for:
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

• Curves produced by superposition of (many) inferred linguistic rules:
- phrase-final lengthening, unstressed shortening...

• Or learn the rules from transcribed examples

Page 44

Summary

• Range of models:
- spectral
- cepstral
- LPC
- sinusoid
- HNM

• Range of applications:
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch, time modification (HNM) → synthesis

• Issues:
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality