Mel-Generalized Cepstral Representation of Speech: A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda. Nagoya Institute of Technology; Carnegie Mellon University; Tamkang University. March 13, 2002.

Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Page 1

Mel-Generalized Cepstral Representation of Speech

—A Unified Approach to Speech Spectral Estimation

Keiichi Tokuda

Nagoya Institute of Technology

Carnegie Mellon University

Tamkang University

March 13, 2002

Page 2

Conventional Speech Spectral Estimation

• Linear prediction (LPC): autoregressive (AR) model

• Cepstral analysis: exponential (EX) model

• Subband filter bank: nonparametric

Variations

• Model ⇒ Pole-zero (ARMA) model

• Analysis window ⇒ Adaptive analysis (sample by sample basis)

• Auditory characteristics ⇒ Warped LPC, PLP, etc.

◦ Auditory frequency scales (mel, Bark)

◦ Loudness scales (log, sone)

Page 3

Structure of This Talk

1. Conventional cepstral analysis

2. Introduction of generalized logarithmic function

⇒ Generalized cepstral analysis

3. Introduction of auditory frequency scale

⇒ Mel-generalized cepstral analysis

4. Applications to speech recognition and coding

Page 4

History of Cepstral Analysis

• B.P. Bogert, M.J.R. Healy, J.W. Tukey (1963)

Analysis of seismic signals

— decomposition into direct wave and echo

⇒ Cepstrum, Quefrency, Lifter

• A.M. Noll (1964, 1967)

Pitch extraction based on cepstrum

• A.V. Oppenheim (1966, 1968)

Homomorphic deconvolution

— decomposition into source and vocal tract function

⇒ Complex cepstrum

Page 5

Definition of Cepstrum

Fourier transform of signal s(n)

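$$S(e^{j\omega}) = \mathcal{F}[\,s(n)\,]$$

Cepstrum

$$C(m) = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})|^{2}\right] \quad \text{(Bogert et al., Noll)}$$

$$C(m) = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})|\right] \quad \text{(Oppenheim)}$$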

Page 6

Complex Cepstrum

z-transform of signal s(n)

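$$S(z) = \mathcal{Z}[\,s(n)\,]$$

Complex cepstrum

$$c(m) = \mathcal{Z}^{-1}[\log S(z)] = \mathcal{F}^{-1}\left[\log S(e^{j\omega})\right] = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})| + j \arg S(e^{j\omega})\right]$$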

Page 7

Cepstrum and Complex Cepstrum

$$\log |S(e^{j\omega})| = \mathcal{F}[\,C(m)\,] = \mathrm{Re}\left[\mathcal{F}[\,c(m)\,]\right]$$

⇓ When S(z) is minimum phase (all poles and zeros are located inside the unit circle)

$$c(m) = \begin{cases} 0, & m < 0 \\ C(m), & m = 0 \\ 2C(m), & m > 0 \end{cases}$$
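As a hedged illustration of the relation above, the following NumPy sketch computes C(m) from the log magnitude spectrum (Oppenheim's definition) and folds it into c(m) under the minimum-phase assumption. The function names, the FFT length, and the two-sample test signal are assumptions made for this example and do not come from the talk.

```python
import numpy as np

def cepstrum(s, nfft=1024):
    """C(m) = F^-1[ log|S(e^jw)| ], evaluated on an nfft-point grid."""
    S = np.fft.fft(s, nfft)
    return np.fft.ifft(np.log(np.abs(S))).real

def fold_to_complex_cepstrum(C):
    """Minimum-phase case: c(m) = 0 (m < 0), C(0) (m = 0), 2C(m) (m > 0)."""
    n = len(C)
    c = np.zeros(n)
    c[0] = C[0]
    c[1:n // 2] = 2.0 * C[1:n // 2]   # positive quefrencies
    c[n // 2] = C[n // 2]             # Nyquist bin is shared, kept as is
    return c                          # upper half (negative quefrencies) stays 0

# Hypothetical minimum-phase signal: S(z) = 1 + 0.5 z^-1 (zero at z = -0.5)
s = np.array([1.0, 0.5])
c = fold_to_complex_cepstrum(cepstrum(s))
```

For this signal, log S(z) = 0.5 z^-1 − 0.125 z^-2 + ..., so the folded result can be checked against the power series directly. The FFT length only needs to be large enough that time-domain aliasing of the cepstrum is negligible.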

Page 8

Homomorphic Deconvolution

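[Diagram: an excitation e(n) (pulse train or white noise) drives a linear time-invariant system h(n) to produce speech s(n) = e(n) ∗ h(n).]

$$s(n) = h(n) * e(n)$$

↓ F

$$S(e^{j\omega}) = H(e^{j\omega})\,E(e^{j\omega})$$

↓ log | · |

$$\log |S(e^{j\omega})| = \log |H(e^{j\omega})| + \log |E(e^{j\omega})|$$

↓ F⁻¹

$$C(m) = C_h(m) + C_e(m)$$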
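The chain of steps above can be checked numerically. The sketch below is a minimal illustration, assuming toy sequences for h(n) and e(n) that are not taken from the talk: it convolves them and confirms that the cepstrum of the result is the sum of the two individual cepstra.

```python
import numpy as np

def cepstrum(x, nfft):
    """C(m) = F^-1[ log|X(e^jw)| ] on an nfft-point grid."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x, nfft)))).real

# Hypothetical system response and decaying pulse train (period 4 samples)
h = np.array([1.0, 0.5])
e = np.zeros(9)
e[[0, 4, 8]] = [1.0, 0.6, 0.36]

s = np.convolve(h, e)            # s(n) = h(n) * e(n), length 10

nfft = 64                        # >= len(s), so the product of FFTs is exact
C_s = cepstrum(s, nfft)
C_h = cepstrum(h, nfft)
C_e = cepstrum(e, nfft)

# Convolution in time becomes addition in the cepstral domain
max_err = np.max(np.abs(C_s - (C_h + C_e)))
```

In a real analysis the excitation contribution C_e(m) sits at high quefrencies (multiples of the pitch period), which is what makes separating it from C_h(m) with a lifter possible.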

Page 9

[Figure, panels (a) to (e): waveform s(n) against time (units of 100 μs), cepstrum C(m) against quefrency (units of 100 μs), and spectra shown as amplitude (dB) against frequency (0 to 5 kHz).]

Page 10

Spectrum of Periodic Signal

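[Figure: a pulse train e(n) with period Np excites h(n); the result s(n) is observed through a window w(n). Frequency-domain panels over 0 to π show |W(e^{jω})|, log|E_w(e^{jω})| with harmonics spaced 2π/Np apart, |H(e^{jω})|, and log|S_w(e^{jω})|.]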

Page 11

Problems of Homomorphic Processing

(Cepstral Analysis)

Linear smoothing of log spectrum

• affected by fine structure of FFT spectrum

• results in a large bias and variance

Voiced speech (periodic)

• Envelope of peaks of spectral fine structure

⇒ Improved cepstral analysis, PSE: Biased

Page 12

Cost Function

P(ω): estimate of power spectrum

I_N(ω): periodogram

$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\frac{I_N(\omega)}{P(\omega)} - \log\frac{I_N(\omega)}{P(\omega)} - 1\right\}d\omega \;\Rightarrow\; \min$$

x: Gaussian process ⇒ Maximizing p(x|c)

• Unbiased estimation of log spectrum

• equivalent to one used in LPC

• Minimization of energy of inverse filter output
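To make the cost function concrete, here is a minimal sketch that evaluates E on a uniform frequency grid, where the normalized integral reduces to an average over the grid. The periodogram normalization by N, the function names, and the random test signal are assumptions for this example only.

```python
import numpy as np

def periodogram(x, nfft=512):
    """I_N(w) = |X(e^jw)|^2 / N on a uniform grid over [0, 2*pi)."""
    return np.abs(np.fft.fft(x, nfft)) ** 2 / len(x)

def cost(I, P):
    """E = (1/2pi) * integral of { I/P - log(I/P) - 1 } dw, as a grid average."""
    r = I / P
    return np.mean(r - np.log(r) - 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)     # hypothetical test frame
I = periodogram(x)

E_exact_fit = cost(I, I)         # model equals the periodogram
E_gain_error = cost(I, 2.0 * I)  # model is 3 dB too high everywhere
```

The integrand r − log r − 1 is zero only at r = 1 and positive elsewhere, so E vanishes for a perfect fit and grows with any mismatch between I_N(ω) and P(ω).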

Page 13

Analysis of Natural Speech

[Figure: log magnitude (dB, −20 to 80) against frequency (0 to 5 kHz) for (a) unbiased cepstral analysis and (b) linear prediction.]

Page 14

Generalized Cepstrum

Complex Cepstrum

$$c(m) = \mathcal{Z}^{-1}[\log S(z)], \qquad \log S(z) = \mathcal{Z}[\,c(m)\,]$$

Generalized Cepstrum

$$c_\gamma(m) = \mathcal{Z}^{-1}[\,s_\gamma(S(z))\,], \qquad s_\gamma(S(z)) = \mathcal{Z}[\,c_\gamma(m)\,]$$

Page 15

Generalized logarithmic function

$$s_\gamma(w) = \begin{cases} (w^{\gamma} - 1)/\gamma, & 0 < |\gamma| \le 1 \\ \log w, & \gamma = 0 \end{cases}$$

[Figure: s_γ(x) against x for γ = 1, 0 < γ < 1, γ = 0 (log x), −1 < γ < 0, and γ = −1.]

Page 16

Spectral Model

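Generalized cepstrum: c_γ(m)

$$H(z) = s_\gamma^{-1}\left(\sum_{m=0}^{M} c_\gamma(m)\, z^{-m}\right) = \begin{cases} \left(1 + \gamma \displaystyle\sum_{m=0}^{M} c_\gamma(m)\, z^{-m}\right)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp \displaystyle\sum_{m=0}^{M} c_\gamma(m)\, z^{-m}, & \gamma = 0 \end{cases}$$

Inverse function of generalized logarithm

$$s_\gamma^{-1}(w) = \begin{cases} (1 + \gamma w)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp w, & \gamma = 0 \end{cases}$$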
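The generalized logarithm and its inverse are simple to express in code. The sketch below is an illustration restricted to positive real arguments; the names `glog` and `gexp` are made up for this example and are not from the talk or from any toolkit.

```python
import numpy as np

def glog(w, gamma):
    """s_gamma(w): (w**gamma - 1)/gamma for 0 < |gamma| <= 1, log(w) for gamma = 0."""
    w = np.asarray(w, dtype=float)
    if gamma == 0.0:
        return np.log(w)
    return (w ** gamma - 1.0) / gamma

def gexp(w, gamma):
    """Inverse: (1 + gamma*w)**(1/gamma) for 0 < |gamma| <= 1, exp(w) for gamma = 0."""
    w = np.asarray(w, dtype=float)
    if gamma == 0.0:
        return np.exp(w)
    return (1.0 + gamma * w) ** (1.0 / gamma)

w = np.array([0.5, 1.0, 2.0])                 # positive real test values
round_trip = {g: gexp(glog(w, g), g) for g in (-1.0, -0.5, 0.0, 0.5, 1.0)}
near_log = glog(w, 1e-6)                      # approaches log(w) as gamma -> 0
```

With γ = −1 the function becomes 1 − 1/w, which is why that end of the range leads to an all-pole (linear prediction) model, while γ → 0 recovers the ordinary logarithm of cepstral analysis.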

Page 17

Cost Function

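$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\frac{I_N(\omega)}{P(\omega)} - \log\frac{I_N(\omega)}{P(\omega)} - 1\right\}d\omega \;\Rightarrow\; \min$$

Estimate of power spectrum

$$P(\omega) = |H(e^{j\omega})|^{2} = \sigma^{2}\,|D(e^{j\omega})|^{2}$$

Interpretation in time domain

$$\varepsilon = E\left[e^{2}(n)\right] \;\Rightarrow\; \min$$

[Diagram: x(n) is passed through the inverse filter 1/D(z) to give e(n).]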

Page 18

Advantage

−1 ≤ γ ≤ 0:

• Convex function ⇒ the global solution can easily be obtained

• The obtained system H(z) is minimum phase, i.e., stable

• γ = −1 ⇒ Linear prediction

$$H(z) = \frac{1}{1 - \displaystyle\sum_{m=0}^{M} c_\gamma(m)\, z^{-m}}$$

• γ = 0 ⇒ Cepstrum

$$H(z) = \exp \sum_{m=0}^{M} c_\gamma(m)\, z^{-m}$$

Page 19

Prediction Gain

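• D(z) is minimum phase

• Gain of D(z) is one

⇒ Predictor:

$$Q(z) = \sum_{k=1}^{\infty} a(k)\, z^{-k}$$

Cost function:

$$\varepsilon = E\left[e^{2}(n)\right]$$

⇒ Prediction gain:

$$G = \frac{E\left[x^{2}(n)\right]}{E\left[e^{2}(n)\right]}$$

[Diagram: the inverse filter 1/D(z) maps x(n) to e(n); equivalently, e(n) is x(n) minus the output of the predictor Q(z).]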
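As a small worked example of the prediction gain G = E[x²(n)] / E[e²(n)], the sketch below generates a first-order autoregressive signal and applies the matching one-tap predictor. The AR(1) coefficient, the signal length, and the random seed are assumptions chosen for illustration; the talk itself uses higher-order models.

```python
import numpy as np

def prediction_gain_db(x, e):
    """G = E[x^2(n)] / E[e^2(n)], returned in dB."""
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal(50000)       # white excitation with unit variance
a = 0.9                              # hypothetical AR(1) coefficient

x = np.zeros_like(w)
x[0] = w[0]
for n in range(1, len(w)):           # x(n) = a x(n-1) + w(n)
    x[n] = a * x[n - 1] + w[n]

# Predictor Q(z) = a z^-1, so the prediction error is e(n) = x(n) - a x(n-1)
e = x.copy()
e[1:] -= a * x[:-1]

G = prediction_gain_db(x, e)
G_theory = 10.0 * np.log10(1.0 / (1.0 - a ** 2))
```

For an AR(1) process the signal variance is 1/(1 − a²) times the excitation variance, so the measured gain should sit close to the theoretical value of about 7.2 dB for a = 0.9.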

Page 20

Analysis of synthetic signal

(Generalized Cepstral Analysis)

[Figure (a) Example 1, M = 8: log magnitude (10 dB per division) against frequency, showing the true spectrum and the estimates for γ = −1, −3/4, −1/2, −1/4, 0; and prediction gain (dB) against γ from −1 (LPC) to 0 (UELS).]

Page 21

Analysis of synthetic signal

(Generalized Cepstral Analysis)

[Figure (c) Example 3, M = 8: log magnitude (10 dB per division) against frequency, showing the true spectrum and the estimates for γ = −1, −3/4, −1/2, −1/4, 0; and prediction gain (dB) against γ from −1 (LPC) to 0 (UELS).]

Page 22

Analysis of synthetic signal

(Generalized Cepstral Analysis)

[Figure (b) Example 2, M = 8: log magnitude (10 dB per division) against frequency, showing the true spectrum and the estimates for γ = −1, −3/4, −1/2, −1/4, 0; and prediction gain (dB) against γ from −1 (LPC) to 0 (UELS).]

Page 23

Analysis of natural speech

(Generalized Cepstral Analysis) /e/

[Figure (a) male /e/, M = 15: magnitude (dB, −20 to 80) against frequency (0 to 5 kHz) for γ = −1, −1/2, −1/3, 0; and prediction gain (dB) against γ from −1 (LPC) to 0 (UELS).]

Page 24

Analysis of natural speech

(Generalized Cepstral Analysis) /N/

[Figure (b) male /N/, M = 15: magnitude (dB, −20 to 80) against frequency (0 to 5 kHz) for γ = −1, −1/2, −1/3, 0; and prediction gain (dB) against γ from −1 (LPC) to 0 (UELS).]

Page 25

Structure of synthesis filter H(z) (γ = −1/n)

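[Diagram: the input is scaled by σ and passed through a cascade of n identical stages 1/C(z) (1st, 2nd, ..., n-th) to give the output.]

$$H(z) = \sigma D(z) = \sigma \left\{\frac{1}{C(z)}\right\}^{n}$$

$$C(z) = 1 + \gamma \sum_{m=0}^{M} c'_\gamma(m)\, z^{-m}$$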

Page 26

Structure of synthesis filter H(z) (γ = 0)

—LMA filter

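[Diagram: LMA filter realized as a cascade of four basic filters F(z) with weights A_{L,1}, A_{L,2}, A_{L,3}, A_{L,4} between input and output.]

$$D(z) = \exp F(z) \simeq R_L(F(z)) = \frac{1 + \displaystyle\sum_{l=1}^{L} A_{L,l}\,\{F(z)\}^{l}}{1 + \displaystyle\sum_{l=1}^{L} A_{L,l}\,\{-F(z)\}^{l}}$$

$$F(z) = \sum_{m=1}^{M} c_\gamma(m)\, z^{-m}$$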

Page 27

Introduction of auditory frequency scale

First-order all-pass function:

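$$z_\alpha^{-1} = \Psi(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$

Phase characteristics can be used for frequency transformation:

$$\tilde{\omega} = \tan^{-1} \frac{(1 - \alpha^{2}) \sin\omega}{(1 + \alpha^{2}) \cos\omega - 2\alpha}$$

where $\Psi(e^{j\omega}) = e^{-j\tilde{\omega}}$.

[Figure: warped frequency ω̃ (rad, 0 to π) against frequency ω (rad, 0 to π), comparing the mel scale with the all-pass warping for α = 0.35 at 10 kHz sampling.]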
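The warping formula can be checked against the phase of the all-pass function directly. The following sketch is a minimal illustration under the assumption that `arctan2` is used to resolve the quadrant of the inverse tangent; the function names and the frequency grid are invented for this example.

```python
import numpy as np

def warp(omega, alpha):
    """Warped frequency obtained from the phase of the first-order all-pass."""
    num = (1.0 - alpha ** 2) * np.sin(omega)
    den = (1.0 + alpha ** 2) * np.cos(omega) - 2.0 * alpha
    return np.arctan2(num, den)      # arctan2 keeps the result in [0, pi]

def allpass(omega, alpha):
    """Psi(e^jw) = (e^-jw - alpha) / (1 - alpha e^-jw)."""
    z_inv = np.exp(-1j * omega)
    return (z_inv - alpha) / (1.0 - alpha * z_inv)

omega = np.linspace(0.01, np.pi - 0.01, 200)
w_tilde = warp(omega, 0.35)                       # alpha from the slide
phase_based = -np.angle(allpass(omega, 0.35))     # Psi = exp(-j * w_tilde)
identity = warp(omega, 0.0)                       # alpha = 0: no warping
```

A plain `arctan` of the ratio would jump by π where the denominator changes sign, which is why the two-argument form is used here. For positive α the low-frequency region is stretched, which is the behaviour needed to approximate an auditory scale.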

Page 28

Mel-Generalized Cepstral Analysis

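Mel-generalized cepstrum: c_{α,γ}(m)

$$H(z) = s_\gamma^{-1}\left(\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_\alpha^{-m}\right) = \begin{cases} \left(1 + \gamma \displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_\alpha^{-m}\right)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp \displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_\alpha^{-m}, & \gamma = 0 \end{cases}$$

where

$$z_\alpha^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$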

Page 29

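• (α, γ) = (0, 0) ⇒ Cepstral model:

$$H(z) = \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}$$

• (α, γ) = (0, −1) ⇒ AR model:

$$H(z) = \frac{1}{1 - \displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z^{-m}}$$

• (α, γ) = (0.35, 0) ⇒ Mel-cepstral model:

$$H(z) = \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_\alpha^{-m}$$

• (α, γ) = (0.47, −1) ⇒ Warped AR model:

$$H(z) = \frac{1}{1 - \displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\, z_\alpha^{-m}}$$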
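The four special cases above share one evaluation routine. The sketch below computes the log magnitude of H(e^{jω}) for arbitrary (α, γ); it is an illustration only, and the function name, coefficient values, and frequency grid are assumptions rather than anything specified in the talk. Working with the log magnitude avoids choosing a branch for the complex power 1/γ.

```python
import numpy as np

def mgc_log_magnitude(c, alpha, gamma, omega):
    """log|H(e^jw)| for mel-generalized cepstral coefficients c(0..M)."""
    z_inv = np.exp(-1j * omega)
    za_inv = (z_inv - alpha) / (1.0 - alpha * z_inv)   # warped delay z_alpha^-1
    m = np.arange(len(c))
    # sum_m c(m) z_alpha^-m, evaluated at every frequency on the grid
    poly = (np.asarray(c)[:, None] * za_inv[None, :] ** m[:, None]).sum(axis=0)
    if gamma == 0.0:
        return poly.real                               # log|exp(poly)|
    return np.log(np.abs(1.0 + gamma * poly)) / gamma  # log|(1 + g*poly)^(1/g)|

omega = np.linspace(0.0, np.pi, 257)
c = [0.0, 0.5]                                         # hypothetical coefficients

ar = mgc_log_magnitude(c, 0.0, -1.0, omega)            # AR model
cep = mgc_log_magnitude(c, 0.0, 0.0, omega)            # cepstral model
mcep = mgc_log_magnitude(c, 0.35, 0.0, omega)          # mel-cepstral model
```

With these coefficients the AR case reduces to 1/(1 − 0.5 z^-1), which gives a direct reference to compare against. At ω = 0 the warped delay equals 1 for any α, so the warped and unwarped models agree there.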

Page 30

A Unified Approach to Speech Spectral Estimation

• |α| < 1, −1 ≤ γ ≤ 0: Mel-generalized cepstral analysis

• α = 0: Generalized cepstral analysis

• γ = −1: Linear prediction (α = 0), Warped linear prediction (general α)

• γ = 0: Unbiased cepstral analysis (α = 0), Mel-cepstral analysis (general α)

Page 31

Mel-generalized analysis of natural speech /N/ M = 12

[Figure: 3 × 3 grid of spectral estimates, magnitude (dB, −40 to 40) against frequency (0 to 5 kHz); columns α = 0, 0.35, 0.47 and rows γ = 0, −0.5, −1.]

Page 32

Example

α = 0; γ = −1, −1/3, 0

[Figure: (a) waveform over 0 to 400 ms, labelled with the phoneme sequence n a N b u d e w a; (b) spectral estimates for α = 0 over 0 to 5 kHz (40 dB scale): (α, γ, M) = (0, −1, 12) LP, (0, −1/3, 12) GCEP, and (0, 0, 12) UELS.]

Page 33

Example

α = 0.35; γ = −1, −1/3, 0

[Figure: (a) waveform over 0 to 400 ms, labelled with the phoneme sequence n a N b u d e w a; (b) spectral estimates for α = 0.35 over 0 to 5 kHz (40 dB scale): (α, γ, M) = (0.35, −1, 12) WLP, (0.35, −1/3, 12) MGCEP, and (0.35, 0, 12) MCEP.]

Page 34

Structure of synthesis filter H(z) (γ = −1/n)

[Diagram, structure of H(z): the input is scaled by σ and passed through a cascade of n identical stages 1/C(z) (1st, 2nd, ..., n-th) to give the output.]

$$H(z) = \sigma D(z) = \sigma \left\{\frac{1}{C(z)}\right\}^{n}$$

$$C(z) = 1 + \gamma \sum_{m=0}^{M} c'_{\alpha,\gamma}(m)\, z_\alpha^{-m}$$

[Diagram, structure of C(z) for M = 3: a chain of unit delays z⁻¹ with all-pass coefficient α, taps b(1), b(2), b(3), and gain γ(1 − α²) between input and output.]

Page 35

Structure of synthesis filter H(z) (γ = 0)

—MLSA filter

$$D(z) = \exp F(z) \simeq R_L(F(z))$$

$$F(z) = \sum_{m=0}^{M} c'_{\alpha,\gamma}(m)\, z_\alpha^{-m}$$

• sufficient accuracy: maximum spectral error 0.24 dB

• O(8M) multiply-add operations per sample

• guaranteed stability

• M multiply-add operations for filter coefficients calculation

Page 36

Structure of MLSA filter

[Diagram: R_L(F(z)) ≃ exp F(z) = D(z) for L = 4, realized as a cascade of four basic filters F(z) with weights A_{L,1}, A_{L,2}, A_{L,3}, A_{L,4} between input and output.]

[Diagram: basic filter F(z) for M = 3, built from unit delays z⁻¹, all-pass coefficient α, gain 1 − α², and taps b(1), b(2), b(3).]

Page 37

The choice of α, γ for speech analysis/synthesis

Analysis/synthesis system with fixed α and γ

• Speech quality changes with γ:

γ → −1: clear

γ → 0: smooth

• When γ = 0, speech quality with (α, M) = (0.35, 15) is almost equivalent to that with (α, M) = (0, 30).

• When the analysis order is high enough, the difference becomes small.

Page 38

Feature of Unified Approach

• Linear prediction analysis and cepstral analysis are special cases.

• Mathematically well-defined

• Physical interpretation

⇒ Minimization of energy of inverse filter output

⇒ x is Gaussian ⇒ Maximization of p(x|c) (ML estimation)

• Global solution, stability of the system function

• Synthesis filter for direct synthesis from the estimated coefficients

⇒ LMA/MLSA/MGLSA filter

• Extension to adaptive analysis (sample by sample basis)

• Parameter transformation for speech recognition

Page 39

Word Recognition based on HMM

Spectral analysis:

1. (α1, γ1, M1) = (0, −1, 12) ⇒ Linear prediction

2. (α1, γ1, M1) = (0.35, −1/3, 12) ⇒ Mel-generalized cepstral analysis

3. (α1, γ1, M1) = (0.35, 0, 12) ⇒ Mel-cepstral analysis

Output vector of HMM:

(α2, γ2, M2) = (0.35, 0, 12): mel-cepstral coefficients and Δ (dynamic coefficients)

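$$H(z) = s_{\gamma_1}^{-1}\left(\sum_{m=0}^{M_1} c_{\alpha_1,\gamma_1}(m)\, z_{\alpha_1}^{-m}\right) = s_{\gamma_2}^{-1}\left(\sum_{m=0}^{\infty} c_{\alpha_2,\gamma_2}(m)\, z_{\alpha_2}^{-m}\right)$$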
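The identity above defines the parameter transformation implicitly. As a hedged sketch of one restricted case, the code below converts coefficients to the γ2 = 0 (cepstral) form when α1 = α2, so that both sides share the same warped frequency axis and only γ changes. It assumes a minimum-phase model, samples the log magnitude on a uniform grid of the warped axis, and folds the resulting even cepstrum. The function name and test coefficients are hypothetical, and the general case with α1 ≠ α2 (such as case 1 above) additionally needs a frequency transformation that is not shown here.

```python
import numpy as np

def to_cepstrum_same_alpha(c1, gamma1, M2, nfft=1024):
    """Convert c_{alpha,gamma1}(m) to c_{alpha,0}(m) for m = 0..M2.

    Assumes alpha1 == alpha2 and a minimum-phase model, so the conversion
    can be done entirely on the warped frequency axis.
    """
    w = 2.0 * np.pi * np.arange(nfft) / nfft           # uniform grid on the warped axis
    m = np.arange(len(c1))
    poly = (np.asarray(c1)[:, None] * np.exp(-1j * np.outer(m, w))).sum(axis=0)
    if gamma1 == 0.0:
        log_mag = poly.real
    else:
        log_mag = np.log(np.abs(1.0 + gamma1 * poly)) / gamma1
    C = np.fft.ifft(log_mag).real                      # even cepstrum of log|H|
    # Minimum phase: c(0) = C(0), c(m) = 2 C(m) for m > 0
    return np.concatenate(([C[0]], 2.0 * C[1:M2 + 1]))

# gamma1 = -1 with c1 = [0, 0.5] is the model 1 / (1 - 0.5 z_alpha^-1)
c2 = to_cepstrum_same_alpha([0.0, 0.5], gamma1=-1.0, M2=3)
```

For this input, log H expands as 0.5 z^-1 + 0.125 z^-2 + (1/24) z^-3 + ..., which gives a closed-form reference for the first few output coefficients. The right-hand sum in the identity is infinite, so M2 simply truncates it.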

Page 40

Recognition Accuracy

(Continuous HMM, 33 phoneme models, 2618 words)

Recognition accuracy (%):

• Linear prediction, (α1, γ1, M1) = (0, −1, 12): 77.5%

• Mel-generalized cepstral analysis, (α1, γ1, M1) = (0.35, −1/3, 12): 80.7%

• Mel-cepstral analysis, (α1, γ1, M1) = (0.35, 0, 12): 79.6%

Page 41

Application to 16kb/s wideband CELP coder

[Block diagram. Encoder blocks: input, MGC analysis, quantization of MGC coefficients, CELP excitation generator, synthesis filter, perceptual weighting filter, MSE minimization. Decoder blocks: CELP excitation generator, synthesis filter, postfilter, output. MGC coefficients are the parameters passed to the filter blocks.]

Page 42

Speech quality as a function of α (γ = −1/2)

[Figure: DMOS (3.0 to 4.5) against α (0.0 to 0.5), with curves for average, female, and male speakers.]

Page 43

Subjective Evaluation

[Figure: DMOS (1 to 5) for G.722 at 48, 56, and 64 kb/s, conventional CELP at 16 kb/s, MGC-CELP at 16 kb/s, and MNRU references at 16, 24, 32, 40, and 48 dB; results shown for average, female, and male speakers.]

Page 44

Summary

A unified approach to speech spectral estimation

• A unified approach to linear prediction and cepstral analysis

• Introduction of auditory frequency scale

• Efficient representation of speech spectrum with an appropriate choice of α and γ

• Application to speech analysis/synthesis, speech coding, speech recognition

Future work: Optimal α and γ

(Phoneme/Speaker dependent?)

Speech Signal Processing Toolkit:

http://kt-lab.ics.nitech.ac.jp/~tokuda/SPTK/
