Smoothing Hidden Markov Models by Using an

Adaptive Signal Limiter for Noisy Speech Recognition

Wei-Wen Hung

Department of Electrical Engineering Ming Chi Institute of Technology

Taishan, 243, Taiwan, Republic of China E-mail : [email protected]

FAX : 886-02-2903-6852; Tel. : 886-02-2906-0379

and

Hsiao-Chuan Wang

Department of Electrical Engineering National Tsing Hua University

Hsinchu, 30043, Taiwan, Republic of China E-mail : [email protected]

FAX : 886-03-571-5971; Tel. : 886-03-574-2587

Paper No. : 1033. (second review)

Corresponding Author : Hsiao-Chuan Wang

Key Words : hidden Markov model (HMM), hard limiter, adaptive signal limiter

(ASL), autocorrelation function, arcsin transformation.


Smoothing hidden Markov models by using an adaptive signal limiter for noisy speech recognition

Wei-Wen Hung and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing Hua University

Hsinchu, 30043, Taiwan, Republic of China

Abstract. When a speech recognition system is deployed in the real world, environmental interference makes noisy speech signals and reference models mismatched and causes serious degradation in recognition

accuracy. To deal with the effect of environmental mismatch, a family of signal limiters has been successfully

applied to a template-based DTW recognizer to reduce the variability of speech features in noisy conditions.

Though simulation results indicate that heavy smoothing can effectively reduce the variability of speech features at low signal-to-noise ratios (SNR), it also causes a loss of information in the speech features.

Therefore, we suggest that the smoothing factor of a signal limiter should be related to SNR and adapted on a

frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous

and dynamic spectral features of reference models and test speech. By smoothing spectral features, the

smoothed covariance matrices of reference models can be obtained by means of maximum likelihood (ML)

estimation. A speech recognition task for multispeaker isolated Mandarin digits has been conducted to

evaluate the effectiveness and robustness of the proposed method. Experimental results indicate that the

adaptive signal limiter can achieve significant improvement in noisy conditions and is more robust than the

hard limiter over a wider range of SNR values.

Key words. hidden Markov model (HMM), hard limiter, adaptive signal limiter (ASL), autocorrelation

function, arcsin transformation.

This research has been partially sponsored by the National Science Council, Taiwan, ROC, under contract

number NSC-88-2614-E-007-002.


LIST OF FIGURES AND TABLES

Fig. 1 Block diagram for implementing a speech recognizer with adaptive signal limiter.

Fig. 2 The various LPC log magnitude spectra of utterance ‘1’ in clean condition.

(a) LPC log magnitude spectra without signal limiter.

(b) LPC log magnitude spectra with hard limiter.

(c) LPC log magnitude spectra with adaptive signal limiter.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 3 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise.

(a) LPC log magnitude spectra without signal limiter.

(b) LPC log magnitude spectra with hard limiter.

(c) LPC log magnitude spectra with adaptive signal limiter.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 4 The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise.

(a) LPC log magnitude spectra without signal limiter.

(b) LPC log magnitude spectra with hard limiter.

(c) LPC log magnitude spectra with adaptive signal limiter.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Fig. 5 The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise.

(a) Comparison of average log likelihoods without signal limiter.

(b) Comparison of average log likelihoods with hard limiter.

(c) Comparison of average log likelihoods with adaptive signal limiter.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Fig. 6 The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise.

(a) Comparison of average log likelihoods without signal limiter.

(b) Comparison of average log likelihoods with hard limiter.

(c) Comparison of average log likelihoods with adaptive signal limiter.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Table 1. Comparison of digit recognition rates (%) for white noise.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Table 2. Comparison of digit recognition rates (%) for factory noise.


(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Table 3. Comparison of digit recognition rates (%) for F16 noise.

(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)

Table 4. Comparison of computation costs based on Pentium II-266 MHz Personal Computer.


1. Introduction

When a speech recognition system trained in a well-defined environment is used in real-world applications, the acoustic mismatch between the training and testing environments severely degrades its recognition accuracy. This acoustic mismatch is mainly caused by a wide variety of distortion sources, such as ambient additive noise, channel effects and the speaker's Lombard effect. During the past several decades, researchers have focused their attention on this mismatch problem and tried to narrow the mismatch gap. Many algorithms have been proposed and successfully applied to robust speech recognition. Generally speaking, the methods for handling noisy speech recognition can be

roughly classified into the following approaches (Sankar and Lee, 1996). The first approach tries to

minimize the distance measures between reference models and testing signals by adaptively adjusting

speech signals in feature space. For example, Mansour and Juang (Mansour and Juang, 1989) found that

the norm of a cepstral vector is shrunk under noise contamination. Therefore, they used a first-order

equalization method to adapt the cepstral means of reference models so that the shrinkage of speech

features can be adequately compensated. Likewise, Carlson and Clement (Carlson and Clement, 1994)

also proposed a weighted projection measure (WPM) for recognition of noisy speech in the framework

of continuous density hidden Markov model (CDHMM). In addition, the norm shrinkage of cepstral

means will also lead to the reduction of HMM covariance matrices. Thus, Chien et al. (Chien, 1997a;

Chien et al., 1997b) proposed a variance adapted and mean compensated likelihood measure

(VA-MCLM) to adapt the mean vector and covariance matrix simultaneously.

The second approach estimates a transformation function in model space for transforming reference

models into testing environment and thus the environmental mismatch gap can be effectively reduced. In

the literature, there were a number of techniques compensating ambient noise effect in model space.

Among them, one of the most promising techniques is the so-called parallel model combination (PMC).

In the PMC algorithm, Varga and Moore (Varga and Moore, 1992a) adapted the statistics of reference


models to meet the testing conditions by optimally combining the reference models and noise model in

linear spectral domain. In the following years, several related works were successively reported for improving the performance of the PMC method. Flores and Young (Flores and Young, 1992) integrated spectral subtraction (SS) and PMC methods to seek further improvement in recognition accuracy. In

addition, Gales and Young (Gales and Young, 1995) extended PMC scheme to include the effect of

convolutional noise.

In the third approach, a more robust feature representation is developed in signal space so that the

speech feature is invariant or less susceptible to environmental variations. In this approach, Lee and Lin

(Lee and Lin, 1993) developed a family of signal limiters as a preprocessor to smooth speech signals.

When a speech signal is passed through a signal limiter with zero smoothing factor (i.e., a hard limiter),

the hard limiting operation preserves the sign of an input speech signal and ignores its magnitude. Thus,

the hard-limited speech signal is only affected by ambient noises when the signal-to-noise ratio (SNR) is

relatively low. This smoothing process for feature vectors has been shown to be effective for reducing the

variability of feature vectors in a noisy environment and makes them less affected by ambient noise over a

wide range of SNR values. Experimental results for recognition of 39-word alpha-digit vocabulary also

demonstrate that an equivalent gain of 5-7 dB in SNR can be achieved for a template-based DTW

recognizer.

However, from the experimental results reported by Lee and Lin (Lee and Lin, 1993), we can also

observe that the recognition accuracy using a hard limiter for clean speech becomes worse. This

phenomenon may be explained as follows. For an utterance, the amplitudes of unvoiced segments are

generally much lower than the amplitudes of voiced segments. Heavy smoothing can reduce the feature variability of the speech segments with low SNR, but it also causes the loss of some important information embedded in the clean segments and the segments with high SNR. Therefore, a signal limiter with a fixed smoothing factor might not work well for all the segments in a speech utterance. We suggest that the smoothing factor of a signal limiter should be related to the SNR value and adapted on a frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and

dynamic spectral features of hidden Markov models (HMM) and testing speech signals. In addition, in

order to moderately reflect the variation of the model covariance due to applying the signal limiting operation to the state statistics of the word models, the covariance matrix is also adapted in the sense of maximum likelihood (ML) estimation.

The layout of this paper is as follows. In the subsequent section, we describe the detailed formulation

of the proposed adaptive signal limiter and its extension to the framework of a continuous density hidden

Markov model. In Section 3, we investigate the behavior of LPC spectra of a speech utterance and its

signal-limited version under the influence of various ambient noises. In addition, a series of experiments

were conducted to compare the discriminability of different signal limiters in various noisy conditions.

Experiments on recognition of multispeaker isolated Mandarin digits are presented in Section 4 to evaluate the effectiveness and robustness of the proposed method in the presence of ambient noise.

Finally, a conclusion is drawn in Section 5.

2. Smoothing hidden Markov models by using an adaptive signal limiter

In this section, we describe the detailed formulation of the proposed adaptive signal limiter (ASL) and

its extension to the framework of an HMM-based speech recognizer.

2.1 Representation of the underlying hidden Markov models

Conventionally, for a continuous density hidden Markov model (CDHMM), the output likelihood of the t-th frame of a testing utterance Y = { y_t = [c_t, d_t], 1 ≤ t ≤ T_y }, evaluated on the i-th state of the word model Λ(w) = { Λ_{w,i} = (μ_{w,i}, Σ_{w,i}), 1 ≤ i ≤ S_w }, can be characterized by a multivariate Gaussian probability density function (pdf) and formulated as

p(y_t | Λ_{w,i}) = (2π)^{-p} |Σ_{w,i}|^{-1/2} exp[ -(1/2) (y_t - μ_{w,i})^T Σ_{w,i}^{-1} (y_t - μ_{w,i}) ],  (1)

where μ_{w,i} = [c_{w,i}, d_{w,i}] denotes the mean vector of the i-th state of word model Λ(w) and consists of the p-order cepstral vector c_{w,i} and the p-order delta cepstral vector d_{w,i}, and Σ_{w,i} denotes the covariance matrix of the i-th state of word model Λ(w), simplified as a diagonal matrix, i.e., Σ_{w,i} = diag[ σ²_{w,i}(1), σ²_{w,i}(2), ..., σ²_{w,i}(2p) ]. However, in order to adequately reflect the

variation of the dynamic spectral features due to applying a signal limiting operation to the instantaneous spectral features, the representation of the state statistics in a conventional hidden Markov model is modified slightly. In our approach, the mean vector μ_{w,i} = [c_{w,i}, d_{w,i}] of the i-th state of word model Λ(w) is indirectly represented by the normalized autocorrelation vectors of a five-frame context window (Lee and Wang, 1995), that is, [ r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2} ], where r_{w,i,j} = [ r_{w,i,j}(1), ..., r_{w,i,j}(p) ]^T; j = 0 denotes the instantaneous frame, j = -1, -2 the left context frames and j = 1, 2 the right context frames. The estimation of these normalized autocorrelation vectors in a five-frame context window proceeds as follows. First, a conventional hidden Markov model is trained for each word by means of the segmental k-means algorithm. Then, based upon the obtained word models, each frame in the training utterances is labeled with its decoded state identity by using the Viterbi decoding algorithm. The instantaneous, left-context and right-context autocorrelation vectors corresponding to the same state identity are collected and averaged to obtain the indirect representation of the underlying hidden Markov models. For example, the normalized autocorrelation vectors of the i-th state of word model Λ(w) can be formulated by

[ r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2} ] = (1/N_s) Σ_{u,t} [ r^u_{w,t-2}, r^u_{w,t-1}, r^u_{w,t}, r^u_{w,t+1}, r^u_{w,t+2} ],  (2)

where r^u_{w,t} represents the normalized autocorrelation vector of the t-th frame of the u-th training utterance of word w. The above summation includes all N_s frames which are labeled with the state identity i of word model Λ(w).

Based upon this indirect representation, the analysis equations of the linear predictive coding (LPC) model can be expressed in matrix form as

R_{w,i,j} · a_{w,i,j} = r_{w,i,j}, for j = -2, ..., 2,  (3)

where R_{w,i,j} is an autocorrelation matrix of the form

R_{w,i,j} =
[ r_{w,i,j}(0)    r_{w,i,j}(1)    ...  r_{w,i,j}(p-1) ]
[ r_{w,i,j}(1)    r_{w,i,j}(0)    ...  r_{w,i,j}(p-2) ]
[ ...             ...             ...  ...            ]
[ r_{w,i,j}(p-1)  r_{w,i,j}(p-2)  ...  r_{w,i,j}(0)   ].  (4)

Since the autocorrelation matrix is Toeplitz, symmetric and positive definite, the LPC coefficient vector a_{w,i,j} = [ a_{w,i,j}(1), a_{w,i,j}(2), ..., a_{w,i,j}(p) ]^T can be solved efficiently by the Levinson-Durbin recursion (Rabiner and Juang, 1993). Once we obtain the LPC coefficient vector from Eq. (3), the corresponding cepstral vector c_{w,i,j} can be calculated recursively by using the LPC-to-cepstral coefficient conversion formula

c_{w,i,j}(m) = a_{w,i,j}(m) + Σ_{k=1}^{m-1} (k/m) · c_{w,i,j}(k) · a_{w,i,j}(m-k), 1 ≤ m ≤ p.  (5)

Finally, the cepstral vector of the instantaneous frame, i.e., c_{w,i,j} for j = 0, is used as the mean cepstral vector c_{w,i} of the i-th state of word model Λ(w). In addition, the corresponding delta cepstral vector d_{w,i} can be calculated by using the following equation:

d_{w,i} = ( Σ_{j=-2}^{2} j · c_{w,i,j} ) / ( Σ_{j=-2}^{2} j² ).  (6)
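As an illustration, the per-state LPC analysis of Eqs. (3)-(6) can be sketched as follows. This is our own minimal reimplementation, not the authors' code; it assumes the paper's sign convention for the predictor coefficients in the conversion formula of Eq. (5).

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Toeplitz system of Eq. (3) via the Levinson-Durbin
    recursion, returning the polynomial A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p
    for normalized autocorrelation values r[0..p]."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        new_a[1:m] += k * a[m - 1:0:-1]     # a_j <- a_j + k * a_{m-j}
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)                # prediction-error update
    return a

def lpc_to_cepstrum(alpha, p):
    """Eq. (5): c(m) = a(m) + sum_{k=1}^{m-1} (k/m) c(k) a(m-k); `alpha`
    holds the predictor coefficients a(1..p)."""
    c = np.zeros(p + 1)
    for m in range(1, p + 1):
        c[m] = alpha[m - 1] + sum((k / m) * c[k] * alpha[m - k - 1]
                                  for k in range(1, m))
    return c[1:]

def delta_cepstrum(context):
    """Eq. (6): regression over a five-frame context window; `context` is a
    list of five cepstral vectors for j = -2..2."""
    num = sum(j * c for j, c in zip(range(-2, 3), context))
    return num / sum(j * j for j in range(-2, 3))
```

With the polynomial convention above, the predictor vector that satisfies R·α = r of Eq. (3) is α = -a[1:].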


2.2 Formulation of the adaptive signal limiter

For recognition of noisy speech, it has been observed that employing a signal limiter to smooth a speech signal in the time domain leads to significant performance improvement. The basic theory of a signal limiter can be roughly described as follows (Lee and Lin, 1993). When a signal x is passed through a signal limiter, the signal limiting operation is equivalent to performing a nonlinear transformation on the input signal, so that the corresponding output signal y can be characterized by an error function of the form

y = s(x) = ( K / sqrt(2πσ²) ) · ∫₀ˣ exp( -t²/(2σ²) ) dt,  (7)

where K is a scaling constant and σ² is a tunable factor for adjusting the degree of smoothing of the signal limiting operation.

In light of the above smoothing property, a signal limiter can readily be extended to the processing of speech signals in a noisy environment. Consider an input speech signal x, approximated by a zero-mean, stationary Gaussian process with variance σ_x², whose density function is

g(x) = ( 1 / sqrt(2πσ_x²) ) · exp( -x²/(2σ_x²) ).  (8)

Then the output y of the signal limiter has the density function (see Appendix A)

h(y) = h(s(x)) = ( sqrt(δ) / K ) · exp( -(δ - 1) x² / (2δσ_x²) ),  (9)

where x = s^{-1}(y), and δ denotes the smoothing factor of the signal limiter, defined as δ = σ²/σ_x². The larger the value of δ, the smaller the value of the output signal y. When the smoothing factor δ approaches 0, the signal limiter becomes a hard limiter of the form

y = f(x) = K/2 if x > 0; 0 if x = 0; -K/2 if x < 0.  (10)

A signal limiting operation can also be interpreted as an arcsin transformation in the autocorrelation domain. Assume that the autocorrelation functions of the input speech signal x and its signal-limited output y are denoted r_x(τ) and r_y(τ), respectively. Then the normalized autocorrelation function of the signal-limited output y can be formulated as (Lee and Lin, 1993)

rr_y(τ) ≡ r_y(τ) / r_y(0) = sin^{-1}[ rr_x(τ) / (1 + δ) ] / sin^{-1}[ 1 / (1 + δ) ],  (11)

where rr_x(τ) ≡ r_x(τ) / r_x(0) is the normalized autocorrelation function of the input speech signal x. By properly adjusting the smoothing factor δ, various degrees of smoothing can be obtained. When δ approaches infinity, the normalized autocorrelation function of the input speech signal, rr_x(τ), is almost equal to the normalized autocorrelation function of the corresponding signal-limited output, rr_y(τ). Furthermore, in the case of δ = 0, the normalized autocorrelation function of the signal-limited output rr_y(τ) reduces to the following equation (see Appendix B):

rr_y(τ) = (2/π) · sin^{-1}[ rr_x(τ) ].  (12)

In the method presented by Lee and Lin (Lee and Lin, 1993), a hard limiter is used as a pre-processor to reduce the variability of feature vectors in noisy conditions. That is, a pre-determined smoothing factor is used throughout a speech signal. However, it is known that the segments of clean speech with less energy are influenced most by ambient noise and thus require heavy smoothing. As for the clean segments and the segments with high SNR, excessive smoothing not only destroys their distinct features but also reduces the discriminability of the speech features in a noisy environment. Therefore, we propose an adaptive signal limiter (ASL) in which the smoothing factor δ is related to the SNR and adapted on a frame-by-frame basis. In the proposed adaptive signal limiter, the smoothing factor δ is empirically formulated as

δ(SNR) = δ_min, if SNR < SNR_LB;
δ(SNR) = ( (δ_max - δ_min) / (SNR_UB - SNR_LB) ) · (SNR - SNR_LB) + δ_min, if SNR_LB ≤ SNR ≤ SNR_UB;  (13)
δ(SNR) = δ_max, if SNR > SNR_UB,

and

SNR ≡ 10 · log_10( E_s / E_n ),  (14)

where δ_min, δ_max, SNR_LB and SNR_UB are tuning constants, E_s is the frame energy of a clean speech

signal and E_n is the noise energy. In the subsequent experiments, the arcsin transformation shown in Eqs. (11)-(14) is used to compute the normalized autocorrelation of a signal-limited signal, rather than directly applying the nonlinear operation of Eq. (7) to the input signal. This is because the underlying hidden Markov models are indirectly represented by the LPC-based spectral features, and the LPC spectral features can be efficiently calculated from the autocorrelation function by means of Eq. (5). Moreover, compared with the signal limiting operation shown in Eq. (7), the arcsin transformation requires less computation.
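The frame-adaptive smoothing of Eqs. (11), (13) and (14) can be sketched as follows. This is an illustrative implementation, not the authors' code; the defaults mirror the white-noise setting used later (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB).

```python
import numpy as np

def smoothing_factor(snr_db, d_min=0.0, d_max=1.0, snr_lb=20.0, snr_ub=30.0):
    """Eq. (13): piecewise-linear, SNR-dependent smoothing factor. Low-SNR
    frames get the small factor (heavy smoothing); high-SNR frames get the
    large factor (light smoothing)."""
    if snr_db < snr_lb:
        return d_min
    if snr_db > snr_ub:
        return d_max
    return (d_max - d_min) / (snr_ub - snr_lb) * (snr_db - snr_lb) + d_min

def arcsin_smooth(rr_x, delta):
    """Eq. (11): arcsin transformation of a normalized autocorrelation
    sequence rr_x (values in [-1, 1])."""
    rr_x = np.asarray(rr_x, dtype=float)
    return np.arcsin(rr_x / (1.0 + delta)) / np.arcsin(1.0 / (1.0 + delta))
```

For small arguments sin^{-1}(u) ≈ u, so as δ grows the transform approaches the identity, while δ = 0 reproduces the hard-limiter arcsin law of Eq. (12).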

2.3 Adaptations of dynamic spectral feature and covariance matrix

When a signal limiting operation is performed on the autocorrelation function of a speech signal, it not only smooths the instantaneous spectral vectors but also reduces the corresponding dynamic spectral features and model covariance matrices. Therefore, in order to achieve higher consistency, the model's dynamic spectral features and covariance matrices must be adapted as well. This adaptation procedure proceeds as follows. When the t-th frame y_t of a testing utterance Y is evaluated on state Λ_{w,i}, the cepstral vectors c_{t,j} of its context frames y_{t,j}, for -2 ≤ j ≤ 2, are first transformed to give the corresponding normalized autocorrelation vectors r_{t,j}. Then those normalized autocorrelation vectors r_{t,j} = [ r_{t,j}(1), ..., r_{t,j}(p) ]^T are processed by the following arcsin transformation:

r̃_{t,j}(τ) = sin^{-1}[ r_{t,j}(τ) / (1 + δ(SNR_{t,j})) ] / sin^{-1}[ 1 / (1 + δ(SNR_{t,j})) ], for -2 ≤ j ≤ 2 and 1 ≤ τ ≤ p.  (15)

In the above equation, the variable SNR_{t,j} is determined by

SNR_{t,j} = 10 · log_10( (E_{t+j} - E_n) / E_n ),  (16)

where E_t is the t-th frame energy in the testing utterance Y, and E_n is the noise energy, which can be roughly estimated by selecting the lowest frame energy in the testing utterance Y, i.e., E_n = min{ E_1, E_2, ..., E_{T_y} }. Once the smoothed autocorrelation vectors r̃_{t,j}, for -2 ≤ j ≤ 2, are obtained, the smoothed testing cepstral vector c̃_{t,j} of ỹ_{t,j} can be calculated by means of the LPC-to-cepstrum conversion formula. Moreover, the corresponding smoothed testing delta cepstral vector d̃_t can also be solved by using the following equation:

d̃_t = ( Σ_{j=-2}^{2} j · c̃_{t,j} ) / ( Σ_{j=-2}^{2} j² ),  (17)


and thus the smoothed testing feature vector ỹ_t = [c̃_t, d̃_t] can be taken as ỹ_t = [c̃_{t,0}, d̃_t]. Similarly, in order to avoid introducing a mismatch between the testing speech signals and the reference models, the mean vector of state Λ_{w,i} should also be smoothed by using Eq. (11) with the same smoothing factor, yielding its smoothed version μ̃_{w,i} = [c̃_{w,i}, d̃_{w,i}]. By substituting μ̃_{w,i} = [c̃_{w,i}, d̃_{w,i}] and ỹ_t = [c̃_t, d̃_t] into Eq. (1), we obtain

p̃( ỹ_t | (μ̃_{w,i}, Σ_{w,i}) ) = (2π)^{-p} |Σ_{w,i}|^{-1/2} exp[ -(1/2) (ỹ_t - μ̃_{w,i})^T Σ_{w,i}^{-1} (ỹ_t - μ̃_{w,i}) ].  (18)

By differentiating the logarithm of Eq. (18) with respect to Σ_{w,i} and setting the result to zero, we can obtain the optimal smoothed covariance matrix Σ̃_{w,i} which maximizes the likelihood function in Eq. (18), that is (see Appendix C),

Σ̃_{w,i} = (1/(2p)) · [ Σ_{m=1}^{p} ( c̃_t(m) - c̃_{w,i}(m) )² / σ²_{w,i}(m) + Σ_{m=1}^{p} ( d̃_t(m) - d̃_{w,i}(m) )² / σ²_{w,i}(p+m) ] · Σ_{w,i}.  (19)

Finally, the resulting smoothed output likelihood measure can be rewritten as

p̃( ỹ_t | Λ̃_{w,i} ) = (2π)^{-p} |Σ̃_{w,i}|^{-1/2} exp[ -(1/2) (ỹ_t - μ̃_{w,i})^T Σ̃_{w,i}^{-1} (ỹ_t - μ̃_{w,i}) ].  (20)
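Under our reading of Eq. (19), the smoothed covariance is the original diagonal covariance rescaled by a scalar maximum-likelihood factor; the following sketch makes that reading explicit. The function and argument names are illustrative, not the paper's notation.

```python
import numpy as np

def ml_covariance_scale(c_t, d_t, c_wi, d_wi, var_c, var_d):
    """Sketch of the ML covariance adaptation of Eq. (19), assuming the
    smoothed covariance is a scalar rescaling of the original diagonal
    covariance: beta = (1/2p) * sum of squared, variance-normalized
    deviations of the smoothed cepstral (c) and delta-cepstral (d)
    features from the smoothed state means."""
    p = len(c_t)
    beta = (np.sum((c_t - c_wi) ** 2 / var_c)
            + np.sum((d_t - d_wi) ** 2 / var_d)) / (2.0 * p)
    return beta  # smoothed covariance = beta * diag(var_c, var_d)
```

When the smoothed observation coincides with the smoothed mean, the factor collapses to zero, so in practice a floor on beta would be needed; that guard is our own remark, not part of the paper.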

2.4 Implementation of a speech recognizer with adaptive signal limiter

To be more detailed, the overall system diagram for implementing an HMM-based speech recognizer with the adaptive signal limiter is depicted in Fig. 1. In the training phase, we first train a set of word models by using the segmental k-means algorithm and the Viterbi decoding method (Juang and Rabiner, 1990). The state statistics of a word model are indirectly represented by the normalized autocorrelation vectors of a five-frame context window. When a testing utterance Y is to be recognized, we first use Eq. (15) and Eq. (16) to estimate the frame-dependent smoothing factor and perform the arcsin transformation on the normalized autocorrelation vectors r_{t,j}. When the arcsin-transformed vectors r̃_{t,j} are obtained, we can solve for the smoothed cepstral vector c̃_{t,j} and its delta cepstral vector by the LPC-to-cepstrum conversion formula and Eq. (17). Moreover, the same smoothing factor is also used to smooth the state statistics of the word models. Once the smoothed autocorrelation vectors r̃_{w,i,j} are obtained, the smoothed cepstral vectors c̃_{w,i,j} can likewise be calculated by means of the LPC-to-cepstrum conversion formula, and the corresponding smoothed delta cepstral vector d̃_{w,i} and covariance matrix Σ̃_{w,i} can be solved by using Eq. (6) and Eq. (19). Finally, by substituting ỹ_t, μ̃_{w,i} and Σ̃_{w,i} into Eq. (20), we can obtain the smoothed output likelihoods.

(Figure 1 is about here.)

3. Effectiveness and robustness of the adaptive signal limiter

3.1 Database and experimental conditions

A multispeaker (50 male and 50 female speakers) isolated Mandarin digit recognition (Lee and Wang,

1994) was conducted to demonstrate the effectiveness and robustness of the proposed adaptive signal

limiter. There are three sessions of data collection in the digit database. For each session, every speaker

uttered a set of 10 Mandarin digits. Speech signals are sampled at 8 kHz. Each frame contains 256

samples with 128 samples overlapped, and is multiplied by a 256-point Hamming window. Endpoints are

not detected so that each utterance still contains about 0.1~0.5 seconds of pre-silence and post-silence.

Each digit is modeled as a left-to-right HMM without jumps in which the output of each state is a

2-mixture Gaussian distribution of feature vectors. Each word model contains seven to nine states

including pre-silence and post-silence states. The feature vector is indirectly represented by the 12-order

normalized autocorrelation vectors of a five-frame context window. This representation can then be transformed into a 12-order cepstral vector and a 12-order delta cepstral vector. Moreover, the

NOISEX-92 noise database (Varga et al., 1992b) was used for generating noisy speech. The

subsequent experiments were conducted to examine the following problems : (1) influence of signal

limiters on the LPC spectra of clean speech, (2) influence of signal limiters on the LPC spectra of noisy

speech, and (3) effects of signal limiters on speech discriminability in a noisy environment.
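The front end described above (8 kHz sampling, 256-sample frames with 128-sample overlap, Hamming window) can be sketched as follows; this is an illustrative implementation and the function names are our own.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a signal into overlapping frames and apply a Hamming window,
    matching the analysis conditions above (256-sample frames with a
    128-sample shift, i.e. 32 ms windows with 16 ms shift at 8 kHz)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([win * x[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])

def frame_energies(frames):
    """Per-frame energies; in the SNR estimate of Eq. (16) the noise
    energy E_n is taken as the minimum of these over the utterance."""
    return np.sum(frames ** 2, axis=1)
```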


3.2 Influence of signal limiters on LPC spectra of clean speech

A sample utterance of Mandarin digit ‘1’ uttered by a male speaker is used to demonstrate the

influence of signal limiters on LPC spectra of clean speech. The 12-order LPC spectrum analysis is

performed on a 32 msec window with 16 msec frame shift. To observe the spectral variation in frequency

domain, we plotted the LPC spectra of 15 consecutive frames extracted from the middle portion of the

sample utterance. Figure 2 shows the log LPC spectra of the sample utterance ‘1’ in the cases of without

signal limiter, with hard limiter and with adaptive signal limiter. From this figure, we can observe that the

formants of utterance ‘1’ occur about at the positions of 200Hz, 1950Hz, 3100 Hz and 3350Hz. After

applying a signal limiter, parts of the original spectra become smoother and their formant peaks are broadened. In particular, in the case of using a hard limiter, the second, third and fourth

formants are severely suppressed. Since the location and spacing of formant frequencies are highly

correlated with the shape of a vocal tract, this suppression will reduce the discriminability of speech

utterances and lead to misrecognition. On the other hand, we can also find that the spectral shape in the

case of using the adaptive signal limiter is almost unaffected. This is mainly because the adaptive signal limiter employs a larger smoothing factor in clean conditions, which keeps the arcsin-transformed autocorrelation function almost unchanged.

(Figure 2 is about here.)

3.3 Influence of signal limiters on LPC spectra of noisy speech

In this subsection, we explore the influence of signal limiters on LPC spectra of noisy speech. This is

shown in Fig. 3 and Fig. 4, where we plot the LPC spectra of the same utterance shown in Fig. 2 with 20

dB additive white Gaussian noise and factory noise, respectively. When white noise is added to clean

speech, there gradually appears an abnormal formant peak in the LPC spectra of distorted utterance ‘1’

at about 1125 Hz ~ 1625 Hz, as shown in Fig. 3 (a). This phenomenon also happens in the case of adding factory noise to clean speech, where the abnormal formant peak

occurs at about 1000 Hz ~ 1375 Hz. However, compared with the baseline case, the spectral distortion in the LPC spectra with a signal limiter is less pronounced. This property verifies the robustness of signal limiters in a noisy environment. In addition, a comparison of Fig. 3, Fig. 4 and Fig. 2 shows that excessive smoothing of the autocorrelation function suppresses parts of the formant peaks and loses some important information about the shape of the vocal tract. Instead of using a fixed smoothing factor, an

adaptive signal limiter adaptively adjusting the degree of smoothness can not only effectively reduce the

variability of speech features, but also preserve more useful spectral information embedded in a speech

signal.

(Figure 3 and figure 4 are about here.)

3.4 Effects of signal limiters on speech discriminability in a noisy environment

In this subsection, we evaluate the robustness of signal limiters in noisy conditions. First, the first two

sessions of database were used to train a set of word models by using the segmental k-means algorithm.

To generate noisy speech, white Gaussian noise and factory noise were separately added to the 100

utterances of Mandarin digit ‘1’ in the third session. Those distorted utterances were then evaluated on

the 10 word models to obtain maximum log likelihoods. For each word model, we can find the average

log likelihoods by averaging the accumulation of all log likelihoods corresponding to the same word

model. In Fig. 5 and Fig. 6, we plot the average log likelihoods of utterance ‘1’ as a function of SNR

values in the cases of white Gaussian noise and factory noise, respectively. When the underlying environment becomes noisy, i.e., below an SNR threshold, utterance ‘1’ is easily misrecognized as

utterance ‘7’. For white noise, the SNR thresholds occur at about 20 dB, 15 dB and 7 dB for the cases

of without signal limiter, with hard limiter and with adaptive signal limiter, respectively. Similarly, for

factory noise, the SNR thresholds occur at about 15 dB, 10 dB and 3 dB for the cases of without signal limiter, with hard limiter and with adaptive signal limiter, respectively. These experimental results reveal

that an equivalent gain of about 12 ~ 13 dB and 7 ~ 8 dB in SNR can be achieved when the adaptive

signal limiter is compared with the baseline and hard limiter for recognition of utterance ‘1’ in noisy

conditions, respectively.
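The noisy test utterances above are obtained by scaling an additive noise record to a predetermined SNR before mixing. The paper does not state its exact mixing convention, so the following is a minimal sketch, and `add_noise_at_snr` is a hypothetical helper:

```python
import numpy as np

def add_noise_at_snr(clean, noise, target_snr_db):
    """Scale the noise so the mixture has the requested global SNR,
    then add it to the clean utterance (illustrative sketch only)."""
    noise = noise[: len(clean)]            # truncate noise to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g chosen so that 10*log10(p_clean / (g^2 * p_noise)) == target_snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return clean + gain * noise
```

The same routine serves for white Gaussian noise and for recorded noise types such as factory noise, since only the noise record changes.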

(Figure 5 and Figure 6 are about here.)

4. Experimental results and discussion

In this section, a multispeaker (50 male and 50 female) recognition task on isolated Mandarin digits (Lee and Wang, 1994) was conducted to demonstrate the merits of the proposed method. The experimental setup and underlying database have been described in subsection 3.1. In our experiments, a conventional hidden Markov model without any signal limiter is referred to as the baseline system. Ambient noises, including white Gaussian noise, F16 noise and factory noise, were separately added to clean speech at predetermined SNRs of 20, 15, 10, 5 and 0 dB to generate the noisy speech signals. The parameters used in the proposed adaptive signal limiter under different noisy conditions were determined empirically as follows. First, the smoothing factor δ was initialized to 0 and increased in increments of ∆δ = 0.1 while SNRLB and SNRUB were kept constant. We observed that once the smoothing factor δ exceeds 1, further increases have little effect on digit recognition rates; the same holds for different settings of SNRLB and SNRUB. Therefore, the maximum smoothing factor was fixed at δmax = 1.0 throughout all experiments. Similarly, the SNR lower bound was chosen from the interval 0 ~ 30 dB and the SNR upper bound from the interval 20 ~ 50 dB, in increments of 5 dB, to find the set of SNR parameters giving the best digit recognition accuracy.
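The empirical parameter selection above amounts to a small grid search. The sketch below follows the stated sweep ranges; `evaluate_accuracy` is a hypothetical callback standing in for a full train/test run of the recognizer:

```python
import itertools

def tune_limiter_parameters(evaluate_accuracy):
    """Grid search over the limiter's SNR bounds as described in the text:
    delta_max is fixed at 1.0 (larger values showed no further effect),
    SNRLB is swept over 0..30 dB and SNRUB over 20..50 dB in 5 dB steps.
    `evaluate_accuracy(snr_lb, snr_ub)` returns digit accuracy (%)."""
    best_acc, best_params = -1.0, None
    for snr_lb, snr_ub in itertools.product(range(0, 31, 5), range(20, 51, 5)):
        if snr_lb >= snr_ub:       # lower bound must stay below upper bound
            continue
        acc = evaluate_accuracy(snr_lb, snr_ub)
        if acc > best_acc:
            best_acc, best_params = acc, (snr_lb, snr_ub)
    return best_acc, best_params
```

This brute-force sweep is cheap because only the limiter parameters change between runs; the per-noise optima reported in Tables 1-3 (e.g., SNRLB = 20 dB, SNRUB = 30 dB for white noise) would be the output of such a search.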

In Table 1, we compare the recognition accuracy of the baseline, parallel model combination (PMC), baseline with hard limiter and baseline with adaptive signal limiter for noisy speech corrupted by white noise. The results show that the baseline with hard limiter improves recognition accuracy at low SNR but performs worse at high SNR and in the clean condition. This is mainly because oversmoothing the autocorrelation function severely distorts important spectral information embedded in the original speech signal. In contrast, the improvement provided by the proposed adaptive signal limiter is remarkable, owing to the adaptive adjustment of the smoothing factor, and it consistently outperforms the hard limiter. This means that using larger smoothing factors in the clean condition and at high SNR is as important as using smaller smoothing factors at low SNR.

(Table 1 is about here.)

Moreover, we also find that the PMC method is superior to the proposed adaptive signal limiter in recognition accuracy. This superiority arises mainly because the PMC method decomposes the concurrent processes of speech and background noise, so the environmental mismatch can be effectively reduced by optimally combining the two processes in the linear spectral domain. In contrast, the environmental mismatch is not compensated during the signal limiting operation. The proposed adaptive signal limiter can be viewed as a weighting function that de-emphasizes speech segments with low SNR by heavily smoothing their features in the autocorrelation domain. This smoothing not only reduces feature variability in noisy conditions but also inevitably degrades some characteristics of the speech features. It is therefore unsurprising that the PMC method achieves better recognition accuracy than the proposed method. However, these comparisons do not imply that the proposed method is useless for noisy speech recognition. For segments with low SNR (e.g., distorted unvoiced segments), the adaptive signal limiter appears to be more effective than the PMC method in some noisy conditions. This suggests that model adaptation is useful at high and medium SNRs, while feature smoothing is more suitable at low SNR. As described by Lee and Lin (1993), a signal limiter can be combined with other noise-robust speech recognition techniques to obtain additional performance improvement. It is therefore expected that properly integrating the adaptive signal limiter with other noise-robust techniques, such as the WPM and PMC methods, could yield further improvement in recognition accuracy.

Likewise, comparisons of the different methods in the presence of factory noise and F16 noise are illustrated in Table 2 and Table 3, respectively. The proposed method consistently achieves remarkable improvements in recognition accuracy, which verifies the effectiveness and robustness of the adaptive signal limiter in white noise as well as colored noise. As for computation time, the adaptive signal limiter requires less computation than the PMC method; the reduction in CPU time is about 25%. Details of the CPU times for the different methods are given in Table 4.

(Table 2, Table 3, and Table 4 are about here.)

5. Conclusion

In this paper, we explored the influence of a hard limiter on the LPC spectra of clean and noisy speech. It was found that excessive smoothing in the autocorrelation domain of a speech signal suppresses parts of the formant peaks and reduces the discriminability of speech features in noisy conditions. To remedy this weakness of the hard limiter, an adaptive signal limiter was proposed to improve its robustness. In our approach, the degree of smoothing is tied to the SNR and determined adaptively on a frame-by-frame basis: the smaller the SNR of a speech frame, the smaller its smoothing factor. Experimental results verify that the adaptive signal limiter outperforms the hard limiter at various SNRs. This improvement stems mainly from the fact that the adaptive signal limiter not only reduces feature variability at low SNR but also preserves important information borne in the speech segments with high SNR.

Acknowledgement

The authors would like to thank Dr. Lee-Min Lee of Mingchi Institute of Technology, Taipei, Taiwan, for generously sharing his experience in implementing the new representation of the hidden Markov model with a five-frame context window.

References

Carlson, B. A., Clements, M. A., 1994. A projection-based likelihood measure for speech recognition in noise. IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 97-102.

Chien, J. T., 1997a. Speech recognition under telephone environments. Ph.D. Thesis. Department of

Electrical Engineering, National Tsing Hua University, Taiwan, R.O.C.

Chien, J. T., Lee, L. M., Wang, H. C., 1997b. Extended studies on projection-based likelihood measure for noisy speech recognition. Revised for IEEE Trans. on Speech and Audio Processing.

Flores, J. A. N., Young, S. J., 1992. Continuous speech recognition in noise using spectral subtraction

and HMM adaptation. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). San

Francisco. Vol. 1, pp. 409-412.

Gales, M. J. F., Young, S. J., 1995. Robust speech recognition in additive and convolutional noise using

parallel model combination. Computer Speech and Language, Vol. 4, pp. 352-359.

Juang, B. H., Rabiner, L. R., 1990. The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 38, No. 9, pp. 1639-1641.

Lee, C. H., Lin, C. H., 1993. On the use of a family of signal limiters for recognition of noisy speech.

Speech Communication, Vol. 12, pp. 383-392.

Lee, L. M., Wang, H. C., 1994. A study on adaptation of cepstral and delta cepstral coefficients for noisy speech recognition. Proc. of Int. Conf. Spoken Language Processing (ICSLP). Yokohama, Japan, pp. 1011-1014.

Lee, L. M., Wang, H. C., 1995. Representation of hidden Markov model for noise adaptive speech

recognition. Electronics Letters, Vol. 31, No. 8, pp. 616-617.

Mansour, D., Juang, B. H., 1989. A family of distortion measures based upon projection operation for robust speech recognition. IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 37, pp. 1659-1671.

Rabiner, L., and Juang, B. H., 1993. Fundamentals of Speech Recognition, Englewood Cliffs, New

Jersey, Prentice-Hall, pp. 112-117.

Sankar, A., Lee, C. H., 1996. A maximum-likelihood approach to stochastic matching for robust speech

recognition. IEEE Trans. on Speech and Audio Processing, Vol. 4, pp. 190-202.

Varga, A. P., Moore, R. K., 1992a. Hidden Markov model decomposition of speech and noise. IEEE

Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco. pp. 845-848.

Varga, A. P., Steeneken, H.J.M., Tomlinson, M., Jones, D., 1992b. The NOISEX-92 study on the

effect of additive noise on automatic speech recognition, Technical Report, DRA Speech Research

Unit, Malvern, England.


Table 1. Comparison of digit recognition rates (%) for white noise. (δmin = 0.0, δmax = 1.0, SNRLB = 20 dB, SNRUB = 30 dB.)

Methods            clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    80.2    65.7    48.8    25.6    10.6
PMC                98.7    92.2    84.6    72.7    59.3    47.1
hard limiter       90.6    76.8    68.5    55.8    35.9    21.4
adaptive limiter   95.2    85.1    76.4    68.1    58.5    49.7

Table 2. Comparison of digit recognition rates (%) for factory noise. (δmin = 0.0, δmax = 1.0, SNRLB = 10 dB, SNRUB = 40 dB.)

Methods            clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    91.2    81.4    65.9    46.9    25.4
PMC                98.7    95.0    91.8    82.3    73.2    52.5
hard limiter       90.6    86.3    80.2    71.3    57.5    30.0
adaptive limiter   94.9    91.9    87.8    77.7    69.2    53.3

Table 3. Comparison of digit recognition rates (%) for F16 noise. (δmin = 0.0, δmax = 1.0, SNRLB = 15 dB, SNRUB = 35 dB.)

Methods            clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline           98.9    91.1    78.9    65.2    43.9    21.0
PMC                98.7    95.9    92.5    87.4    68.1    44.5
hard limiter       90.6    84.9    77.1    67.8    54.6    29.4
adaptive limiter   95.1    91.4    85.3    78.7    61.9    42.2

Table 4. Comparison of computation costs on a Pentium II-266 MHz personal computer.

Methods                      baseline   PMC     hard limiter   adaptive limiter
recognition CPU time (sec)   0.203      4.038   0.291          2.981


Fig. 1. Block diagram for implementing a speech recognizer with adaptive signal limiter. [Diagram not reproduced; its blocks include: training utterances, autocorr. vectors of context windows, arcsin transform, estimate smoothing factor δ, autocorr. → LPC, LPC → cepstrum, find smoothed delta cepstrum and covariance matrix, segmental k-means, word models, testing utterances, autocorr. vectors of a context window, find smoothed delta cepstrum, speech recognizer (model adaptation), recognition results.]

(a) LPC log magnitude spectra without signal limiter. [3-D plot not reproduced; axes: frequency (Hz), frame index, magnitude (dB).]

(b) LPC log magnitude spectra with hard limiter. [Plot not reproduced.]

(c) LPC log magnitude spectra with adaptive signal limiter. (δmin = 0.0, δmax = 1.0, SNRLB = 20 dB, SNRUB = 30 dB.) [Plot not reproduced.]

Fig. 2. The various LPC log magnitude spectra of utterance ‘1’ in clean condition.


(a) LPC log magnitude spectra without signal limiter. [3-D plot not reproduced; axes: frequency (Hz), frame index, magnitude (dB).]

(b) LPC log magnitude spectra with hard limiter. [Plot not reproduced.]

(c) LPC log magnitude spectra with adaptive signal limiter. (δmin = 0.0, δmax = 1.0, SNRLB = 20 dB, SNRUB = 30 dB.) [Plot not reproduced.]

Fig. 3. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise.

(a) LPC log magnitude spectra without signal limiter. [3-D plot not reproduced; axes: frequency (Hz), frame index, magnitude (dB).]

(b) LPC log magnitude spectra with hard limiter. [Plot not reproduced.]

(c) LPC log magnitude spectra with adaptive signal limiter. (δmin = 0.0, δmax = 1.0, SNRLB = 10 dB, SNRUB = 40 dB.) [Plot not reproduced.]

Fig. 4. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise.

(a) Comparison of average log likelihoods without signal limiter. [Plot not reproduced; x-axis: SNR values (0 dB ~ clean); y-axis: log likelihoods; one curve per word model 0-9.]

(b) Comparison of average log likelihoods with hard limiter. [Plot not reproduced.]

(c) Comparison of average log likelihoods with adaptive signal limiter. (δmin = 0.0, δmax = 1.0, SNRLB = 20 dB, SNRUB = 30 dB.) [Plot not reproduced.]

Fig. 5. The average log likelihoods of utterance ‘1’ evaluated on various word models in white noise.

(a) Comparison of average log likelihoods without signal limiter. [Plot not reproduced; x-axis: SNR values (0 dB ~ clean); y-axis: log likelihoods; one curve per word model 0-9.]

(b) Comparison of average log likelihoods with hard limiter. [Plot not reproduced.]

(c) Comparison of average log likelihoods with adaptive signal limiter. (δmin = 0.0, δmax = 1.0, SNRLB = 10 dB, SNRUB = 40 dB.) [Plot not reproduced.]

Fig. 6. The average log likelihoods of utterance ‘1’ evaluated on various word models in factory noise.