Smoothing Hidden Markov Models by Using an Adaptive Signal Limiter for Noisy Speech Recognition

Wei-Wen Hung, Department of Electrical Engineering, Ming Chi Institute of Technology, Taishan, 243, Taiwan, Republic of China. E-mail: [email protected]; Fax: 886-02-2903-6852; Tel.: 886-02-2906-0379

Hsiao-Chuan Wang, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30043, Taiwan, Republic of China. E-mail: [email protected]; Fax: 886-03-571-5971; Tel.: 886-03-574-2587

Paper No.: 1033 (second review). Corresponding Author: Hsiao-Chuan Wang

Key Words: hidden Markov model (HMM), hard limiter, adaptive signal limiter (ASL), autocorrelation function, arcsin transformation.
Fig. 5 The average log likelihoods of utterance '1' evaluated on various word models in white noise: (a) without signal limiter; (b) with hard limiter; (c) with adaptive signal limiter.
Fig. 6 The average log likelihoods of utterance '1' evaluated on various word models in factory noise: (a) without signal limiter; (b) with hard limiter; (c) with adaptive signal limiter.
Table 4. Comparison of computation costs based on a Pentium II-266 MHz personal computer.
1. Introduction
When a speech recognition system trained in a well-defined environment is used in real-world
applications, the acoustic mismatch between training and testing environments severely degrades its
recognition accuracy. This mismatch arises from a wide variety of distortion sources, such as ambient
additive noise, channel effects and the speaker's Lombard effect. During the past several decades,
researchers have focused their attention on this mismatch problem and tried to narrow the
mismatch gap. Many algorithms have been proposed and successfully applied to robust
speech recognition. Generally speaking, the methods for handling noisy speech recognition can be
roughly classified into the following approaches (Sankar and Lee, 1996). The first approach tries to
minimize the distance measures between reference models and testing signals by adaptively adjusting
minimize the distance measures between reference models and testing signals by adaptively adjusting
speech signals in feature space. For example, Mansour and Juang (Mansour and Juang, 1989) found that
the norm of a cepstral vector is shrunk under noise contamination. Therefore, they used a first-order
equalization method to adapt the cepstral means of reference models so that the shrinkage of speech
features can be adequately compensated. Likewise, Carlson and Clement (Carlson and Clement, 1994)
also proposed a weighted projection measure (WPM) for recognition of noisy speech in the framework
of a continuous density hidden Markov model (CDHMM). In addition, the norm shrinkage of cepstral
means also leads to a reduction of the HMM covariance matrices. Thus, Chien et al. (Chien, 1997a;
Chien et al., 1997b) proposed a variance-adapted and mean-compensated likelihood measure
(VA-MCLM) to adapt the mean vector and covariance matrix simultaneously.
The second approach estimates a transformation function in model space for transforming reference
models into the testing environment, so that the environmental mismatch gap can be effectively reduced. In
the literature, a number of techniques have been proposed to compensate for the ambient noise effect in model space.
Among them, one of the most promising is the so-called parallel model combination (PMC).
In the PMC algorithm, Varga and Moore (Varga and Moore, 1992a) adapted the statistics of reference
models to meet the testing conditions by optimally combining the reference models and noise model in
the linear spectral domain. In later years, several related works were successively reported for
improving the performance of the PMC method. Flores and Young (Flores and Young, 1992) integrated
the spectral subtraction (SS) and PMC methods to seek further improvement in recognition accuracy. In
addition, Gales and Young (Gales and Young, 1995) extended PMC scheme to include the effect of
convolutional noise.
In the third approach, a more robust feature representation is developed in signal space so that the
speech feature is invariant or less susceptible to environmental variations. In this approach, Lee and Lin
(Lee and Lin, 1993) developed a family of signal limiters as a preprocessor to smooth speech signals.
When a speech signal is passed through a signal limiter with zero smoothing factor (i.e., a hard limiter),
the hard limiting operation preserves the sign of an input speech signal and ignores its magnitude. Thus,
the hard-limited speech signal is affected by ambient noise only when the signal-to-noise ratio (SNR) is
relatively low. This smoothing process has been shown to be effective in reducing the
variability of feature vectors in a noisy environment, and it makes them less affected by ambient noise over a
wide range of SNR values. Experimental results for recognition of a 39-word alpha-digit vocabulary also
demonstrate that an equivalent gain of 5-7 dB in SNR can be achieved for a template-based DTW
recognizer.
However, from the experimental results reported by Lee and Lin (Lee and Lin, 1993), we can also
observe that the recognition accuracy for clean speech becomes worse when a hard limiter is used. This
phenomenon may be explained as follows. In an utterance, the amplitudes of unvoiced segments are
generally much lower than those of voiced segments. Heavy smoothing can reduce the feature
variability of speech segments with low SNR, but it also causes the loss of important
information embedded in the clean segments and the segments with high SNR. Therefore, a signal limiter
with a fixed smoothing factor might not work well for all segments of a speech utterance. We suggest
that the smoothing factor of a signal limiter should be related to the SNR and adapted on a frame-by-frame
basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and
dynamic spectral features of hidden Markov models (HMMs) and testing speech signals. In addition, in
order to properly reflect the variation of model covariance caused by applying the signal limiting
operation to the state statistics of word models, the covariance matrix is also adapted in
the sense of maximum likelihood (ML) estimation.
The layout of this paper is as follows. In the subsequent section, we describe the detailed formulation
of the proposed adaptive signal limiter and its extension to the framework of a continuous density hidden
Markov model. In Section 3, we investigate the behavior of LPC spectra of a speech utterance and its
signal-limited version under the influence of various ambient noises. In addition, a series of experiments
were conducted to compare the discriminability of different signal limiters in various noisy conditions.
Experiments on multispeaker isolated Mandarin digit recognition are presented in Section 4
to evaluate the effectiveness and robustness of the proposed method in the presence of ambient noise.
Finally, a conclusion is drawn in Section 5.
2. Smoothing hidden Markov models by using an adaptive signal limiter
In this section, we describe the detailed formulation of the proposed adaptive signal limiter (ASL) and
its extension to the framework of an HMM-based speech recognizer.
2.1 Representation of the underlying hidden Markov models
Conventionally, for a continuous density hidden Markov model (CDHMM), the output likelihood of the
t-th frame of the testing utterance Y = { y_t = [c_t, d_t], 1 ≤ t ≤ T_y }, evaluated on the
statistics of the i-th state of the word model Λ(w) = { Λ_{w,i} = (μ_{w,i}, Σ_{w,i}), 1 ≤ i ≤ S_w }, can be
characterized by a multivariate Gaussian probability density function (pdf) and formulated as

p(y_t | Λ_{w,i}) = (2π)^{-p} · |Σ_{w,i}|^{-1/2} · exp[ -(1/2) · (y_t − μ_{w,i})^T · Σ_{w,i}^{-1} · (y_t − μ_{w,i}) ],   (1)

where μ_{w,i} = [c_{w,i}, d_{w,i}] denotes the mean vector of the i-th state of the word model Λ(w) and
consists of the p-order cepstral vector c_{w,i} and the p-order delta cepstral vector d_{w,i}. Σ_{w,i} denotes
the covariance matrix of the i-th state of the word model Λ(w) and is simplified as a diagonal matrix, i.e.,
Σ_{w,i} = diag[ σ²_{w,i}(1), σ²_{w,i}(2), ..., σ²_{w,i}(2p) ]. However, in order to adequately reflect the
variation of the dynamic spectral features caused by applying a signal limiting operation to the instantaneous
spectral features, the representation of the state statistics in a conventional hidden Markov model is modified
slightly. In our approach, the mean vector μ_{w,i} = [c_{w,i}, d_{w,i}] of the i-th state of the word model
Λ(w) is indirectly represented by the normalized autocorrelation vectors of a five-frame context
window (Lee and Wang, 1995), that is, [ r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2} ], where
r_{w,i,j} = [ r_{w,i,j}(1), ..., r_{w,i,j}(p) ]^T, j = 0 denotes the instantaneous frame, j = -1, -2 the left context
frames and j = 1, 2 the right context frames. The estimation of these normalized autocorrelation vectors
in a five-frame context window proceeds as follows. First, a conventional hidden Markov model is
trained for each word by means of the segmental k-means algorithm. Then, based upon the obtained
word models, each frame in the training utterances is labeled with its decoded state identity by using the
Viterbi decoding algorithm. The instantaneous, left-context and right-context autocorrelation vectors
corresponding to the same state identity are collected and averaged to obtain the indirect representation
of the underlying hidden Markov models. For example, the normalized autocorrelation vectors of the i-th
state of the word model Λ(w) can be formulated by
[ r_{w,i,-2}, r_{w,i,-1}, r_{w,i,0}, r_{w,i,1}, r_{w,i,2} ] = (1/N_s) · Σ_{u,t} [ r^u_{w,t-2}, r^u_{w,t-1}, r^u_{w,t}, r^u_{w,t+1}, r^u_{w,t+2} ],   (2)

where r^u_{w,t} represents the normalized autocorrelation vector of the t-th frame of the u-th training
utterance of word w. The above summation runs over all the N_s frames which are labeled with the state
identity i of the word model Λ(w).
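The per-state averaging of Eq. (2) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the array layout and function name are ours, and it assumes each frame has already been labeled with its decoded state identity by Viterbi decoding.

```python
import numpy as np

def average_state_autocorr(frames, labels, state, context=2):
    """Average the five-frame context windows of normalized
    autocorrelation vectors over all frames decoded as `state` (Eq. 2).

    frames : (T, p) array, frames[t] = normalized autocorrelation
             vector [r_t(1), ..., r_t(p)] of frame t
    labels : length-T sequence of decoded state identities
    """
    T, p = frames.shape
    windows = []
    for t in range(context, T - context):
        if labels[t] == state:
            # stack [r_{t-2}, r_{t-1}, r_t, r_{t+1}, r_{t+2}]
            windows.append(frames[t - context:t + context + 1])
    if not windows:
        raise ValueError("no frames decoded as this state")
    # average over the N_s collected context windows
    return np.mean(windows, axis=0)          # shape (5, p)
```

The returned (5, p) array plays the role of [ r_{w,i,-2}, ..., r_{w,i,2} ] for one state.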
Based upon this indirect representation, the analysis equations of the linear predictive coding (LPC) model
can be expressed in matrix form as

R_{w,i,j} · a_{w,i,j} = r_{w,i,j},   for j = -2, ..., 2,   (3)

where R_{w,i,j} is an autocorrelation matrix of the form

R_{w,i,j} = [ r_{w,i,j}(0)      r_{w,i,j}(1)      ...   r_{w,i,j}(p-1) ]
            [ r_{w,i,j}(1)      r_{w,i,j}(0)      ...   r_{w,i,j}(p-2) ]
            [     ...               ...           ...       ...        ]
            [ r_{w,i,j}(p-1)    r_{w,i,j}(p-2)    ...   r_{w,i,j}(0)   ].   (4)
Since the autocorrelation matrix is Toeplitz, symmetric and positive definite, the LPC coefficient vector
a_{w,i,j} = [ a_{w,i,j}(1), a_{w,i,j}(2), ..., a_{w,i,j}(p) ]^T can be solved efficiently by the Levinson-Durbin
recursion (Rabiner and Juang, 1993). Once the LPC coefficient vector of Eq. (3) is obtained, the
corresponding cepstral vector c_{w,i,j} can be recursively calculated by using the LPC-to-cepstral
coefficient conversion formula

c_{w,i,j}(m) = a_{w,i,j}(m) + Σ_{k=1}^{m-1} (k/m) · c_{w,i,j}(k) · a_{w,i,j}(m−k),   1 ≤ m ≤ p.   (5)

Finally, the cepstral vector of the instantaneous frame, i.e., c_{w,i,j} for j = 0, is used as the mean cepstral
vector c_{w,i} of the i-th state of the word model Λ(w). In addition, the corresponding delta cepstral vector
d_{w,i} can be calculated by using the following equation:

d_{w,i} = [ Σ_{j=-2}^{2} j · c_{w,i,j} ] / [ Σ_{j=-2}^{2} j² ].   (6)
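The chain of Eqs. (3)-(6) — solving the Toeplitz system by the Levinson-Durbin recursion, converting LPC coefficients to cepstra, and forming the delta cepstrum — can be sketched as follows. This is an illustrative reconstruction under the textbook formulations cited above; the function names are ours.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Toeplitz system R·a = r of Eq. (3) for the LPC
    coefficients a(1..p), given r = [r(0), r(1), ..., r(p)]."""
    a = np.zeros(p + 1)                       # a[k] holds a(k), a[0] unused
    e = r[0]                                  # prediction error energy
    for m in range(1, p + 1):
        # reflection coefficient k_m
        k = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / e
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]
        a, e = a_new, e * (1.0 - k * k)
    return a[1:]                              # [a(1), ..., a(p)]

def lpc_to_cepstrum(a, p):
    """LPC-to-cepstral conversion of Eq. (5); a = [a(1), ..., a(p)]."""
    c = np.zeros(p + 1)                       # c[m] holds c(m), c[0] unused
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - k - 1]
                              for k in range(1, m))
    return c[1:]                              # [c(1), ..., c(p)]

def delta_cepstrum(context_ceps):
    """Delta cepstrum of Eq. (6), given the five context cepstral
    vectors stacked as rows [c_{-2}, c_{-1}, c_0, c_1, c_2]."""
    js = np.arange(-2, 3)
    return np.tensordot(js, context_ceps, axes=1) / np.sum(js ** 2)
```

For example, the autocorrelation sequence r = [1, 0.5, 0.25, 0.125] of a first-order process yields a = [0.5, 0, 0], whose cepstra follow c(m) = 0.5^m / m.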
2.2 Formulation of the adaptive signal limiter
For recognition of noisy speech, it has been observed that employing a signal limiter to smooth a
speech signal in the time domain leads to significant performance improvement. The basic theory of a signal
limiter can be roughly described as follows (Lee and Lin, 1993). When a signal x is passed through a
signal limiter, the signal limiting operation is equivalent to performing a nonlinear transformation on the
input signal, so that the corresponding output signal y can be characterized by an error
function of the form

y = s(x) = ( K / sqrt(2πσ²) ) · ∫_0^x exp( −t² / (2σ²) ) dt,   (7)

where K is a scaling constant and σ² is a tunable factor for adjusting the degree of smoothing of the
signal limiting operation.
In light of the above smoothing property, a signal limiter can be readily applied to the
processing of speech signals in a noisy environment. Consider an input speech signal x, approximated
by a zero-mean, stationary Gaussian process with variance σ_x², whose density function is

g(x) = ( 1 / sqrt(2πσ_x²) ) · exp( −x² / (2σ_x²) ).   (8)
Then, the output y of the signal limiter has the density function (see Appendix A)

h(y) = h(s(x)) = ( sqrt(δ) / K ) · exp[ (1−δ) · x² / (2δσ_x²) ],   (9)

where x = s^{-1}(y), and δ denotes the smoothing factor of the signal limiter, defined as δ = σ²/σ_x².
The larger the value of δ, the smaller the value of the output signal y. When the smoothing factor δ
approaches 0, the signal limiter reduces to a hard limiter of the form

y = f(x) = { K/2,   if x > 0;
             0,     if x = 0;
             −K/2,  if x < 0. }   (10)
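Numerically, the integral of Eq. (7) is a scaled Gaussian error function, s(x) = (K/2)·erf( x / (σ·sqrt(2)) ), so the limiter and its hard-limiting limit of Eq. (10) can be sketched as below. The particular values of K and σ are illustrative, not taken from the paper.

```python
import math

def signal_limiter(x, K=1.0, sigma=0.1):
    """Soft limiter of Eq. (7), written as a scaled error function."""
    return 0.5 * K * math.erf(x / (sigma * math.sqrt(2.0)))

def hard_limiter(x, K=1.0):
    """Hard-limiting limit of Eq. (10), reached as sigma -> 0."""
    return 0.0 if x == 0 else math.copysign(0.5 * K, x)
```

For |x| much larger than σ, signal_limiter() is already indistinguishable from hard_limiter(), which is exactly the δ → 0 behaviour described above.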
A signal limiting operation can also be interpreted as an arcsin transformation in the autocorrelation domain.
Assume that the autocorrelation functions of the input speech signal x and its signal-limited output y are
denoted as r_x(τ) and r_y(τ), respectively. Then, the normalized autocorrelation function of the
signal-limited output y can be formulated as (Lee and Lin, 1993)

r̄_y(τ) ≡ r_y(τ) / r_y(0) = sin^{-1}[ r̄_x(τ) / (1+δ) ] / sin^{-1}[ 1 / (1+δ) ],   (11)

where r̄_x(τ) ≡ r_x(τ)/r_x(0) is the normalized autocorrelation function of the input speech signal x.
By properly adjusting the smoothing factor δ, various degrees of smoothing can be obtained.
When δ approaches infinity, the normalized autocorrelation function r̄_y(τ) of the signal-limited output
is almost equal to the normalized autocorrelation function r̄_x(τ) of the input speech signal.
Furthermore, in the case of δ = 0, the normalized autocorrelation function r̄_y(τ) of the signal-limited
output reduces to the following equation (see Appendix B):

r̄_y(τ) = (2/π) · sin^{-1}[ r̄_x(τ) ].   (12)
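The arcsin mapping of Eq. (11) and its two limiting cases can be checked numerically with a short sketch (the function name is ours):

```python
import math

def arcsin_smooth(r_x, delta):
    """Normalized autocorrelation of the limiter output, Eq. (11).

    r_x   : normalized input autocorrelation value, |r_x| <= 1
    delta : smoothing factor (0 = hard limiter, large = no smoothing)
    """
    return (math.asin(r_x / (1.0 + delta))
            / math.asin(1.0 / (1.0 + delta)))
```

With delta = 0 the denominator is asin(1) = π/2, recovering Eq. (12); with a very large delta, asin(z) ≈ z makes the mapping nearly the identity, as stated above.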
Lee and Lin (Lee and Lin, 1993) used a hard limiter as a pre-processor to reduce the variability of
feature vectors in noisy conditions; that is, a pre-determined smoothing factor is used throughout a
speech signal. However, it is known that the low-energy segments of clean speech are influenced most
by ambient noise and thus require heavy smoothing, whereas for clean segments and segments with
high SNR, excessive smoothing not only destroys their distinct features but also reduces the
discriminability of the speech features in a noisy environment. Therefore, we propose an adaptive signal
limiter (ASL) in which the smoothing factor δ is related to the SNR and adapted on a frame-by-frame
basis. In the proposed adaptive signal limiter, the smoothing factor δ is empirically formulated as
δ(SNR) = { δ_min,                                                            if SNR < SNR_LB;
           ((δ_max − δ_min) / (SNR_UB − SNR_LB)) · (SNR − SNR_LB) + δ_min,   if SNR_LB ≤ SNR ≤ SNR_UB;
           δ_max,                                                            if SNR > SNR_UB }   (13)

and

SNR ≡ 10 · log10( E_s / E_n ),   (14)

where δ_min, δ_max, SNR_LB and SNR_UB are tuning constants, E_s is the frame energy of the clean speech
signal and E_n is the noise energy. In the subsequent experiments, the arcsin transformation shown in
Eqs. (11)-(14) is used to compute the normalized autocorrelation of a signal-limited signal, rather than
directly applying the nonlinear operation of Eq. (7) to the input signal. This is because the
underlying hidden Markov models are indirectly represented by LPC-based spectral features, and the
LPC spectral features can be efficiently calculated from the autocorrelation function by means of Eqs. (3)-(5).
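The piecewise-linear schedule of Eq. (13), driven by the SNR of Eq. (14), can be sketched as follows. The default tuning constants below are placeholders for illustration, not values from the paper.

```python
import math

def smoothing_factor(snr, d_min=0.0, d_max=10.0,
                     snr_lb=0.0, snr_ub=20.0):
    """Piecewise-linear smoothing factor delta(SNR) of Eq. (13):
    heavy smoothing (d_min) below snr_lb, none (d_max) above snr_ub,
    linear interpolation in between."""
    if snr < snr_lb:
        return d_min
    if snr > snr_ub:
        return d_max
    return (d_max - d_min) / (snr_ub - snr_lb) * (snr - snr_lb) + d_min

def frame_snr(e_s, e_n):
    """Frame-level SNR of Eq. (14) from speech and noise energies;
    the floor guards against a zero energy estimate."""
    return 10.0 * math.log10(max(e_s, 1e-12) / e_n)
```

Frames below SNR_LB thus receive the hard-limiter-like setting δ_min, while high-SNR frames are left essentially unsmoothed.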
Moreover, compared with the signal limiting operation of Eq. (7), the arcsin transformation
requires less computation.
2.3 Adaptations of dynamic spectral feature and covariance matrix
When a signal limiting operation is performed on the autocorrelation function of a speech signal, it not
only smooths the instantaneous spectral vectors but also reduces the corresponding dynamic
spectral features and model covariance matrices. Therefore, in order to achieve higher consistency,
adaptation of a model's dynamic spectral features and its covariance matrices is necessary. This
adaptation procedure proceeds as follows. When the t-th frame y_t of a testing utterance Y is
evaluated on the state Λ_{w,i}, the cepstral vectors c_{t,j} of its context frames y_{t,j}, for −2 ≤ j ≤ 2,
are first transformed to give the corresponding normalized autocorrelation vectors r_{t,j}. Then, these
normalized autocorrelation vectors r_{t,j} = [ r_{t,j}(1), ..., r_{t,j}(p) ]^T are processed by the following
arcsin transformation:

r̃_{t,j}(τ) = sin^{-1}[ r_{t,j}(τ) / (1 + δ(SNR_{t,j})) ] / sin^{-1}[ 1 / (1 + δ(SNR_{t,j})) ],   for −2 ≤ j ≤ 2 and 1 ≤ τ ≤ p.   (15)
In the above equation, the variable SNR_{t,j} is determined by

SNR_{t,j} = 10 · log10( (E_{t+j} − E_n) / E_n ),   (16)

where E_t is the t-th frame energy of the testing utterance Y, and E_n is the noise energy, which can be
roughly estimated by selecting the lowest frame energy in the testing utterance Y, i.e.,
E_n = min{ E_1, E_2, ..., E_{T_y} }. Once the smoothed autocorrelation vectors r̃_{t,j}, for −2 ≤ j ≤ 2, are
obtained, the smoothed testing cepstral vector c̃_{t,j} of ỹ_{t,j} can be calculated by means of the LPC to