Page 1: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Yusuke Kida and Tatsuya Kawahara

School of Informatics, Kyoto University,

Sakyo-ku, Kyoto 606-8501, Japan

Presenter: Chen, Hung-Bin

INTERSPEECH 2006 - ICSLP

Page 2: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Outline

• Introduction

• Weighted Combination of VAD Methods
– Features and Methods for VAD

• Weight Optimization Using MCE Training

• Experiment

• Conclusion

Page 3: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Introduction

• Voice activity detection (VAD) is a vital front end for automatic speech recognition (ASR) systems, especially for robust operation in noisy environments.
– If speech segments are not detected correctly, the subsequent recognition processes are often meaningless.

• However, there are a variety of noise conditions and no single method is expected to cope with all of them.

• In order to realize VAD robust against various kinds of noise, we have proposed a combination of multiple features.

Page 4: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The framework of our VAD system is shown in Figure 1.

• Four features are calculated:
– amplitude level, ZCR, spectral information, and GMM likelihood.

• The features are shown as f(1), · · · , f(4) in the figure, and they are combined with weights w1, · · · , w4.

Page 5: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Amplitude level
– Amplitude level is one of the most common features in VAD methods and is used in many applications.
– The amplitude level at the t-th frame, E_t, is computed as the logarithm of the signal energy:
• for N-length Hamming-windowed speech samples s_n,

$$E_t = \log \sum_{n=1}^{N} s_n^2$$

– Then, the feature used in the combination is calculated as the ratio of the amplitude level of the input frame to that of the noise:

$$f_t^{(1)} = \frac{E_t}{E_n}$$

• where E_n denotes the amplitude level of the noise
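As a rough illustration (not code from the paper), the amplitude-level feature above could be computed as follows; the frame array, the small log floor, and the precomputed noise log-energy E_n are assumptions of this sketch.

```python
import numpy as np

def log_energy(frame: np.ndarray) -> float:
    """E_t: log of the energy of a Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    return float(np.log(np.sum(windowed ** 2) + 1e-12))  # small floor avoids log(0)

def amplitude_feature(frame: np.ndarray, noise_log_energy: float) -> float:
    """f_t^(1) = E_t / E_n, the ratio of frame log-energy to noise log-energy."""
    return log_energy(frame) / noise_log_energy
```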

Page 6: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Zero crossing rate (ZCR)
– Zero crossing rate (ZCR) is the number of times the signal level crosses 0 during a fixed period of time.
– Similarly to the amplitude level, a ratio of the input frame to the noise is used for this feature. The feature is calculated as follows:

$$f_t^{(2)} = \frac{Z_t}{Z_n}$$

• where Z_t denotes the ZCR of the input frame, and Z_n denotes that of the noise
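In the same spirit, a minimal sketch of the ZCR feature (again an illustration; the zero-handling convention is an assumption):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> int:
    """Z_t: number of sign changes of the signal within the frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                     # treat exact zeros as positive
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def zcr_feature(frame: np.ndarray, noise_zcr: float) -> float:
    """f_t^(2) = Z_t / Z_n, the ratio of the frame ZCR to the noise ZCR."""
    return zero_crossing_rate(frame) / noise_zcr
```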

Page 7: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Spectral information
– As shown in the figure, we partition the frequency domain into several channels and calculate the signal-to-noise ratio (SNR) for each channel.
– Then, we compute the average of the per-channel SNR values as the spectral information.

Page 8: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Spectral information
– The spectral information feature is defined as

$$f_t^{(3)} = \frac{1}{B} \sum_{b=1}^{B} 10 \log_{10} \frac{S_{tb}^2}{N_b^2}$$

• where B denotes the number of channels
• The terms S_{tb} and N_b indicate the average intensity within channel b for the speech and the noise, respectively.
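A possible implementation of this feature is sketched below; the number of channels, the uniform band split, and the precomputed noise band powers are assumptions, since the exact channel layout is not given on the slides.

```python
import numpy as np

def spectral_feature(frame: np.ndarray, noise_band_power: np.ndarray,
                     num_bands: int = 8) -> float:
    """f_t^(3): average per-channel SNR in dB.

    noise_band_power holds the average noise power N_b^2 for each channel.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(spectrum, num_bands)              # uniform channels (assumption)
    band_power = np.array([band.mean() for band in bands])   # S_tb^2 per channel
    snr_db = 10.0 * np.log10((band_power + 1e-12) / (noise_band_power + 1e-12))
    return float(snr_db.mean())
```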

Page 9: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• GMM likelihood
– A log-likelihood ratio of the speech GMM to the noise GMM for input frames is used as the GMM feature.
– The feature is calculated as

$$f_t^{(4)} = \log p(x_t \mid \lambda_s) - \log p(x_t \mid \lambda_n)$$

• where λ_s and λ_n denote the model parameter sets of the GMMs for speech and noise, respectively
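As a sketch of how this ratio might be computed with an off-the-shelf toolkit (scikit-learn is an assumption here; the slides do not specify an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_feature(x_t: np.ndarray, speech_gmm: GaussianMixture,
                noise_gmm: GaussianMixture) -> float:
    """f_t^(4) = log p(x_t | lambda_s) - log p(x_t | lambda_n).

    Both GMMs are assumed to be already fitted, e.g. 32 diagonal-covariance
    components trained with EM as described on the experiment slides.
    """
    x = x_t.reshape(1, -1)                 # score_samples expects a 2-D array
    return float(speech_gmm.score_samples(x)[0] - noise_gmm.score_samples(x)[0])
```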

Page 10: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The combined score of data frame x_t (t: frame number) is defined as follows:

$$F(x_t) = \sum_{k=1}^{4} w_k f^{(k)}(x_t)$$

– where K denotes the number of combined features (K = 4 here)
• The weights w_k must satisfy the following conditions:

$$\sum_{k=1}^{K} w_k = 1, \qquad w_k > 0$$

– where the initial weights are all set equal

Page 11: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The two discriminative functions judge whether each frame is speech or noise:

$$g_s(x_t) = F(x_t) - \theta, \qquad g_n(x_t) = \theta - F(x_t)$$

– where θ denotes the threshold value of the combined score
• Data x_t is regarded as a speech frame if the discriminative function of speech, g_s(x_t), is larger than that of noise, g_n(x_t); otherwise, x_t is regarded as a non-speech frame:

if g_s(x) > g_n(x), x is a speech frame
else, x is a non-speech frame
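Putting the combination and the decision rule together, a minimal sketch (the feature values, weights, and threshold below are illustrative, not values from the paper):

```python
import numpy as np

def combined_score(features: np.ndarray, weights: np.ndarray) -> float:
    """F(x_t) = sum_k w_k f^(k)(x_t), with weights positive and summing to 1."""
    return float(np.dot(weights, features))

def is_speech_frame(features: np.ndarray, weights: np.ndarray, theta: float) -> bool:
    """Speech iff g_s(x_t) > g_n(x_t), which reduces to F(x_t) > theta."""
    return combined_score(features, weights) > theta

# Example: equal initial weights (0.25 each) and a hypothetical threshold.
frame_features = np.array([1.2, 0.9, 6.5, 3.1])   # [f1, f2, f3, f4], made-up values
weights = np.full(4, 0.25)
print(is_speech_frame(frame_features, weights, theta=1.0))
```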

Page 12: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Optimization Using MCE Training

• To adapt our VAD scheme to noisy environments, we applied MCE training to the optimization of the weights.

• Misclassification measure
– For the MCE training, the misclassification measure of training data frame x_t is defined as

$$d_k(x_t) = -g_k(x_t) + g_m(x_t)$$

– where k denotes the true cluster and m indicates the other cluster
• Loss function
– The loss function is defined as a differentiable sigmoid function approximating the 0-1 step loss function:

$$l_k(x_t) = \bigl(1 + \exp(-\gamma\, d_k(x_t))\bigr)^{-1}$$

– where γ denotes the gradient (steepness) of the sigmoid function

Page 13: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• During the weight adjustment in the MCE training, the weight set w is transformed into a new set w̃ because of the constraint (w_k > 0):

$$\tilde{w}_k = \log w_k, \qquad \tilde{\mathbf{w}} = (\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_K)$$

• The weight adjustment is defined as:

$$\tilde{\mathbf{w}}(t+1) = \tilde{\mathbf{w}}(t) - \epsilon_t \nabla l_k(x_t)$$

– where ε_t is a monotonically decreasing learning step size

Page 14: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• The gradient of the weight adjustment equation is obtained by the chain rule. The resulting per-weight update is

$$\tilde{w}_l(t+1) = \tilde{w}_l(t) - \epsilon_t\, \gamma\, l_k(1 - l_k)\,(+1 \text{ or } -1)\; w_l \bigl(f^{(l)}(x_t) - F(x_t)\bigr)$$

where

$$\frac{\partial l_k}{\partial \tilde{w}_l} = \frac{\partial l_k}{\partial d_k} \frac{\partial d_k}{\partial \tilde{w}_l}, \qquad
\frac{\partial l_k}{\partial d_k} = \gamma \frac{\exp(-\gamma d_k)}{\bigl(1 + \exp(-\gamma d_k)\bigr)^{2}} = \gamma\, l_k (1 - l_k),$$

$$\frac{\partial d_k}{\partial g_j} = \begin{cases} -1, & j = k \\ +1, & j \ne k \end{cases}, \qquad
\frac{\partial g_j}{\partial \tilde{w}_l} = \pm \frac{\partial F(x_t)}{\partial \tilde{w}_l} = \pm\, w_l \bigl(f^{(l)}(x_t) - F(x_t)\bigr),$$

and the sign (+1 or −1) is determined by whether the true cluster k is speech or noise, since g_s(x_t) = F(x_t) − θ and g_n(x_t) = θ − F(x_t).

Page 15: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• After w̃ is updated
– w̃ is returned to w, and the weights are normalized:

$$w_k = \frac{\exp(\tilde{w}_k)}{\sum_{l=1}^{K} \exp(\tilde{w}_l)}$$
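The whole MCE step of slides 12-15 could be sketched as one update function. This is only an illustration under the reconstruction above (softmax-normalized weights, sigmoid loss); the constant factor from differentiating d_k is folded into the step size, so it is not a faithful reproduction of the paper's exact update.

```python
import numpy as np

def mce_update(log_w: np.ndarray, features: np.ndarray, is_speech: bool,
               theta: float, gamma: float, step: float) -> np.ndarray:
    """One MCE step on the log-transformed weights w~ (sketch of slides 12-15)."""
    w = np.exp(log_w) / np.sum(np.exp(log_w))            # weights on the simplex
    F = float(np.dot(w, features))                        # combined score F(x_t)
    g_s, g_n = F - theta, theta - F                       # discriminant functions
    d = (-g_s + g_n) if is_speech else (-g_n + g_s)       # misclassification measure d_k
    loss = 1.0 / (1.0 + np.exp(-gamma * d))               # sigmoid loss l_k(x_t)

    # Chain rule: dl/dw~_l = gamma*l*(1-l) * (dd/dF) * w_l*(f^(l) - F);
    # the factor of 2 in dd/dF can be absorbed into `step`.
    dd_dF = -2.0 if is_speech else 2.0
    grad = gamma * loss * (1.0 - loss) * dd_dF * w * (features - F)

    new_log_w = log_w - step * grad                       # gradient descent on w~
    return new_log_w - np.log(np.sum(np.exp(new_log_w)))  # renormalize (slide 15)

# Example: adapt equal initial weights on one labelled frame (made-up values).
log_w = np.log(np.full(4, 0.25))
log_w = mce_update(log_w, np.array([1.2, 0.9, 6.5, 3.1]),
                   is_speech=True, theta=1.0, gamma=1.0, step=0.1)
print(np.exp(log_w))  # updated weights, still summing to 1
```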

Page 16: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experiment

• Testing set
– Speech data from ten speakers were used.
– Ten utterances were used for testing for each speaker.
• Each utterance lasted a few seconds, and three-second pauses were inserted between them.
– Noisy data
• The noises of a sensor room, a machine, and background speech were added to the clean speech data at SNRs of 10 and 15 dB.
– In total, we had 600 samples (= 3 noise types × 2 SNRs × 10 speakers × 10 utterances) as the test set.
• Training set
– A different set of ten utterances, whose text differs from that of the test set, was used for the weight training for each condition.

Page 17: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experiment

• Frame length
– 100 ms for amplitude level and ZCR
– 250 ms for spectral information and GMM likelihood
– The frame shift was 250 ms for each feature.
• Noise features, such as the amplitude level E_n, the ZCR Z_n, and the spectral intensities N_b, were calculated using the first second of the speech data.
• GMM likelihood
– A 32-component GMM with diagonal covariance matrices was used to model speech and noise.
– The JNAS (Japanese Newspaper Article Sentences) corpus, which includes 306 speakers and about 32,000 utterances, was used to train the speech GMM parameters with the EM algorithm.

Page 18: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Evaluation

• Evaluation measures
– The frame-based false alarm rate (FAR) and false rejection rate (FRR) were used as evaluation measures.
– FAR is the percentage of non-speech frames incorrectly classified as speech.
– FRR is the percentage of speech frames incorrectly classified as non-speech.
– The equal error rate (EER) is obtained by adjusting the decision threshold until the false alarm rate and the false rejection rate are equal; this common value is referred to as the EER.
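These measures are straightforward to compute once frame-level scores and reference labels are available; the sketch below (with a simple threshold sweep for the EER) is illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def far_frr(scores: np.ndarray, labels: np.ndarray, theta: float) -> tuple:
    """Frame-based FAR and FRR at threshold theta.

    scores: combined scores F(x_t); labels: 1 for speech frames, 0 for non-speech.
    """
    decisions = scores > theta
    far = float(np.mean(decisions[labels == 0]))   # non-speech classified as speech
    frr = float(np.mean(~decisions[labels == 1]))  # speech classified as non-speech
    return far, frr

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Sweep candidate thresholds and return the rate where FAR and FRR are closest."""
    best_gap, best_eer = None, None
    for theta in np.unique(scores):
        far, frr = far_frr(scores, labels, theta)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```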

Page 19: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• The equal error rate (EER) under each noise type, where the SNR was 10 dB, is shown in the table.

Before the training, the weights are all set equal (= 0.25).

Page 20: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• These figures compare our proposed method to the individual methods we combined.
– The horizontal axis corresponds to the FAR, and the vertical axis corresponds to the FRR.

Page 21: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Application and Evaluation in ASR

• For evaluation in ASR, we collected 1345 utterances from the same ten speakers and made a test set by adding the same three types of noise at SNRs of 5, 10, and 15 dB.
– Thus, we have 12105 samples (= 3 noise types × 3 SNRs × 1345 utterances).

• The acoustic model is a phonetic tied-mixture (PTM) triphone model based on multicondition training.

• The recognition task is simple conversation with a robot.

• A finite state automaton grammar is handcrafted with a vocabulary of 865 words.


Page 22: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• Tables 4∼6 show the ASR performance in word accuracy.


AC: air conditioner, CM: craft machine, BS: background speech

Page 23: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Conclusion

• This paper presented a robust VAD method that adaptively combines four different features.

• The proposed method achieves significantly better performance than the conventional individual techniques.

• It is also shown that the weight adaptation is possible with only one utterance and is as reliable as closed training.