Page 1: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Yusuke Kida and Tatsuya Kawahara

School of Informatics, Kyoto University,

Sakyo-ku, Kyoto 606-8501, Japan

Presenter: Chen, Hung-Bin

INTERSPEECH 2006 - ICSLP

Page 2: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Outline

• Introduction

• Weighted Combination of VAD Methods
– Features and Methods for VAD

• Weight Optimization Using MCE Training

• Experiment

• Conclusion

Page 3: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Introduction

• Voice activity detection (VAD) is a vital front end for automatic speech recognition (ASR) systems, especially for robust operation in noisy environments.
– If speech segments are not detected correctly, the subsequent recognition processes are often meaningless.

• However, there are a variety of noise conditions and no single method is expected to cope with all of them.

• In order to realize VAD robust against various kinds of noise, we have proposed a combination of multiple features.

Page 4: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The framework of our VAD system is shown in Figure 1.

• Four features are calculated:
– amplitude level, ZCR, spectral information, and GMM likelihood.

• The features are shown as f(1), · · · , f(4) in the figure, and they are combined with weights w1, · · · , w4.

Page 5: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Amplitude level
– Amplitude level is one of the most common features in VAD methods and is used in many applications.
– The amplitude level at the t-th frame, E_t, is computed as the logarithm of the signal energy:
• for N-length Hamming-windowed speech samples s_n,

$$E_t = \log \sum_{n=1}^{N} s_n^2$$

– Then, the feature used in the combination is calculated as the ratio of the amplitude level of the input frame to that of the noise:

$$f_t^{(1)} = \frac{E_t}{E_n}$$

• where E_n denotes the amplitude level of the noise
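As a rough illustration (not code from the paper), the amplitude-level feature above could be computed as follows; the frame array, the small log floor, and the precomputed noise log-energy E_n are assumptions of this sketch.

```python
import numpy as np

def log_energy(frame: np.ndarray) -> float:
    """E_t: log of the energy of a Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    return float(np.log(np.sum(windowed ** 2) + 1e-12))  # small floor avoids log(0)

def amplitude_feature(frame: np.ndarray, noise_log_energy: float) -> float:
    """f_t^(1) = E_t / E_n, the ratio of frame log-energy to noise log-energy."""
    return log_energy(frame) / noise_log_energy
```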

Page 6: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Zero crossing rate (ZCR)
– Zero crossing rate (ZCR) is the number of times the signal level crosses 0 during a fixed period of time.
– Similarly to the amplitude level, a ratio of the input frame to the noise is used for this feature. The feature is calculated as follows:

$$f_t^{(2)} = \frac{Z_t}{Z_n}$$

• where Z_t denotes the ZCR of the input frame, and Z_n denotes that of the noise
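In the same spirit, a minimal sketch of the ZCR feature (again an illustration; the zero-handling convention is an assumption):

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> int:
    """Z_t: number of sign changes of the signal within the frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                     # treat exact zeros as positive
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def zcr_feature(frame: np.ndarray, noise_zcr: float) -> float:
    """f_t^(2) = Z_t / Z_n, the ratio of the frame ZCR to the noise ZCR."""
    return zero_crossing_rate(frame) / noise_zcr
```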

Page 7: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Spectral information
– As shown in the figure, we partition the frequency domain into several channels and calculate the signal-to-noise ratio (SNR) for each channel.
– Then, we compute the average of the per-channel SNR values as the spectral information.

Page 8: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• Spectral information
– The spectral information feature is defined as

$$f_t^{(3)} = \frac{1}{B} \sum_{b=1}^{B} 10 \log_{10} \frac{S_{tb}^2}{N_b^2}$$

• where B denotes the number of channels
• The terms S_{tb} and N_b indicate the average intensity within channel b for the speech and the noise, respectively.
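A possible implementation of this feature is sketched below; the number of channels, the uniform band split, and the precomputed noise band powers are assumptions, since the exact channel layout is not given on the slides.

```python
import numpy as np

def spectral_feature(frame: np.ndarray, noise_band_power: np.ndarray,
                     num_bands: int = 8) -> float:
    """f_t^(3): average per-channel SNR in dB.

    noise_band_power holds the average noise power N_b^2 for each channel.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(spectrum, num_bands)              # uniform channels (assumption)
    band_power = np.array([band.mean() for band in bands])   # S_tb^2 per channel
    snr_db = 10.0 * np.log10((band_power + 1e-12) / (noise_band_power + 1e-12))
    return float(snr_db.mean())
```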

Page 9: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Features and Methods for VAD

• GMM likelihood
– A log-likelihood ratio of the speech GMM to the noise GMM for input frames is used as the GMM feature.
– The feature is calculated as

$$f_t^{(4)} = \log p(x_t \mid \lambda_s) - \log p(x_t \mid \lambda_n)$$

• where λ_s and λ_n denote the model parameter sets of the GMMs for speech and noise, respectively
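As a sketch of how this ratio might be computed with an off-the-shelf toolkit (scikit-learn is an assumption here; the slides do not specify an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_feature(x_t: np.ndarray, speech_gmm: GaussianMixture,
                noise_gmm: GaussianMixture) -> float:
    """f_t^(4) = log p(x_t | lambda_s) - log p(x_t | lambda_n).

    Both GMMs are assumed to be already fitted, e.g. 32 diagonal-covariance
    components trained with EM as described on the experiment slides.
    """
    x = x_t.reshape(1, -1)                 # score_samples expects a 2-D array
    return float(speech_gmm.score_samples(x)[0] - noise_gmm.score_samples(x)[0])
```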

Page 10: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The combined score of data frame x_t (t: frame number) is defined as follows:

$$F(x_t) = \sum_{k=1}^{4} w_k f^{(k)}(x_t)$$

– where K denotes the number of combined features (K = 4 here)
• The weights w_k must satisfy the following conditions:

$$\sum_{k=1}^{K} w_k = 1, \qquad w_k > 0$$

– where the initial weights are all set equal

Page 11: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weighted Combination of VAD Methods

• The two discriminative functions judge whether each frame is speech or noise:

$$g_s(x_t) = F(x_t) - \theta, \qquad g_n(x_t) = \theta - F(x_t)$$

– where θ denotes the threshold value of the combined score
• Data x_t is regarded as a speech frame if the discriminative function of speech, g_s(x_t), is larger than that of noise, g_n(x_t); otherwise, x_t is regarded as a non-speech frame:

if g_s(x) > g_n(x), x is a speech frame
else, x is a non-speech frame
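Putting the combination and the decision rule together, a minimal sketch (the feature values, weights, and threshold below are illustrative, not values from the paper):

```python
import numpy as np

def combined_score(features: np.ndarray, weights: np.ndarray) -> float:
    """F(x_t) = sum_k w_k f^(k)(x_t), with weights positive and summing to 1."""
    return float(np.dot(weights, features))

def is_speech_frame(features: np.ndarray, weights: np.ndarray, theta: float) -> bool:
    """Speech iff g_s(x_t) > g_n(x_t), which reduces to F(x_t) > theta."""
    return combined_score(features, weights) > theta

# Example: equal initial weights (0.25 each) and a hypothetical threshold.
frame_features = np.array([1.2, 0.9, 6.5, 3.1])   # [f1, f2, f3, f4], made-up values
weights = np.full(4, 0.25)
print(is_speech_frame(frame_features, weights, theta=1.0))
```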

Page 12: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Optimization Using MCE Training

• To adapt our VAD scheme to noisy environments, we applied MCE training to the optimization of the weights.

• Misclassification measure
– For the MCE training, the misclassification measure of training data frame x_t is defined as

$$d_k(x_t) = -g_k(x_t) + g_m(x_t)$$

– where k denotes the true cluster and m indicates the other cluster
• Loss function
– The loss function is defined as a differentiable sigmoid function approximating the 0-1 step loss function:

$$l_k(x_t) = \bigl(1 + \exp(-\gamma\, d_k(x_t))\bigr)^{-1}$$

– where γ denotes the gradient (steepness) of the sigmoid function

Page 13: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• During the weight adjustment in the MCE training, the weight set w is transformed into a new set w̃ because of the constraint (w_k > 0):

$$\tilde{w}_k = \log w_k, \qquad \tilde{\mathbf{w}} = (\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_K)$$

• The weight adjustment is defined as:

$$\tilde{\mathbf{w}}(t+1) = \tilde{\mathbf{w}}(t) - \epsilon_t \nabla l_k(x_t)$$

– where ε_t is a monotonically decreasing learning step size

Page 14: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• The gradient of the weight adjustment equation is obtained by the chain rule. The resulting per-weight update is

$$\tilde{w}_l(t+1) = \tilde{w}_l(t) - \epsilon_t\, \gamma\, l_k(1 - l_k)\,(+1 \text{ or } -1)\; w_l \bigl(f^{(l)}(x_t) - F(x_t)\bigr)$$

where

$$\frac{\partial l_k}{\partial \tilde{w}_l} = \frac{\partial l_k}{\partial d_k} \frac{\partial d_k}{\partial \tilde{w}_l}, \qquad
\frac{\partial l_k}{\partial d_k} = \gamma \frac{\exp(-\gamma d_k)}{\bigl(1 + \exp(-\gamma d_k)\bigr)^{2}} = \gamma\, l_k (1 - l_k),$$

$$\frac{\partial d_k}{\partial g_j} = \begin{cases} -1, & j = k \\ +1, & j \ne k \end{cases}, \qquad
\frac{\partial g_j}{\partial \tilde{w}_l} = \pm \frac{\partial F(x_t)}{\partial \tilde{w}_l} = \pm\, w_l \bigl(f^{(l)}(x_t) - F(x_t)\bigr),$$

and the sign (+1 or −1) is determined by whether the true cluster k is speech or noise, since g_s(x_t) = F(x_t) − θ and g_n(x_t) = θ − F(x_t).

Page 15: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Weight Adjustment

• After w̃ is updated
– w̃ is returned to w, and the weights are normalized:

$$w_k = \frac{\exp(\tilde{w}_k)}{\sum_{l=1}^{K} \exp(\tilde{w}_l)}$$
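The whole MCE step of slides 12-15 could be sketched as one update function. This is only an illustration under the reconstruction above (softmax-normalized weights, sigmoid loss); the constant factor from differentiating d_k is folded into the step size, so it is not a faithful reproduction of the paper's exact update.

```python
import numpy as np

def mce_update(log_w: np.ndarray, features: np.ndarray, is_speech: bool,
               theta: float, gamma: float, step: float) -> np.ndarray:
    """One MCE step on the log-transformed weights w~ (sketch of slides 12-15)."""
    w = np.exp(log_w) / np.sum(np.exp(log_w))            # weights on the simplex
    F = float(np.dot(w, features))                        # combined score F(x_t)
    g_s, g_n = F - theta, theta - F                       # discriminant functions
    d = (-g_s + g_n) if is_speech else (-g_n + g_s)       # misclassification measure d_k
    loss = 1.0 / (1.0 + np.exp(-gamma * d))               # sigmoid loss l_k(x_t)

    # Chain rule: dl/dw~_l = gamma*l*(1-l) * (dd/dF) * w_l*(f^(l) - F);
    # the factor of 2 in dd/dF can be absorbed into `step`.
    dd_dF = -2.0 if is_speech else 2.0
    grad = gamma * loss * (1.0 - loss) * dd_dF * w * (features - F)

    new_log_w = log_w - step * grad                       # gradient descent on w~
    return new_log_w - np.log(np.sum(np.exp(new_log_w)))  # renormalize (slide 15)

# Example: adapt equal initial weights on one labelled frame (made-up values).
log_w = np.log(np.full(4, 0.25))
log_w = mce_update(log_w, np.array([1.2, 0.9, 6.5, 3.1]),
                   is_speech=True, theta=1.0, gamma=1.0, step=0.1)
print(np.exp(log_w))  # updated weights, still summing to 1
```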

Page 16: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experiment

• Testing set
– Speech data from ten speakers were used.
– Ten utterances were used for testing for each speaker.
• Each utterance lasted a few seconds, and three-second pauses were inserted between them.
– Noisy data
• The noises of a sensor room, a machine, and background speech were added to the clean speech data at SNRs of 10 and 15 dB.
– In total, we had 600 samples (= 3 noise types × 2 SNRs × 10 speakers × 10 utterances) as the test set.
• Training set
– A different set of ten utterances, whose text differs from that of the test set, was used for the weight training for each condition.

Page 17: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experiment

• Frame length
– 100 ms for amplitude level and ZCR
– 250 ms for spectral information and GMM likelihood
– The frame shift was 250 ms for each feature.
• Noise features, such as the amplitude level E_n, the ZCR Z_n, and the spectral intensities N_b, were calculated using the first second of the speech data.
• GMM likelihood
– A 32-component GMM with diagonal covariance matrices was used to model speech and noise.
– The JNAS (Japanese Newspaper Article Sentences) corpus, which includes 306 speakers and about 32,000 utterances, was used to train the speech GMM parameters with the EM algorithm.

Page 18: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Evaluation

• Evaluation measures
– The frame-based false alarm rate (FAR) and false rejection rate (FRR) were used as evaluation measures.
– FAR is the percentage of non-speech frames incorrectly classified as speech.
– FRR is the percentage of speech frames incorrectly classified as non-speech.
– The equal error rate (EER) is obtained by adjusting the decision threshold until the false alarm rate and the false rejection rate are equal; this common value is referred to as the EER.
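These measures are straightforward to compute once frame-level scores and reference labels are available; the sketch below (with a simple threshold sweep for the EER) is illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def far_frr(scores: np.ndarray, labels: np.ndarray, theta: float) -> tuple:
    """Frame-based FAR and FRR at threshold theta.

    scores: combined scores F(x_t); labels: 1 for speech frames, 0 for non-speech.
    """
    decisions = scores > theta
    far = float(np.mean(decisions[labels == 0]))   # non-speech classified as speech
    frr = float(np.mean(~decisions[labels == 1]))  # speech classified as non-speech
    return far, frr

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Sweep candidate thresholds and return the rate where FAR and FRR are closest."""
    best_gap, best_eer = None, None
    for theta in np.unique(scores):
        far, frr = far_frr(scores, labels, theta)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```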

Page 19: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• The equal error rate (EER) under each noise type, where the SNR was 10 dB, is shown in the table.

Before the training, the weights are all set equal (= 0.25).

Page 20: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• These figures compare our proposed method to the individual methods we combined.
– The horizontal axis corresponds to the FAR, and the vertical axis corresponds to the FRR.

Page 21: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Application and Evaluation in ASR

• For evaluation in ASR, we collected 1345 utterances from the same ten speakers and made a test set by adding the same three types of noise at SNRs of 5, 10, and 15 dB.
– Thus, we have 12105 samples (= 3 noise types × 3 SNRs × 1345 utterances).

• The acoustic model is a phonetic tied-mixture (PTM) triphone model based on multicondition training.

• The recognition task is simple conversation with a robot.

• A finite state automaton grammar is handcrafted with a vocabulary of 865 words.


Page 22: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Experimental Results

• Tables 4∼6 show the ASR performance in word accuracy.


AC: air conditioner, CM: craft machine, BS: background speech

Page 23: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features

Conclusion

• This paper presented a robust VAD method that adaptively combines four different features.

• The proposed method achieves significantly better performance than the conventional individual techniques.

• It is also shown that the weight adaptation is possible with only one utterance and is as reliable as closed training.