Source: shodhganga.inflibnet.ac.in/bitstream/10603/26451/9/09_chapter4.pdf

CHAPTER 4

VOICE ACTIVITY DETECTION ALGORITHMS

4.1 INTRODUCTION

New frontiers of speech technology are demanding increased levels of performance in many areas. With the advent of wireless communications, new speech services are becoming a reality through the development of modern robust speech processing technology. Many researchers have discussed the ill effects of environmental noise on the performance of speech processing systems. Abhijeet Sangwan et al (2002) discussed the desirable aspects of Voice Activity Detection (VAD) algorithms: a good decision rule, adaptability to background noise and low computational complexity for estimating the noise spectrum.

Background noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition, and authentication (Isrel 2003). Digital voice systems will be used in a variety of environments, and their performance must be maintained at a level near that measured using noise-free input speech. To ensure continued reliability, the effects of background noise can be reduced by internal modification of the voice processor algorithms to explicitly compensate for signal contamination, by preprocessor noise reduction, or by noise-cancelling microphones.

Khaled et al (1997) observed that high-energy voiced speech segments are always detected by all VADs, even under very noisy conditions such as car, bus, babble, and street noise. However, low-energy unvoiced speech is commonly missed. The background noise that contaminates the signal results in segments that contain either noise only or speech plus noise. The VAD developed by Javier Ramirez et al (2005) defines an effective endpoint detection algorithm that employs noise reduction techniques and order statistic filters for the formulation of the decision rule. The VAD performs an advanced detection of word beginnings and a delayed detection of word endings which, in part, avoids the inclusion of additional hangover schemes. In addition, the VAD provides speech / nonspeech discrimination. It has been observed that low-energy portions of speech are the first to be falsely rejected, so a hangover scheme is required to lower the probability of false rejections (Alan Davis et al 2006).

Robustness can be achieved by extracting robust features in the front-end and/or by adapting the reference to the noise situation. Noise signals are selected to represent the most probable application scenarios for telecommunication terminals. Some noises are fairly stationary, such as the car noise and the recording in the exhibition hall. Other noises contain non-stationary segments, such as the recordings on the street and at the airport.

A fast noise estimation algorithm proposed by Sundarrajan Rangachari et al (2004) resulted in good performance for a single sentence. The noise estimate was found by averaging past spectral power values using a smoothing parameter that was adjusted by the signal presence probability in subbands.

A novel VAD algorithm developed by Dong Kook Kim et al (2007) is based on the Gaussian distribution and the uniformly most powerful (UMP) test to detect speech or nonspeech in the noisy input signal. This method provides the decision rule by comparing the magnitude of the noisy speech signal to an adaptive threshold estimated from the noise statistics.

A conditional Maximum a posteriori (MAP) criterion decides the hypothesis with the maximum conditional probability given both the observation and the voice activity in the previous frame. This criterion leads to two separate thresholds for the Likelihood Ratio Test (LRT), depending on the VAD result of the previous frame, as discussed by Jong Won Shin et al (2008).

Several VAD algorithms have been proposed for detecting the

voiced / unvoiced region (Boll Steven et al 1980, Dhananjaya et al 2010, Falk

Tiago et al 2006, Haitian Xu et al 2007, Jongseo Sohn et al 1999, Juan

Manuel Gorriz et al 2008, Matteo Gerosa et al 2007, Plante et al 1998, Qi Li

et al 2002, Richard et al 2000, Yutaka Kaneda et al 1986, Zenton Goh et al

1999, Zhong Lin et al 2007).

In this chapter, the Voice Activity Detection (VAD) developed by Ramirez et al (2005) is presented along with the noise estimation algorithm discussed in Sundarrajan Rangachari et al (2004) and Abhijeet Sangwan et al (2002). Several VAD algorithms are studied, namely Zero Crossing Detection (ZCD), Weak Fricative Detection (WFD), Pitch Based Detection (PBD), Energy Based Detection (EBD) and the Subband Order Statistics Filter (OSF), and their performance for Automatic Speech Recognition (ASR) is compared in the presence of different types of noise: suburban train noise, babble, car, exhibition hall, restaurant, street, airport and train-station noise.

4.2 VOICE ACTIVITY DETECTION ALGORITHMS

Voice Activity Detection (VAD) is the process of discriminating speech from silence or other background noise. VAD algorithms are based on any combination of general speech properties such as temporal energy variations, periodicity, and spectrum. The detection task is not as trivial as it appears, since an increasing level of background noise degrades the effectiveness of the classifier. The VAD indicates the presence or absence of speech, as observed by Ramirez (2004).

The signal is differentiated into speech or silence based on speech characteristics. The signal is sliced into contiguous frames, and a real-valued nonnegative parameter is associated with each frame. If this parameter exceeds a certain threshold, the frame is classified as active; otherwise it is classified as inactive.

The basic principle of a VAD device is that it extracts some measured features or quantities from the input signal and then compares these values with thresholds. Voice activity (VAD=1) is declared if the measured value exceeds the threshold. Otherwise, no speech activity (VAD=0) is declared. In general, a VAD algorithm outputs a binary decision on a frame-by-frame basis, where a frame of the input signal is a short unit of time such as 20-40 ms.

The following are some of the required features of a good VAD algorithm:

(i) Good Decision Rule: A physical property of speech that can be exploited to give consistent and accurate judgment in classifying segments of the signal into silence or otherwise.

(ii) Adaptability to Background Noise: Adapting to non-stationary background noise improves the robustness, especially in wireless telephony.

(iii) Low Computational Complexity: The complexity of the VAD algorithm must be low to suit real-time applications.

A tree diagram that represents the classification techniques for VAD algorithms is shown in Figure 4.1.

Figure 4.1 Tree diagram for VAD Algorithms
(Tree: VAD Algorithms branch into Parameter Based and Frequency Based. Parameter Based: Thresholding: ZCD; Linear Variance: EBD; Segmentation: PBD. Frequency Based: Transform (Power Spectral Density): WFD, Subband OSF.)

VAD algorithms are classified into two types:

(i) Parameter Based VAD Algorithms and

(ii) Frequency Based VAD Algorithms.

Parameter Based VAD Algorithms are further classified into three types:

(i) Zero Crossing Detector, which is based on thresholding.

(ii) Energy Based Detection, which is implemented through Linear Variance.

(iii) Pitch Based Detection, through Segmentation.

Frequency Based VAD Algorithms consist of the Weak Fricatives Detector and the Subband Order Statistics Filter, both of which fall under the Transformation method.

4.2.1 Zero Crossing Detector (ZCD)

The zero crossing count is defined as the number of times in a sound sample that the amplitude of the sound wave changes sign, that is, the number of times the signal crosses the line of no disturbance or zero line (Abhijeet Sangwan et al 2002). The number of zero crossings for a voice signal lies in a fixed range: for a 10 ms duration, it lies between 5 and 15. The number of zero crossings for noise is random and unpredictable. This observation makes it possible to formulate a decision rule that is independent of energy and hence able to detect some low energy phonemes.

If $N_{ZC} \in R$, Frame is ACTIVE
else Frame is INACTIVE (4.1)

where $N_{ZC}$ is the number of zero crossings detected in frame $f_j$ and $R = \{5, \ldots, 15\}$ is the range of zero crossing counts for a speech duration of 10 ms.
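As an illustration, the decision rule of equation (4.1) can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in this work: the function names, the non-overlapping framing and the 500 Hz test tone sampled at 8 kHz are assumptions made only for the example.

```python
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def zcd_vad(signal, fs, frame_ms=10, zc_range=(5, 15)):
    """Return one flag per non-overlapping frame: True means ACTIVE.

    A frame is ACTIVE when its zero crossing count lies inside zc_range,
    the range quoted in the text for 10 ms of speech.
    """
    n = int(fs * frame_ms / 1000)          # samples per frame
    lo, hi = zc_range
    return [lo <= zero_crossings(signal[s:s + n]) <= hi
            for s in range(0, len(signal) - n + 1, n)]

# Hypothetical test input: 20 ms of a 500 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(math.pi * k / 8 + 0.1) for k in range(160)]
print(zcd_vad(tone, fs))
```

Because the rule looks only at sign changes, it does not depend on the signal level, which is the property the text relies on for low energy phonemes.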


4.2.2 Weak Fricatives Detector (WFD)

The main drawback of ZCD is the misclassification of noise frames as active when the zero crossings of the noise frames satisfy equation (4.1).

The problem of discriminating speech from background noise is not trivial, except in acoustic environments with an extremely high Signal to Noise Ratio (SNR). For such high SNR environments, the energy of the lowest level speech sounds exceeds the background noise energy, and thus a simple energy measurement suffices. However, such ideal recording conditions are not practical for most applications (Rabiner 2004).

Therefore a method is required to separate weak fricatives from noise independently of SNR or other noise characteristics. This problem can be overcome by using the autocorrelation function, which exploits the high correlation found in speech signals.

The unbiased autocorrelation function is defined as

$A[x] = \frac{1}{n - |x|} \sum_{i=0}^{n-|x|-1} y[i]\, y[i + |x|]$ (4.2)

where
A[x] is the autocorrelation vector,
y[i] is the vector under consideration, and
n is the frame length.
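The unbiased estimate can be illustrated with a short Python sketch. It assumes the standard definition, in which the sum of products at lag x is divided by the number of available products n - x instead of n; the function name and the four-sample input are chosen only for the example.

```python
def unbiased_autocorr(y, max_lag):
    """Unbiased autocorrelation A[x] for lags x = 0..max_lag.

    Each lag sums the n - x available products and divides by n - x
    (rather than by n), which is what makes the estimate unbiased.
    """
    n = len(y)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(y)")
    return [sum(y[i] * y[i + x] for i in range(n - x)) / (n - x)
            for x in range(max_lag + 1)]

# Hypothetical four-sample frame, lags 0 to 2.
print(unbiased_autocorr([1.0, 2.0, 3.0, 4.0], 2))
```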

The incoming signal is segmented into frames of duration 20 ms. The energy of each subframe is computed as

$E_{\text{subframe}} = \sum_{\text{index}} A[(\text{subframe} - 1) \cdot 4 + \text{index}]^{2}$ (4.3)

where subframe takes values from 1 to the total number of subframes in the sample, and index denotes each sample in the given vector. Thus a vector of 20 such energy values is computed for each frame, which is denoted as

$E(j)$ (4.4)

where j is the frame under consideration. The classification parameter used is the variance of the above vector. The Autocorrelation Vector Variance (AVV) is determined as

$AVV(j) = \mathrm{var}(E(j))$ (4.5)

A reference value of the AVV for silence frames is computed by assuming that the first 20 frames are inactive:

$AVV_{ref} = \mathrm{mean}(AVV(j)), \quad j = 1, \ldots, 20$ (4.6)

The AVV of subsequent frames is compared with a scalar multiple of this reference value to determine speech activity.

If $AVV(j) > k \cdot AVV_{ref}$, Frame is ACTIVE
else Frame is INACTIVE (4.7)

The value of k was set to 7 after trial and error. Only active frames are marked as voiced signal; inactive frames are marked as unvoiced signal.

4.2.3 Pitch Period Based Detector (PBD)

Pitch period estimation is one of the most important problems in speech processing. Pitch detectors are used in vocoders and in speaker identification and verification systems. Pitch period estimation can be done using the autocorrelation function, which provides a convenient representation and forms the basis for pitch detection.

One of the major limitations of the autocorrelation representation is that it retains too much of the information in the speech signal. As a result, the autocorrelation function has too many peaks. To overcome this problem, it is useful to process the speech signal so as to make the periodicity more prominent while suppressing other distracting features of the signal. Numerous techniques have been proposed; the technique called centre clipping is reported in this thesis.

The centre clipped speech signal (Sondhi 1968) is obtained by the nonlinear transformation

$y(n) = C[x(n)]$ (4.8)

where C[ ] is shown in Figure 4.2.

Figure 4.2 Centre clipper transformation function

The operation of centre clipping is depicted in Figure 4.3.

Figure 4.3 Centre clipping affects a speech waveform

It can be seen that for samples above the clipping level CL, the output of the centre clipper is equal to the input minus the clipping level. For samples whose magnitude is below the clipping level, the output is zero. For high clipping levels, fewer peaks exceed the clipping level and thus fewer pulses appear in the output. If the clipping level is decreased, more peaks pass through the clipper and the autocorrelation function becomes more complex (Rabiner 2004).

The problem of extraneous peaks in the autocorrelation function can be eliminated by centre clipping prior to computing the autocorrelation function. However, another difficulty with the autocorrelation representation is the large amount of computation that is required. A simple modification of the centre clipping function leads to a great simplification of the autocorrelation computation: the output of the clipper is +1 if x(n) > +CL and -1 if x(n) < -CL; otherwise the output is zero. The computation of the autocorrelation function for a 3-level centre clipped signal is particularly simple. Most of the extraneous peaks are eliminated, and a clear indication of periodicity is retained. The three level centre clipping function is shown in Figure 4.4.

Figure 4.4 Three level centre clipping function
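The two clipping functions described above can be sketched in Python as follows. The text states the behaviour of the centre clipper only for samples above the clipping level; the symmetric treatment of samples below -CL (input plus CL) is assumed here, and the function names are illustrative.

```python
def centre_clip(x, cl):
    """Centre clipper: remove the band [-cl, +cl] around zero.

    Samples above +cl are reduced by cl, samples below -cl are raised
    by cl (assumed symmetric behaviour), everything else becomes zero.
    """
    out = []
    for v in x:
        if v > cl:
            out.append(v - cl)
        elif v < -cl:
            out.append(v + cl)
        else:
            out.append(0.0)
    return out

def three_level_clip(x, cl):
    """Three level clipper: +1 above +cl, -1 below -cl, otherwise 0."""
    return [1 if v > cl else -1 if v < -cl else 0 for v in x]

samples = [0.5, 0.2, -0.1, -0.6]   # hypothetical samples
print(centre_clip(samples, 0.3))
print(three_level_clip(samples, 0.3))
```

With the three level output, every product in the autocorrelation sum is -1, 0 or +1, so the sum can be formed with additions and subtractions only.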

A novel algorithm for estimating the pitch period from the short-time autocorrelation function was proposed by Dubnowski et al (1976). The steps in the pitch based VAD algorithm are given below:

i. The speech signal is filtered with a 900 Hz low pass analog

filter and sampled at a rate of 10 kHz.

ii. Segments of length 30msec are selected at 10msec intervals.

iii. Using the clipping level, the speech signal is processed by a 3-

level centre clipper and the correlation function is computed

over a range spanning the expected range of pitch periods.

iv. The largest peak of the autocorrelation function is located and

the peak value is compared to a fixed threshold. If the peak

falls below threshold, the segment is classified as unvoiced

else the segment is voiced.
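Steps iii and iv above can be sketched in Python for a single 30 ms segment. This is a minimal sketch under stated assumptions, not the detector evaluated in this chapter: the clipping level is taken as a fraction of the segment peak (clip_frac), the peak is normalised by the zero-lag value before comparison with peak_thresh, and the search covers pitch frequencies in f0_range. All three parameters and their default values are hypothetical. The 900 Hz low pass filter of step i is omitted.

```python
import math

def is_voiced(segment, fs, clip_frac=0.6, peak_thresh=0.3, f0_range=(60.0, 400.0)):
    """Classify one segment as voiced (True) or unvoiced (False)."""
    cl = clip_frac * max(abs(v) for v in segment)           # assumed clipping level
    c = [1 if v > cl else -1 if v < -cl else 0 for v in segment]  # 3-level clip
    r0 = sum(v * v for v in c)                               # zero-lag autocorrelation
    if r0 == 0:
        return False                                         # nothing survived clipping
    min_lag = int(fs / f0_range[1])
    max_lag = min(int(fs / f0_range[0]), len(c) - 1)
    # Largest autocorrelation value over the expected range of pitch periods.
    peak = max(sum(c[i] * c[i + lag] for i in range(len(c) - lag))
               for lag in range(min_lag, max_lag + 1))
    return peak / r0 >= peak_thresh

# Hypothetical input: 30 ms of a 100 Hz periodic signal sampled at 10 kHz.
fs = 10_000
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(300)]
print(is_voiced(voiced, fs))
```

A full detector would apply this test to 30 ms segments taken every 10 ms, as in step ii.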


4.2.4 Energy Based Detector (EBD)

The amplitude of the speech signal varies appreciably with time. The amplitude of unvoiced segments is generally much lower than that of voiced segments. The energy of a signal provides a convenient representation that reflects these amplitude variations. The energy of a frame indicates the possible presence of voice data and is an important parameter used in VAD algorithms.

Let x(i) be the ith sample of speech. If the length of the frame is k samples, then the jth frame can be represented in the time domain by the sequence

$f_j = \{x(i)\}_{i=(j-1)k+1}^{jk}$ (4.9)

$E_j$ represents the energy of the jth frame:

$E_j = \frac{1}{k} \sum_{i=(j-1)k+1}^{jk} x^{2}(i)$ (4.10)

The VAD algorithm is trained for a small period on a prerecorded sample that contains only background noise. The initial thresholds for the various parameters are computed from this sample. The initial energy threshold is obtained by taking the mean of the energies $E_m$ of the frames in the sample:

$E_r = \frac{1}{\nu} \sum_{m=1}^{\nu} E_m$ (4.11)

where $E_r$ is the initial threshold estimate and $\nu$ is the number of frames in the prerecorded sample; the initial 20 frames are considered INACTIVE.

The classification rule for speech is as follows:

if $E_j > k E_r$ (k > 1), frame is ACTIVE
else frame is INACTIVE (4.12)

Here, $E_r$ represents the energy of the noise frames, while k is the scaling factor applied to the threshold in the decision making (distinct from the frame length in (4.9)). Active frames are transmitted while inactive frames are not transmitted.
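The energy rule can be sketched in Python as follows. This is an illustrative sketch only: the frame energy is taken as the mean of the squared samples, the noise reference is the mean energy of the first n_init frames (20 in the text), and the multiplier k = 2.0 is a hypothetical value, since the text only requires k > 1.

```python
def frame_energies(x, frame_len):
    """Mean-square energy of consecutive non-overlapping frames."""
    return [sum(v * v for v in x[s:s + frame_len]) / frame_len
            for s in range(0, len(x) - frame_len + 1, frame_len)]

def ebd_vad(x, frame_len, k=2.0, n_init=20):
    """Return one flag per frame: True means ACTIVE."""
    energies = frame_energies(x, frame_len)
    if len(energies) < n_init:
        raise ValueError("need at least n_init frames to form the threshold")
    e_r = sum(energies[:n_init]) / n_init      # mean energy of the noise-only frames
    # The first n_init frames are taken as INACTIVE by assumption.
    return [j >= n_init and e > k * e_r for j, e in enumerate(energies)]

# Hypothetical signal: 20 quiet frames, 2 loud frames, 1 quiet frame (4 samples each).
noise = [0.1, -0.1] * 40
speech = [1.0, -1.0] * 4
flags = ebd_vad(noise + speech + [0.1, -0.1] * 2, frame_len=4)
print(flags[20:])
```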

Energy based decisions are not good for low energy phonemes.

Weak fricatives are sometimes silenced completely. High energy voiced

speech segments are detected in all VAD algorithms even under noise

conditions. However, low energy unvoiced speech is commonly missed,

reducing speech quality.

4.2.5 Subband OSF Based VAD

Javier Ramirez et al (2005) propose determining the speech / nonspeech divergence by means of specialized Order Statistics Filters (OSF) working on the subband log-energies. Filters based on order statistics have been successfully employed in the restoration of signals and images corrupted by additive noise. The most common OSF is the median filter, which is easy to implement and exhibits good performance in removing impulsive noise.

Figure 4.5 shows the block diagram of the subband based VAD. This algorithm operates on the subband log-energies. Noise reduction is performed first, and the VAD decision is formulated on the de-noised signal. The noisy speech signal is decomposed into 25 ms frames with a 10 ms window shift. Let X(m,l) be the spectrum magnitude for the mth band at frame l. The design of the noise reduction block is based on Wiener Filter (WF) theory, whereby the attenuation is a function of the Signal to Noise Ratio (SNR) of the input signal. The subband log-energies of the de-noised signal are then processed by means of order statistics filters.

Figure 4.5 Block diagram of Subband OSF based VAD
(Blocks: FFT, NOISE REDUCTION, VAD, SPECTRAL SMOOTHING, WF DESIGN, FREQUENCY DOMAIN FILTERING, NOISE UPDATE.)

The noise reduction block consists of four stages.

i) Spectrum smoothing: The power spectrum is averaged over two consecutive frames and two adjacent spectral bands.

ii) Noise estimation: The noise spectrum $N_e(m,l)$ is updated by means of a 1st order IIR filter on the smoothed spectrum $X_s(m,l)$,

$N_e(m,l) = \lambda N_e(m,l-1) + (1-\lambda) X_s(m,l)$ (4.13)

where $\lambda = 0.99$ and $m = 0, 1, \ldots, \mathrm{NFFT}/2$ (NFFT is the FFT length).

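iii) Wiener Filter (WF) design: First, the clean signal $S(m,l)$ is estimated by combining smoothing and spectral subtraction

$S(m,l) = \gamma S'(m,l-1) + (1-\gamma) \max(X_s(m,l) - N_e(m,l),\, 0)$ (4.14)

where $\gamma = 0.98$, $X_s(m,l)$ is the smoothed spectrum and $N_e(m,l)$ is the noise spectrum estimate.

Then, the WF $H(m,l)$ is designed as

$H(m,l) = \frac{\eta(m,l)}{1 + \eta(m,l)}$ (4.15)

where

$\eta(m,l) = \max\left[\frac{S(m,l)}{N_e(m,l)},\, \eta_{\min}\right]$ (4.16)

and $\eta_{\min}$ is selected so that the filter yields a 20 dB maximum attenuation. $S'(m,l)$, the spectrum of the cleaned speech signal, is assumed to be zero at the beginning of the process and is used for designing the WF through equation (4.13) to equation (4.15). It is given by

$S'(m,l) = H(m,l) X(m,l)$ (4.17)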

The filter $H(m,l)$ is smoothed in order to eliminate rapid changes between neighboring frequencies that may often cause musical noise. Thus, the variance of the residual noise is reduced and, consequently, the robustness when detecting nonspeech is enhanced. The smoothing is performed by truncating the impulse response of the corresponding causal FIR filter to 17 taps using a Hanning window. With this operation performed in the time domain, the frequency response of the Wiener Filter is smoothed and the performance of the VAD is improved.

iv) Frequency domain filtering: The smoothed filter $H_s(m,l)$ is applied in the frequency domain to obtain the de-noised spectrum

$Y(m,l) = H_s(m,l) X(m,l)$ (4.18)
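Stages ii and iii of the noise reduction block can be sketched for one frame in Python. The sketch rests on several assumptions that are not stated in the text: the 20 dB maximum attenuation is read as a minimum filter gain of 0.1 on the magnitude spectrum, which fixes the floor of the SNR term at 0.1/0.9; the 17-tap Hanning smoothing of the filter is omitted, so the unsmoothed gain is applied directly; and the function and variable names are illustrative.

```python
def wiener_frame(x_smooth, x, noise_prev, s_prev, lam=0.99, gamma=0.98, max_att_db=20.0):
    """One frame of noise estimation and Wiener filter design, bin by bin.

    x_smooth: smoothed spectrum, x: unsmoothed spectrum,
    noise_prev: previous noise estimate (must be positive),
    s_prev: cleaned spectrum of the previous frame (zeros at start-up).
    Returns (noise estimate, filter gain, cleaned spectrum).
    """
    h_min = 10.0 ** (-max_att_db / 20.0)   # assumed: 20 dB -> gain floor of 0.1
    eta_min = h_min / (1.0 - h_min)        # SNR floor giving exactly that gain
    noise, gain, cleaned = [], [], []
    for xs, xm, n0, s0 in zip(x_smooth, x, noise_prev, s_prev):
        n = lam * n0 + (1.0 - lam) * xs                    # first-order IIR noise update
        s = gamma * s0 + (1.0 - gamma) * max(xs - n, 0.0)  # smoothing + spectral subtraction
        eta = max(s / n, eta_min)                          # floored SNR estimate
        h = eta / (1.0 + eta)                              # Wiener gain
        noise.append(n)
        gain.append(h)
        cleaned.append(h * xm)                             # cleaned spectrum for next frame
    return noise, gain, cleaned

# Hypothetical single-bin frame whose level equals the noise estimate.
print(wiener_frame([1.0], [1.0], [1.0], [0.0]))
```

When the smoothed spectrum does not exceed the noise estimate, the gain sits at its floor, which is how the attenuation limit takes effect.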

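Once the input speech has been de-noised, the log-energies for the lth frame, $E(k,l)$, in K subbands ($k = 0, 1, \ldots, K-1$) are computed by means of

$E(k,l) = \log\left(\frac{K}{\mathrm{NFFT}} \sum_{m=m_k}^{m_{k+1}-1} |Y(m,l)|^{2}\right), \quad m_k = \left\lfloor \frac{\mathrm{NFFT}}{2K}\, k \right\rfloor, \quad k = 0, 1, \ldots, K-1$ (4.19)

where an equally spaced subband assignment is used.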
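The subband log-energy computation can be sketched in Python as follows. The sketch assumes equally spaced band edges at floor(NFFT * k / (2K)), a natural logarithm, and non-zero energy in every band; the function name and the flat test spectrum are illustrative.

```python
import math

def subband_log_energies(y_mag, nfft, n_bands):
    """Log-energy of one de-noised frame in n_bands equally spaced subbands.

    y_mag holds the magnitude spectrum for bins 0..nfft/2.
    """
    # Band edges: edge k is floor(nfft * k / (2 * n_bands)).
    edges = [(nfft * k) // (2 * n_bands) for k in range(n_bands + 1)]
    return [math.log(n_bands / nfft * sum(v * v for v in y_mag[edges[k]:edges[k + 1]]))
            for k in range(n_bands)]

# Hypothetical flat spectrum: NFFT = 16, K = 4 subbands of two bins each.
print(subband_log_energies([1.0] * 9, nfft=16, n_bands=4))
```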

The algorithm uses two OSFs for the multiband quantile (MBQ) SNR estimation. A first OSF estimates the subband signal energy by means of

$Q_p(k,l) = (1-f)\, E_{(s)}(k,l) + f\, E_{(s+1)}(k,l)$ (4.20)

where $Q_p(k,l)$ is the p sampling quantile, $s = \lfloor 2pN \rfloor$, $f = 2pN - s$, and $E_{(r)}(k,l)$ denotes the rth smallest log-energy of band k in the window of frames around frame l. Finally, the SNR in each subband is measured by

$QSNR(k,l) = Q_p(k,l) - E_N(k)$ (4.21)

where $E_N(k)$ is the noise level in the kth band that needs to be estimated. For the initialization of the algorithm, the first N frames are assumed to be nonspeech frames and the noise level in the kth band, $E_N(k)$, is estimated as the median of the set $\{E(k,0), E(k,1), \ldots, E(k,N-1)\}$. In order to track nonstationary noisy environments, the noise references are updated during nonspeech periods by means of a second OSF (a median filter)

$E_N(k) = \alpha E_N(k) + (1-\alpha)\, Q_{0.5}(k,l), \quad k = 0, 1, \ldots, K-1$ (4.22)

where $Q_{0.5}(k,l)$ is the output of the median filter and $\alpha = 0.97$ was experimentally selected. On the other hand, the sampling quantile p = 0.9 is selected as a good estimation of the subband spectral envelope.

The decision rule is then formulated in terms of the average subband SNR

$SNR(l) = \frac{1}{K} \sum_{k=0}^{K-1} QSNR(k,l)$ (4.23)
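The quantile filter and the average subband SNR can be sketched in Python. The sketch assumes that the order statistics are taken over a window of 2N + 1 log-energies centred on the current frame and indexed from zero after sorting, so that p = 0.5 returns the median; the function names and test values are illustrative.

```python
import math

def sampling_quantile(window, p):
    """p sampling quantile of a window of 2N + 1 log-energies.

    Interpolates between the s-th and (s+1)-th sorted values with
    s = floor(2pN) and f = 2pN - s.
    """
    two_n = len(window) - 1
    ordered = sorted(window)
    s = math.floor(p * two_n)
    f = p * two_n - s
    if s >= two_n:                       # p = 1: the maximum, no interpolation
        return ordered[two_n]
    return (1.0 - f) * ordered[s] + f * ordered[s + 1]

def average_qsnr(windows, noise_levels, p=0.9):
    """Mean over subbands of (quantile of band k) minus (noise level of band k)."""
    q = [sampling_quantile(w, p) - e_n for w, e_n in zip(windows, noise_levels)]
    return sum(q) / len(q)

# Hypothetical window of five log-energies (N = 2) for one subband.
print(sampling_quantile([3.0, 0.0, 4.0, 1.0, 2.0], 0.5))
```

Choosing a high quantile such as p = 0.9 tracks the upper envelope of the window, whereas p = 0.5 gives the median used for the noise reference.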

If the SNR is greater than a threshold $\theta$, the current frame is classified as speech; otherwise it is classified as nonspeech. It is assumed that the system will work at different noisy conditions and that an optimal threshold can be determined for the system working in the cleanest ($\theta_0$) and noisiest ($\theta_1$) conditions. Thus, the threshold is adaptive to the measured full-band noise energy E

$\theta = \begin{cases} \theta_0, & E \le E_0 \\ \frac{\theta_0 - \theta_1}{E_0 - E_1} E + \theta_0 - \frac{\theta_0 - \theta_1}{1 - E_1/E_0}, & E_0 < E < E_1 \\ \theta_1, & E \ge E_1 \end{cases}$ (4.24)

thus enabling the VAD to select the optimum working point for different SNR conditions. Note that the threshold is linearly decreased as the noise level is increased between $(E_0, \theta_0)$ and $(E_1, \theta_1)$, which represent the optimum thresholds for the cleanest and noisiest conditions defined by the noise energies $E_0$ and $E_1$, respectively.
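The linear adaptation of the threshold described above can be sketched in Python. The sketch interpolates between the threshold for the cleanest condition and the threshold for the noisiest condition and holds the end values outside that range; the function name, argument names and numerical values are hypothetical.

```python
def adaptive_threshold(e, e_clean, e_noisy, thr_clean, thr_noisy):
    """Decision threshold as a function of the measured noise energy e.

    Below e_clean the clean-condition threshold is used, above e_noisy
    the noisy-condition threshold, and in between the threshold moves
    linearly from one to the other.
    """
    if e <= e_clean:
        return thr_clean
    if e >= e_noisy:
        return thr_noisy
    return thr_clean + (thr_noisy - thr_clean) * (e - e_clean) / (e_noisy - e_clean)

# Hypothetical operating points: threshold 6 at energy 30, threshold 2 at energy 50.
for energy in (20.0, 40.0, 60.0):
    print(energy, adaptive_threshold(energy, 30.0, 50.0, 6.0, 2.0))
```

A frame would then be classified as speech when its average subband SNR exceeds the returned threshold.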


4.3 DRAWBACKS OF EXISTING ALGORITHMS

The existing algorithm is based on the assumption that the noise spectrum does not vary significantly within an N-frame neighborhood of the lth frame. However, this is not true in the case of highly non-stationary noise. The noise estimate of the first frame is used to de-noise the next 8 frames, and the noise estimate is very low for the first frame. The algorithm therefore fails to evaluate the noise spectrum at the beginning, and the subsequent detection could be totally erroneous. The existing algorithm also fails to update the threshold in low noise conditions. This degrades the performance of the VAD.

4.3.1 Proposed Algorithm

The proposed algorithm does not depend on the feedback loop for

noise spectrum estimation. Instead it uses a noise estimation algorithm which

updates noise for every frame. This method of noise estimation is best suited

for highly non stationary environments, thus increasing the robustness as

discussed in Sundarrajan Rangachari et al (2004).

Figure 4.6 Block diagram of proposed VAD
(Blocks: FFT, NOISE REDUCTION, VAD, SPECTRAL SMOOTHING, WF DESIGN, FREQUENCY DOMAIN FILTERING, NOISE UPDATE.)

The noise estimate is updated by averaging the noisy speech power spectrum using a time and frequency dependent smoothing factor, which is adjusted based on the signal presence probability in subbands. It improves the speech/non-speech discriminability and the speech recognition performance in noisy environments. Two problems are addressed by the proposed VAD: the first is the performance of the VAD in low noise conditions and the second is its performance in noisy environments. The block diagram of the proposed VAD is shown in Figure 4.6.

The noise estimation algorithm is as follows.

The smoothed power spectrum of the noisy speech signal is estimated using a first-order recursive formula as

$P(\lambda,k) = \eta P(\lambda-1,k) + (1-\eta) |Y(\lambda,k)|^{2}$ (4.25)

where $|Y(\lambda,k)|^{2}$ is an estimate of the short-time power spectrum of the noisy speech, $\eta$ is the smoothing constant, $\lambda$ is the frame index and k is the frequency bin index.

Since the noisy speech power spectrum in the speech-absent frames is equal to the power spectrum of the noise, the estimate of the noise spectrum can be updated by tracking the speech-absent frames. To do so, the ratio of the energy of the noisy speech power spectrum in three different frequency bands (low: 0-1 kHz, middle: 1-3 kHz, high: 3 kHz and above) to the energy of the corresponding frequency band in the previous noise estimate is computed. The following three ratios are computed:

$\delta_L(\lambda) = \frac{\sum_{k=1}^{LF} P(\lambda,k)}{\sum_{k=1}^{LF} N(\lambda-1,k)}$ (4.26)

$\delta_M(\lambda) = \frac{\sum_{k=LF+1}^{MF} P(\lambda,k)}{\sum_{k=LF+1}^{MF} N(\lambda-1,k)}$ (4.27)

$\delta_H(\lambda) = \frac{\sum_{k=MF+1}^{Fs/2} P(\lambda,k)}{\sum_{k=MF+1}^{Fs/2} N(\lambda-1,k)}$ (4.28)

where $N(\lambda,k)$ is the estimate of the noise power spectrum at frame $\lambda$, $P(\lambda,k)$ is the smoothed power spectrum of the noisy speech, and LF (Low Frequency), MF (Medium Frequency) and Fs correspond to the frequency bins of 1 kHz, 3 kHz and the sampling frequency, respectively. The incoming frame is classified as a speech-absent frame if the following condition is satisfied:

$\delta_L(\lambda) < \delta$ and $\delta_M(\lambda) < \delta$ and $\delta_H(\lambda) < \delta$ (4.29)

where $\delta$ is a threshold. For a speech-absent frame, the noise estimate is updated according to

$N(\lambda,k) = \varepsilon N(\lambda-1,k) + (1-\varepsilon) |Y(\lambda,k)|^{2}$ (4.30)

where $\varepsilon$ is a smoothing constant. If any or all of the above three ratios are larger than the threshold $\delta$, then a different algorithm is used for updating and estimating the noise spectrum.
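The speech-absent test and the corresponding noise update can be sketched in Python. This is an illustrative sketch under stated assumptions: the three bands are bins 0 to lf, lf+1 to mf, and mf+1 to the last bin, where lf and mf are the bin indices of 1 kHz and 3 kHz; every band is non-empty and the previous noise estimate is positive. The threshold delta and smoothing constant eps are hypothetical values, since the text does not give them.

```python
def band_ratios(p, noise_prev, lf, mf):
    """Ratios of smoothed noisy-speech energy to previous noise energy
    in the low, middle and high frequency bands."""
    def ratio(start, stop):
        return sum(p[start:stop]) / sum(noise_prev[start:stop])
    return ratio(0, lf + 1), ratio(lf + 1, mf + 1), ratio(mf + 1, len(p))

def update_if_speech_absent(y_pow, p, noise_prev, lf, mf, delta, eps):
    """Return (speech_absent, noise estimate).

    The frame is speech absent only if all three band ratios are below
    delta; in that case the noise estimate is recursively averaged with
    the current noisy power spectrum, otherwise it is returned unchanged.
    """
    if all(r < delta for r in band_ratios(p, noise_prev, lf, mf)):
        return True, [eps * n + (1.0 - eps) * y for n, y in zip(noise_prev, y_pow)]
    return False, noise_prev

# Hypothetical 8-bin spectra; lf = 1 and mf = 4 mark the band edges.
print(update_if_speech_absent([3.0] * 8, [1.0] * 8, [1.0] * 8, 1, 4, delta=2.0, eps=0.5))
```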

In the case of speech-present frames, the noise update is as follows.

Frequency bins are classified as speech present or absent by tracking the local minimum of the noisy speech, and speech presence in each frequency bin is then decided separately using the ratio of the noisy speech power to its local minimum. A non-linear rule is used for tracking the minimum of the noisy speech by continuously averaging the past spectral values:

if $P_{\min}(\lambda-1,k) < P(\lambda,k)$
then
$P_{\min}(\lambda,k) = \gamma P_{\min}(\lambda-1,k) + \frac{1-\gamma}{1-\beta} \left(P(\lambda,k) - \beta P(\lambda-1,k)\right)$ (4.31)
else
$P_{\min}(\lambda,k) = P(\lambda,k)$

where $P_{\min}(\lambda,k)$ is the local minimum of the noisy speech power spectrum, and $\beta$ and $\gamma$ are constants whose values are determined experimentally.
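The minimum tracking rule can be sketched in Python. The sketch assumes the commonly used form of this non-linear rule, in which the previous minimum is weighted by a constant gamma and the difference between the current and the beta-weighted previous smoothed spectrum is scaled by (1 - gamma) / (1 - beta). The function name and the values gamma = beta = 0.5 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def track_minimum(p_min_prev, p_prev, p, gamma, beta):
    """Update the local minimum of the smoothed noisy power spectrum, bin by bin.

    If the previous minimum is below the current value, the minimum is
    allowed to rise slowly; otherwise it drops to the current value at once.
    """
    out = []
    for m0, p0, p1 in zip(p_min_prev, p_prev, p):
        if m0 < p1:
            out.append(gamma * m0 + (1.0 - gamma) / (1.0 - beta) * (p1 - beta * p0))
        else:
            out.append(p1)
    return out

# Hypothetical two-bin example: the first bin rises, the second bin falls.
print(track_minimum([1.0, 5.0], [2.0, 2.0], [4.0, 4.0], gamma=0.5, beta=0.5))
```

The ratio of the smoothed power to this tracked minimum is then compared with a threshold in each frequency bin to decide on speech presence.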

Let $S_r(\lambda,k) = P(\lambda,k) / P_{\min}(\lambda,k)$ denote the ratio between the energy of the noisy speech and its local minimum. This ratio is compared against a frequency-dependent threshold, and if it is found to be larger than that threshold, the corresponding frequency bin is considered to contain speech.

Using the above ratio $S_r(\lambda,k)$, the new frequency-dependent smoothing constant can be estimated as follows:

( ) =( ) ( )

(4.32)

where , are smoothing constants ( , ) and ( ) is a frequency-

dependent threshold given as

( ) =1.3 1

3 5 /2

(4.33)

Finally, after computing the frequency-dependent smoothing factor $\alpha_s(\lambda,k)$, the noise spectrum estimate is updated according to

$N(\lambda,k) = \alpha_s(\lambda,k) N(\lambda-1,k) + (1 - \alpha_s(\lambda,k)) |Y(\lambda,k)|^{2}$ (4.34)
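The final update can be sketched in Python as a per-bin recursive average. The function name and the numerical values are illustrative; how the smoothing factor itself is obtained is not shown here.

```python
def update_noise(noise_prev, y_pow, alpha_s):
    """Per-bin noise update: alpha * previous estimate + (1 - alpha) * noisy power.

    alpha_s holds one smoothing factor per frequency bin. A factor of 1
    leaves the previous estimate untouched, while smaller factors let
    the estimate follow the current noisy power spectrum more quickly.
    """
    return [a * n + (1.0 - a) * y for n, y, a in zip(noise_prev, y_pow, alpha_s)]

# Hypothetical two-bin example: bin 0 is frozen, bin 1 moves halfway.
print(update_noise([2.0, 2.0], [4.0, 4.0], [1.0, 0.5]))
```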

4.4 RESULTS AND DISCUSSIONS

The proposed structure for increasing the recognition accuracy of

the robust speech recognition system using VAD algorithms is shown in the

Figure 4.7. The system consists of two main parts, preprocessor and ASR.

The preprocessor includes Voice Activity Detector (VAD). VAD identifies

the presence or absence of speech and extracts the speech from the noise

corrupted speech.

Figure 4.7 Structure of speech recognition system
(Blocks: Input speech, Noise Estimation, VAD, ASR, Recognition Accuracy.)

Figure 4.8 shows the original clean speech signal. Figure 4.9 shows the output of the existing algorithm when the original signal corrupted by airport noise at an SNR of 0 dB is given as input. Due to false estimation of the noise spectrum, the algorithm fails at the beginning of the utterance itself, so most of the noise-only frames are classified as speech-present frames. Figure 4.10 shows the output of the proposed algorithm. The speech frames are extracted correctly from the noisy speech signal.

One hundred words were taken for speech recognition (using isolated word recognition with statistical modeling, the Hidden Markov Model) after adding various noise environments. The input word utterances were analyzed under the most commonly encountered noise environments, namely suburban train noise, babble, car, exhibition hall, restaurant, street, airport and train-station noise, taken from the AURORA database.

In the training phase, 100 samples of each of the uttered digits 0-9, from both male and female speakers (aged 15-25), were recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz from a single-channel input and saved as wave files using sound recorder software.

The proposed framework uses a speech processing module that includes Hidden Markov Model (HMM)-based classification and noise language modeling to achieve effective noise knowledge estimation, as discussed in chapter 2.

The performance of ASR was analyzed under noisy conditions, both without and with VAD, and the accuracy in percentage is shown in Figure 4.11. The Subband Order Statistics Filter (OSF) algorithm performs better than the other VAD algorithms, and the recognition accuracy of all VAD algorithms can be improved if noise estimation in the non-stationary environment is considered. This chapter presented a proposed structure of Speech Recognition Systems with the Subband Order Statistics Filter (OSF) that improves speech detection robustness in noisy environments. The approach is based on an effective endpoint detection algorithm employing noise reduction techniques and order statistic filters for the formulation of the decision rule.

Automatic speech recognition systems work reasonably well under clean conditions but become fragile in practical applications involving real-world environments.

Figure 4.8 Original clean speech signal

Figure 4.9 Output of existing VAD

Figure 4.10 Output of proposed algorithm

(Plots omitted; panel titles include "Original signal without noise" and "Noisy input signal".)

Table 4.1 through Table 4.8 depict the performance of the Subband Order Statistics Filter (OSF) based Voice Activity Detection of Ramirez and of the proposed algorithm under various noise conditions, in terms of improvement in Recognition Accuracy (RA).

From Tables 4.1 and 4.2 it was observed that, at an SNR of 0 dB, the ASR with the proposed VAD performed best in the presence of the Babble noise source, with a 20.81% improvement in RA compared to the existing algorithm. Across the various noise sources, the proposed algorithm shows an improvement of 13.54% in RA when compared to the algorithm proposed by Ramirez et al (2004).

From Tables 4.3 and 4.4 it was found that, at an SNR of 5 dB, the proposed algorithm performed best in the presence of Exhibition noise, with an 11.71% improvement in RA. The proposed algorithm shows an improvement in RA of 8.07% when compared to the existing algorithm (Ramirez et al).

From Tables 4.5 and 4.6 it was observed that, at an SNR of 10 dB, the proposed algorithm shows an improvement in RA of 8.27% for the Train noise source. The existing algorithm has an average RA of 80.01%, and the proposed algorithm has an average RA of 85.18%.

From Tables 4.7 and 4.8 it was inferred that, at an SNR of 15 dB, the proposed algorithm performed best in the presence of the Airport noise source, with a 5.64% improvement in RA. The proposed algorithm shows an improvement in RA of 3.67% when compared to the existing algorithm.

Table 4.9 shows the performance of ASR. The proposed method performs best with a maximum improvement of 20.81% RA for Babble noise and a minimum improvement of 2.26% RA for Street noise. The overall performance analysis of the existing VAD algorithms and the proposed algorithm is shown in Table 4.10.

Table 4.9 Overall performance analysis of proposed VAD algorithm in terms of % improvement in RA

Percentage Improvement   0 dB                5 dB                   10 dB              15 dB
Better                   Babble (20.81 %)    Exhibition (11.71 %)   Train (8.27 %)     Airport (5.64 %)
Least                    Airport (6.33 %)    Airport (6.13 %)       Babble (4.21 %)    Street (2.26 %)

Table 4.10 Overall performance analysis of VAD Algorithms

VAD Method       0 dB (% Accuracy)   5 dB (% Accuracy)   10 dB (% Accuracy)   15 dB (% Accuracy)
EBD              18.20               28.80               29.35                39.66
ZCD              20.30               25.00               30.62                41.92
WFD              19.40               29.42               31.88                40.12
PBD              17.45               22.25               34.82                42.25
Ramirez et al    61.23               73.59               80.08                90.2
Proposed         70.89               80.05               85.18                93.64

The recognition accuracy at different SNR values for the Subband OSF based VAD and the proposed method is shown in Figure 4.11. It was observed that the best recognition occurred for Restaurant noise (84.225%) and the least recognition for Exhibition noise (78.625%).

Figure 4.11 Comparison of Ramirez et al and proposed VAD method for various noise environments
(Chart title: "Overall % of RA for Proposed and Existing VAD"; vertical axis: % of RA, 65 to 90; horizontal axis: Noise Sources; series: Proposed VAD, Existing VAD.)

The proposed VAD works well for non-stationary signals. In most speech enhancement schemes, the noise signal is suppressed and the speech signal is enhanced. In the proposed VAD algorithm, a new noise estimation algorithm is presented along with the OSF, which improves the quality as well as the RA of the speech recognition system.

4.5 CONCLUSION

The algorithms based solely on energy did not give an acceptable Speech Recognition Accuracy with all the test templates. The other techniques (autocorrelation function and Zero Crossing Detection) gave better Speech Recognition Accuracy. The ZCD was used to recover some low energy phonemes that were rejected by the energy-based detector. However, it also picked up certain noise frames that matched the zero crossing criteria. The WFD technique performed better than the ZCD in the detection of weak fricatives. A pitch based detection algorithm is designed to estimate the pitch or fundamental frequency of a quasi-periodic or virtually periodic signal. The performance of the PBD differs from that of the other techniques; it performs as well as the WFD.

The proposed method combines the noise estimation algorithm with the VAD algorithms so that improved speech recognition accuracy can be obtained under these noise conditions.

This chapter presented a proposed structure of Speech Recognition Systems with the Subband Order Statistics Filter (OSF) that improves speech detection robustness in noisy environments. The approach is based on an effective endpoint detection algorithm employing noise reduction techniques and order statistic filters for the formulation of the decision rule. The proposed algorithm performs better than the existing algorithm in the case of non-stationary noise.
