Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

1

Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE

Presented by Chen Hung_Bin

2

Outline

• Introduction to endpoint detection
• Endpoint detection methods
• Endpoint detection (Filter)
• State Transition
• Experiment

3

Introduction

• The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection.

• This paper addresses endpoint detection by sequential and batch-mode processes to support real-time recognition.
– sequential: used in automatic speech recognition (ASR)
– batch-mode: utterances are usually as short as a few seconds, and the delay in response is usually small.

4

Introduction

• Endpoint detection methods include:
– energy threshold
– pitch detection
– spectrum analysis
– cepstral analysis
– zero-crossing rate
– periodicity measure
– chi-square test
– entropy
– hybrid detection

5

Introduction

• energy

\[
\text{magnitude} = \sum_{i=1}^{N} \lvert x[i] \rvert
\]

\[
\text{energy} = \sum_{i=1}^{N} x[i]^2
\]

\[
\text{dB} = 10 \log_{10} \sum_{i=1}^{N} x[i]^2
\]

But, frequently the value of e(t) is measured in dB.
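The three frame measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `frame_measures` and the small floor used to avoid taking the log of zero are assumptions, not part of the slides.

```python
import math


def frame_measures(x):
    """Return (magnitude, energy, energy_db) for one frame of samples x."""
    magnitude = sum(abs(v) for v in x)      # sum of |x[i]|
    energy = sum(v * v for v in x)          # sum of x[i]^2
    # Assumption: floor the energy so an all-zero frame does not hit log10(0).
    energy_db = 10.0 * math.log10(max(energy, 1e-10))
    return magnitude, energy, energy_db


if __name__ == "__main__":
    print(frame_measures([1.0, -1.0, 2.0, -2.0]))
```

Calling the function once per frame gives an energy contour in dB over the utterance.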

6

Introduction

• A Mandarin digit “eight.”

• spectrum

7

Introduction

• zero-crossing rate

8

Introduction

• The chi-square test is given by

\[
\chi^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i}
\]

• The hypothesis test can thus be written as

\[
\chi^2 \underset{H_0}{\overset{H_1}{\gtrless}} \text{threshold}
\]
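As a rough illustration of the test above, the statistic and the threshold decision might be coded as follows. The names `chi_square` and `decide_h1` are hypothetical, and the slide does not say which hypothesis corresponds to speech, so the sketch only reports whether H1 is chosen.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over i of (o_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def decide_h1(observed, expected, threshold):
    """Return True when the statistic exceeds the threshold (decide H1)."""
    return chi_square(observed, expected) > threshold


if __name__ == "__main__":
    print(chi_square([12.0, 18.0], [10.0, 20.0]))
    print(decide_h1([12.0, 18.0], [10.0, 20.0], threshold=0.5))
```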

9

Introduction

• entropy

\[
H(x) = \sum_{k} P(x_k) \log \frac{1}{P(x_k)}
\]
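A minimal sketch of the entropy measure, assuming the probabilities P(x_k) are already available as a list that sums to one. The slide does not give the base of the logarithm, so the `base` parameter (default 2) is an assumption, as is the function name.

```python
import math


def entropy(probs, base=2.0):
    """H(x) = sum over k of P(x_k) * log(1 / P(x_k)); zero-probability terms are skipped."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(entropy([0.5, 0.5]))      # two equally likely outcomes
    print(entropy([0.9, 0.1]))      # a more peaked distribution
```

A flat distribution gives a higher value than a peaked one, which is the property an entropy-based detector relies on.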

10

Introduction

• Endpoint detection is crucial to accuracy and speed for several reasons:

– It is hard to model noise and silence accurately in changing environments.

– If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech.

– Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy.

11

Introduction

• Point made in this study:
– The more accurately we can detect endpoints, the better we can do on real-time energy normalization.

• Requirements:
– Accurate location of detected endpoints
– Robust detection at various noise levels
– Low computational complexity
– Fast response time
– Simple implementation

12

Endpoint Detection (Filter)

• First, we need a detector (filter) that meets the following general requirements:
– 1) invariant outputs at various background energy levels;
– 2) capability of detecting both beginning and ending points;
– 3) short time delay or look-ahead;
– 4) limited response level;
– 5) maximum output signal-to-noise ratio (SNR) at endpoints;
– 6) accurate location of detected endpoints;
– 7) maximum suppression of false detection.

13

Endpoint Detection (Filter)

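• Log energy feature (dB) of frame t:

\[
g(t) = 10 \log_{10} \sum_{j=n_t}^{n_t+I-1} o(j)^2
\]

• Filter:

\[
f(x) = e^{Ax}\,[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}\,[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx}
\]

\[
h(i) = \{-f(-W \le i \le 0),\ f(1 \le i \le W)\}
\]

where A, s, and K_1, ..., K_6 are parameters, W is the half width of the filter, and i is an integer.

• The filter can then be operated as a moving average filter:

\[
F(t) = \sum_{i=-W}^{W} h(i)\, g(t+i)
\]

where t is the current frame number.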

14

Filter for Both Beginning- and Ending-Edge Detection

• Chosen filter parameters:
– W = 13
– s = 0.5385
– A = 0.2208
– [K_1, ..., K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]

• Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros.

\[
F(t) = \sum_{i=2}^{24} H(i)\, g(t+i-2)
\]

Count 30
Less than 25 points
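The filter can be tried out numerically with the parameter values from this slide. The sketch below rests on several assumptions: f(x) is evaluated only on -W <= x <= 0 (it grows without bound for positive x) and mirrored with opposite sign for i > 0, which is one reading of the h(i) definition; frames beyond the ends of the contour are clamped to the nearest frame; and the names `f`, `build_h`, and `filter_output` are illustrative. It uses the symmetric form F(t) = sum of h(i) g(t+i) over i = -W..W rather than the shifted H(i) indexing.

```python
import math

# Parameter values quoted on the slide.
W = 13
S = 0.5385
A = 0.2208
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]


def f(x):
    """Filter profile f(x); used here only for -W <= x <= 0."""
    k1, k2, k3, k4, k5, k6 = K
    return (math.exp(A * x) * (k1 * math.sin(A * x) + k2 * math.cos(A * x))
            + math.exp(-A * x) * (k3 * math.sin(A * x) + k4 * math.cos(A * x))
            + k5 + k6 * math.exp(S * x))


def build_h(w=W):
    """Return h(i) for i = -w..w, stored at list index i + w.

    Assumed reading: h(i) = -f(i) for i <= 0, and h(i) = f(-i) for i > 0,
    which makes the filter odd-symmetric around i = 0.
    """
    left = [-f(i) for i in range(-w, 1)]         # i = -w .. 0
    right = [f(-i) for i in range(1, w + 1)]     # i = 1 .. w
    return left + right


def filter_output(g, t, h, w=W):
    """F(t) = sum over i = -w..w of h(i) * g(t + i), clamping t + i to the contour."""
    last = len(g) - 1
    return sum(h[i + w] * g[min(max(t + i, 0), last)] for i in range(-w, w + 1))


if __name__ == "__main__":
    h = build_h()
    flat = [50.0] * 60                    # constant background level
    rise = [0.0] * 30 + [40.0] * 30       # step up at frame 30
    fall = [40.0] * 30 + [0.0] * 30       # step down at frame 30
    for name, g in (("flat", flat), ("rise", rise), ("fall", fall)):
        print(name, round(filter_output(g, 30, h), 3))
```

With these values f(0) is zero and f(-W) is close to zero, a constant energy level gives an output of essentially zero (requirement 1 on the earlier slide), and rising and falling edges give responses of opposite sign. Which sign marks a beginning edge depends on the orientation convention, which the extracted slide text does not settle.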

15

Filter for Both Beginning- and Ending-Edge Detection

• The paper chooses the following filter sizes:

Shape of the optimal filter for beginning edge detection, plotted as h(t), with W = 7 and s = 1

Shape of the optimal filter for ending edge detection, plotted as h(t), with W = 35 and s = 0.2

16

Batch-mode Endpoint Detection

Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.

Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line)

17

Batch-mode Endpoint Detection

18

State Transition Diagram

• A three-state transition diagram is used to make final decisions.
– States: silence, in-speech, and leaving-speech.

8 kHz sampling rate

State transition diagram for endpoint decision. (a) energy contour of digit “4” (b) filter outputs and state transitions.
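A three-state decision loop of this kind could look roughly like the sketch below. The slide names only the states, so the transition rules here are assumptions: a filter output at or above an upper threshold starts speech, an output at or below a lower threshold moves to leaving-speech, and a frame counter exceeding a gap limit returns to silence. The function name, argument names, and the numeric values in the usage example are hypothetical.

```python
def detect_endpoints(filter_out, t_upper, t_lower, gap):
    """Return a list of (begin_frame, end_frame) pairs from filter outputs.

    Assumed rules (not taken from the slides):
      silence        -> in-speech       when output >= t_upper
      in-speech      -> leaving-speech  when output <= t_lower
      leaving-speech -> in-speech       when output >= t_upper again
      leaving-speech -> silence         when more than `gap` frames have passed
    """
    state = "silence"
    segments = []
    begin = end = None
    count = 0
    for t, value in enumerate(filter_out):
        if state == "silence":
            if value >= t_upper:
                state, begin = "in-speech", t
        elif state == "in-speech":
            if value <= t_lower:
                state, end, count = "leaving-speech", t, 0
        else:  # leaving-speech
            count += 1
            if value >= t_upper:
                state = "in-speech"            # speech resumed before the gap ran out
            elif count > gap:
                segments.append((begin, end))
                state = "silence"
    # Close any segment still open at the end of the input.
    if state == "in-speech":
        segments.append((begin, len(filter_out) - 1))
    elif state == "leaving-speech":
        segments.append((begin, end))
    return segments


if __name__ == "__main__":
    outputs = [0, 0, 5, 0, 0, -5, 0, 0, 0, 0, 0]
    print(detect_endpoints(outputs, t_upper=3, t_lower=-3, gap=2))
```

The leaving-speech state is what lets a short pause inside an utterance be bridged instead of splitting the utterance in two.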

19

Real-Time Energy Normalization

• The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest value of energy is close to zero:

\[
\tilde{g}(t) = g(t) - g_{\max}
\]

• How to estimate g_max: the look-ahead window is from M to N, as shown in Fig.

\[
\hat{g}_{\max}(t) = \max\{g(t);\ M \le t \le M + 2W\}
\]

• Update \hat{g}_{\max}(t) as

\[
\hat{g}_{\max}(t) = \max\{g(t + 2W),\ \hat{g}_{\max}(t-1)\}, \quad t > M
\]
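The running-maximum estimate can be sketched as follows, assuming g is a list of frame energies in dB, M is the first frame of the look-ahead window, and the look-ahead is 2W frames. The function name and the clipping of the window at the end of the utterance are assumptions, and the sketch leaves out any guard against clicks.

```python
def normalize_energy(g, m, w):
    """Return g(t) - g_max_hat(t) for frames t = m .. len(g) - 1.

    Assumes 0 <= m < len(g). The window [m, m + 2w] is clipped to the utterance.
    """
    n = len(g)
    # Initial estimate: maximum over the look-ahead window starting at frame m.
    g_max = max(g[m:min(m + 2 * w, n - 1) + 1])
    normalized = []
    for t in range(m, n):
        # For t > m, fold in the newest look-ahead frame g(t + 2w) when it exists.
        if t > m and t + 2 * w < n:
            g_max = max(g[t + 2 * w], g_max)
        normalized.append(g[t] - g_max)
    return normalized


if __name__ == "__main__":
    contour = [10.0, 20.0, 30.0, 25.0, 60.0, 40.0]
    print(normalize_energy(contour, m=0, w=1))
```

Because the estimate never decreases, every normalized value is at most zero, and the loudest frame seen so far sits at or near zero.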

20


\[
\bar{g}(t) = E\{g(t);\ M \le t \le M + 2W\} \ge g_m
\]

• But we need g_m to be a pre-selected threshold to ensure that the new g_max is not from a single click.

21


• example

(a) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR).

(b) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

22

Database Evaluation

• The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.

• Baseline Endpoint Detection:
– A six-state transition diagram is used: initializing, silence, rising, energy, fell-rising, and fell states.
– In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition.

23

Database Evaluation

• Noisy Database Evaluation:
– In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz.
– Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB.
– The original database has 39 utterances and 1738 digits in total.
– LPC features and the short-term energy were used, with a hidden Markov model (HMM) for recognition.

24

Database Evaluation

Comparisons on real-time connected digit recognition

(a) utterance in DB5: “1 Z 4 O 5 8 2.”
(b) baseline, recognized as “1 Z 4 O 5 8.”
(c) proposed, recognized as “1 Z 4 O 5 8 2.”
(d) filter output

25

Database Evaluation

• Telephone Database Evaluation:
– The proposed algorithm was further evaluated on 11 databases collected from telephone networks at an 8 kHz sampling rate in various acoustic environments.
– DB1 to DB5 contain digit, alphabet, and word strings.
– DB6 to DB11 contain pure digit strings.
– In the proposed system, we set the parameters as

30)( and 0.3,6.3,60,800 countCapTTgg LUm

26

Database Evaluation

digits, alphabet and word strings

pure digit strings

27

CONCLUSIONS

• Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.
