Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Jan 02, 2016

Page 1: Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE

Presented by Chen Hung_Bin

Page 2

Outline

• Introduction to endpoint detection
• Endpoint detection methods
• Endpoint detection (Filter)
• State Transition
• Experiment

Page 3

Introduction

• The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection.

• This paper addresses endpoint detection with sequential and batch-mode processes to support real-time recognition.
– sequential: used in automatic speech recognition (ASR)
– batch-mode: used where utterances are usually as short as a few seconds and the delay in response is usually small

Page 4

Introduction

• Endpoint detection methods include:
– energy threshold
– pitch detection
– spectrum analysis
– cepstral analysis
– zero-crossing rate
– periodicity measure
– chi-square test
– entropy
– hybrid detection

Page 5

Introduction

• energy

$$\text{magnitude} = \sum_{i=1}^{N} |x[i]|$$

$$\text{energy} = \sum_{i=1}^{N} x[i]^2$$

But frequently, the value of e(t) is measured in dB:

$$\text{dB} = 10 \log_{10} \sum_{i=1}^{N} x[i]^2$$
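The three frame measures on this slide (sum of absolute sample values, sum of squared samples, and the squared sum expressed in dB) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `frame_measures` and the small `floor` that guards against taking the log of zero are assumptions added here.

```python
import math

def frame_measures(x, floor=1e-12):
    """Return (magnitude, energy, energy_db) for one frame of samples x.

    floor is a hypothetical guard so that an all-zero frame does not
    produce log10(0); it is not part of the slide's formulas.
    """
    magnitude = sum(abs(v) for v in x)       # sum of |x[i]|
    energy = sum(v * v for v in x)           # sum of x[i]^2
    energy_db = 10.0 * math.log10(max(energy, floor))
    return magnitude, energy, energy_db

frame = [0.5, -0.5, 1.0, -1.0]
magnitude, energy, energy_db = frame_measures(frame)
print(magnitude, energy, round(energy_db, 3))
```

The dB form compresses the dynamic range of the energy, which is why it is the usual input to threshold-based detectors.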

Page 6

Introduction

• A Mandarin digit “eight.”

• spectrum

Page 7

Introduction

• zero-crossing rate

Page 8

Introduction

• The chi-square test is given by

$$\chi^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i}$$

• The hypothesis test can thus be written as

$$\chi^2 \underset{H_0}{\overset{H_1}{\gtrless}} \text{threshold}$$
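As a rough sketch of the chi-square test above: the statistic sums (o_i - e_i)^2 / e_i over N bins of observed and expected values and is then compared with a threshold. The function names, the example numbers, and the convention that exceeding the threshold selects H1 are assumptions made for illustration.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over bins of (o_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def decide(observed, expected, threshold):
    """Return "H1" if the statistic exceeds the threshold, else "H0".

    Which hypothesis corresponds to speech is not stated on the slide;
    this orientation is an assumption.
    """
    return "H1" if chi_square(observed, expected) > threshold else "H0"

observed = [10.0, 20.0, 30.0]   # illustrative values only
expected = [15.0, 20.0, 25.0]
stat = chi_square(observed, expected)
print(round(stat, 4), decide(observed, expected, threshold=2.0))
```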

Page 9

Introduction

• entropy

$$H(x) = \sum_{k} P(x_k) \log \frac{1}{P(x_k)}$$
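A small sketch of the entropy measure, H(x) = sum over k of P(x_k) log(1 / P(x_k)). The natural logarithm and the rule of skipping zero-probability bins are assumptions; the slide does not state either.

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution, in nats (natural log assumed).

    Bins with zero probability are skipped, since p * log(1/p) -> 0 as p -> 0.
    """
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

flat = entropy([0.25, 0.25, 0.25, 0.25])   # spread-out distribution
peaked = entropy([1.0, 0.0, 0.0, 0.0])     # concentrated distribution
print(round(flat, 4), peaked)
```

A flat distribution gives the largest entropy and a concentrated one gives zero, which is the contrast an entropy-based detector relies on.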

Page 10

Introduction

• Endpoint detection is crucial to both accuracy and speed for several reasons:

– It is hard to model noise and silence accurately in changing environments.

– If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech.

– Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy.

Page 11

Introduction

• This study points out:
– The more accurately we can detect endpoints, the better we can do on real-time energy normalization.

• Requirements:
– Accurate location of detected endpoints
– Robust detection at various noise levels
– Low computational complexity
– Fast response time
– Simple implementation

Page 12

Endpoint Detection (Filter)

• First, we need a detector (filter) that meets the following general requirements:
– 1) invariant outputs at various background energy levels;
– 2) capability of detecting both beginning and ending points;
– 3) short time delay or look-ahead;
– 4) limited response level;
– 5) maximum output signal-to-noise ratio (SNR) at endpoints;
– 6) accurate location of detected endpoints;
– 7) maximum suppression of false detection.

Page 13

Endpoint Detection (Filter)

$$f(x) = e^{Ax}\,[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}\,[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx}$$

$$h(i) = \{-f(-w \le i \le 0),\ f(1 \le i \le w)\}$$

where A, s, and K_1, …, K_6 are parameters, w is the half width of the filter, and i is an integer.

g(t) is the energy feature (dB):

$$g(t) = 10 \log_{10} \sum_{j=n_t}^{n_t+I-1} o(j)^2$$

The filter can then be operated as a moving average filter:

$$F(t) = \sum_{i=-w}^{w} h(i)\, g(t+i)$$

where t is the current frame number.
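The moving-average operation, F(t) = sum over i from -w to w of h(i) g(t+i), can be sketched as below. The taps used here are a hypothetical antisymmetric stand-in chosen only to make the behaviour visible; they are not the optimal filter from the paper. Clamping indices at the ends of the contour is also an assumption.

```python
def edge_filter_output(g, h, w):
    """Compute F(t) = sum_{i=-w}^{w} h(i) * g(t + i) for every frame t.

    h is a list of 2w+1 taps with h[i + w] holding h(i). Indices outside
    the contour are clamped to the first/last frame (an assumption).
    """
    if len(h) != 2 * w + 1:
        raise ValueError("h must have 2*w + 1 taps")
    n = len(g)
    out = []
    for t in range(n):
        acc = 0.0
        for i in range(-w, w + 1):
            k = min(max(t + i, 0), n - 1)
            acc += h[i + w] * g[k]
        out.append(acc)
    return out

w = 2
h = [-0.5, -0.5, 0.0, 0.5, 0.5]                # hypothetical taps, sum to zero
g = [20.0] * 10 + [60.0] * 10 + [20.0] * 10    # toy energy contour in dB
F = edge_filter_output(g, h, w)
print(max(F), F.index(max(F)), min(F))
```

Because the taps sum to zero, a constant background level produces zero output regardless of its height, while a rising edge gives a positive peak and a falling edge a negative one.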

Page 14

Filter for Both Beginning- and Ending-Edge Detection

• Choose the filter size
– W = 13
– s = 0.5385
– A = 0.2208
– [K_1 … K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]

• Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros.

$$F(t) = \sum_{i=2}^{24} H(i)\, g(t+i-2)$$

Count 30
Less than 25 points
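As a quick numerical check of these parameters, the sketch below evaluates the filter profile f(x) = e^{Ax}[K1 sin(Ax) + K2 cos(Ax)] + e^{-Ax}[K3 sin(Ax) + K4 cos(Ax)] + K5 + K6 e^{sx} on the half interval from -W to 0. That closed form is the reading of the garbled slide equation assumed here. The code only evaluates f; it does not assemble the full filter h(i).

```python
import math

# Parameter values as listed on the slide.
W = 13
S = 0.5385
A = 0.2208
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]

def f(x):
    """Filter profile f(x), assumed form (see the note above the code)."""
    k1, k2, k3, k4, k5, k6 = K
    return (math.exp(A * x) * (k1 * math.sin(A * x) + k2 * math.cos(A * x))
            + math.exp(-A * x) * (k3 * math.sin(A * x) + k4 * math.cos(A * x))
            + k5 + k6 * math.exp(S * x))

# Sample the profile on the half interval -W..0.
profile = [f(x) for x in range(-W, 1)]
for x, value in zip(range(-W, 1), profile):
    print(x, round(value, 4))
```

With these values the profile is essentially zero at both ends of the half interval, in line with the slide's remark that the end taps are zeros.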

Page 15

Filter for Both Beginning- and Ending-Edge Detection

• Filter sizes chosen in this paper:

Shape of the optimal filter for beginning edge detection, plotted as h(t), with W = 7 and s = 1.

Shape of the optimal filter for ending edge detection, plotted as h(t), with W = 35 and s = 0.2.

Page 16

Batch-mode Endpoint Detection

Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.

Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line)

Page 17

Batch-mode Endpoint Detection

Page 18

State Transition Diagram

• A three-state transition diagram is used to make final decisions.
– States: silence, in-speech, and leaving-speech.

8 kHz sampling rate

State transition diagram for endpoint decision. (a) energy contour of digit “4” (b) filter outputs and state transitions.
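A three-state decision of this kind can be sketched as a small state machine driven by the filter output F(t). The slide names the states but not the transition rules, so everything below is assumed for illustration: the rules themselves, the names `t_upper`, `t_lower` and `gap`, and the choice to report the frame where the lower threshold was crossed as the ending point.

```python
def detect_endpoints(F, t_upper, t_lower, gap):
    """Return a list of (begin_frame, end_frame) pairs from filter output F.

    Assumed rules (not taken from the slides):
      silence        -> in-speech       when F(t) >= t_upper
      in-speech      -> leaving-speech  when F(t) <= t_lower
      leaving-speech -> in-speech       when F(t) >= t_upper again
      leaving-speech -> silence         after `gap` frames without resuming
    """
    state, count = "silence", 0
    begin, leave = None, None
    segments = []
    for t, value in enumerate(F):
        if state == "silence":
            if value >= t_upper:
                state, begin = "in-speech", t
        elif state == "in-speech":
            if value <= t_lower:
                state, count, leave = "leaving-speech", 0, t
        else:  # leaving-speech
            if value >= t_upper:
                state = "in-speech"          # speech resumed
            else:
                count += 1
                if count >= gap:
                    segments.append((begin, leave))
                    state = "silence"
    if state != "silence":                   # input ended mid-utterance
        end = leave if state == "leaving-speech" else len(F) - 1
        segments.append((begin, end))
    return segments

F = [0, 0, 5, 0, 0, 0, -5, 0, 0, 0, 0, 0]   # toy filter output
print(detect_endpoints(F, t_upper=3, t_lower=-3, gap=3))
```

The leaving-speech state is what lets a short pause inside an utterance be bridged rather than ending the segment immediately.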

Page 19

Real-Time Energy Normalization

• The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest value of energy is close to zero:

$$\tilde{g}(t) = g(t) - g_{\max}$$

• How to estimate g_max:

$$\hat{g}_{\max}(t) = \max\{g(t);\ M \le t \le M + 2W\}$$

The look-ahead window is from M to N, as shown in Fig.

• Update ĝ_max(t) as

$$\hat{g}_{\max}(t) = \max\{g(t + 2W),\ \hat{g}_{\max}(t-1)\}; \quad \forall\, t > M$$

Page 20

Real-Time Energy Normalization

But we need g_m to be a pre-selected threshold to ensure that the new ĝ_max is not from a single click:

$$\bar{g}(t) = E\{g(t);\ M \le t \le M + 2W\} \ge g_m$$
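The running-maximum scheme can be sketched as follows: start from the maximum of g over the look-ahead window beginning at frame M, let later frames raise the estimate only through the newest look-ahead frame g(t + 2W), and subtract the current estimate from g(t). Accepting a new maximum only when the mean energy of the current window reaches g_m is one reading of the click guard. The sliding placement of that window and all names below are assumptions.

```python
def normalize_energy(g, M, W, g_m):
    """Return g(t) - g_max_hat(t) for frames t = M .. len(g) - 1.

    g_max_hat starts as max{g(t); M <= t <= M + 2W} and is raised to
    g(t + 2W) for t > M only if the mean of g over [t, t + 2W] is at
    least g_m (assumed form of the single-click guard).
    """
    g_max = max(g[M:M + 2 * W + 1])
    out = []
    for t in range(M, len(g)):
        ahead = t + 2 * W
        if t > M and ahead < len(g):
            window = g[t:ahead + 1]
            mean = sum(window) / len(window)
            if g[ahead] > g_max and mean >= g_m:
                g_max = g[ahead]
        out.append(g[t] - g_max)
    return out

g = [30, 30, 30, 30, 30, 50, 50, 70, 70, 70, 30, 30]   # toy contour in dB
normalized = normalize_energy(g, M=0, W=1, g_m=40)
print(normalized)
```

Once the loudest frames have entered the look-ahead window, the largest normalized value sits at zero, as the slide describes. An isolated spike does not move the estimate as long as the window mean stays below g_m.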

Page 21

Real-Time Energy Normalization

• example

(a) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR).

(b) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

Page 22

Database Evaluation

• The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.

• Baseline endpoint detection:
– A six-state transition diagram is used, with initializing, silence, rising, energy, fell-rising, and fell states.
– In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition.

Page 23

Database Evaluation

• Noisy database evaluation:
– In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to an 8 kHz sampling rate.
– Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB.
– The original database has 39 utterances and 1738 digits in total.
– LPC features and the short-term energy were used, with a hidden Markov model (HMM) for recognition.

Page 24

Database Evaluation

Comparisons on real-time connected digit recognition

(a) utterance in DB5: “1 Z 4 O 5 8 2.”
(b) baseline, recognized as “1 Z 4 O 5 8.”
(c) proposed, recognized as “1 Z 4 O 5 8 2.”
(d) filter output

Page 25

Database Evaluation

• Telephone database evaluation:
– The proposed algorithm was further evaluated on 11 databases collected from telephone networks with 8 kHz sampling rates in various acoustic environments.
– DB1 to DB5 contain digits, alphabet and word strings.
– DB6 to DB11 contain pure digit strings.
– In the proposed system, we set the parameters as

30)( and 0.3,6.3,60,800 countCapTTgg LUm

Page 26

Database Evaluation

digits, alphabet and word strings

pure digit strings

Page 27

CONCLUSIONS

• Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.