May 30th, 2006 Speech Group Lunch Talk

Features for Improved Speech Activity Detection for Recognition of Multiparty Meetings

Kofi A. Boakye
International Computer Science Institute



Page 2: Overview

● Background
● Previous work and proposed changes
● HMM segmenter and ASR system
● Features investigated
● Experimental results
● Conclusions

Page 3: Background

● Segmentation of audio into speech/nonspeech is a critical first step in ASR
● Especially true for the Individual Headset Microphone (IHM) condition in meetings
– Issues:
  1) Crosstalk
  2) Breath/contact noise
– Single-channel energy-based methods are ineffective

Page 4: Background

● Initiatives such as AMI, IM2, and the NIST RT evaluations show interest in recognition and understanding of multispeaker meetings

Page 5: Background

● Major source of error for IHM recognition: speech activity detection errors

Page 6: Previous Work

● Previous approach: time-based intersection of two distinct segmenters
  1) HMM-based segmenter with standard cepstral features
  – 12 MFCCs
  – Log-energy
  – First and second differences

Page 7: Previous Work

● Previous approach: time-based intersection of two distinct segmenters
  2) Local-energy detector
  – Generates segments by zero-thresholding a "crosstalk-compensated" energy-like signal

Page 8: Proposed Changes

● Though the intersection approach was effective, it was believed to be limited
– Cross-channel analysis is disjoint from speech activity modeling
– Fixed threshold potentially lacks robustness
– Fails to incorporate other acoustically derived features (e.g., cross-correlation)
● New approach: integrate features directly into the HMM segmenter
– Append features to the cepstral feature vector

Page 9: HMM Segmenter

● Derived from an HMM-based speech recognition system
● Two-class HMM with a three-state phone model
● Multivariate GMM with 256 components
● Segmentation proceeds by repeatedly decoding the waveform with decreasing transition penalties
– Results in segments shorter than 60 s
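The decoding loop described above can be sketched as follows. This is a minimal illustration of the idea, not the ICSI implementation; `decode` is a hypothetical stand-in for the HMM Viterbi decoder.

```python
MAX_SEG_LEN = 60.0  # seconds; the segmenter targets segments below this

def segment(decode, waveform, penalty=100.0, step=10.0):
    """Repeatedly decode with a decreasing speech/nonspeech transition
    penalty until no segment exceeds MAX_SEG_LEN.  `decode` is assumed
    to return a list of (start, end) segment times in seconds."""
    while penalty > 0:
        segments = decode(waveform, penalty)
        if all(end - start < MAX_SEG_LEN for start, end in segments):
            return segments
        penalty -= step  # lower penalty -> more transitions -> shorter segments
    return decode(waveform, 0.0)
```

Lowering the transition penalty makes speech/nonspeech transitions cheaper for the decoder, so it produces more, and therefore shorter, segments.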

Page 10: HMM Segmenter

● Post-processing
– Pad segments by a fixed amount (40 ms) to prevent "clipping" effects
– Merge segments with small separation (< 0.4 s) to "smooth" the segmentation
– Constraints optimized for recognition accuracy and runtime of the segmenter with standard cepstral features
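The post-processing steps above might look like this in outline, a sketch using the stated 40 ms padding and 0.4 s merge threshold; the function name is hypothetical.

```python
def postprocess(segments, pad=0.040, gap=0.4):
    """Pad each (start, end) speech segment by `pad` seconds on both
    sides to avoid clipping word onsets/offsets, then merge segments
    separated by less than `gap` seconds.  Segments are assumed to be
    sorted by start time."""
    padded = [(max(0.0, s - pad), e + pad) for s, e in segments]
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s - merged[-1][1] < gap:
            merged[-1][1] = max(merged[-1][1], e)  # small gap: merge
        else:
            merged.append([s, e])
    return [tuple(seg) for seg in merged]
```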

Page 11: ASR System

● ICSI-SRI RT-05S system used for development and validation experiments
– Multiple decoding passes and front-ends for cross-adaptation and hypothesis refinement
– PLP and MFCC+MLP features
– Features transformed with VTLN and HLDA, along with feature-level constrained MLLR
– Models trained on 2000 hours of telephone data and MAP-adapted to 100 hours of meeting data
– 4-gram LM trained on telephone, meeting transcript, broadcast, and Web data

Page 12: Features: Cross-channel

● Log-Energy Differences (LEDs)
– Log of the ratio of short-time energy between the target and each non-target channel
● Normalized Log-Energy Differences (NLEDs)
– Subtract the minimum frame energy of a channel from all energy values in that channel:

  E_norm,i(n) = E_i(n) - E_min,i

– Addresses significant gain differences
– Largely independent of the amount of speech in the channel
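A minimal sketch of how LED and NLED values could be computed per frame; the helper names are hypothetical and this is not the actual segmenter code.

```python
import numpy as np

def log_energy(frames):
    """Per-frame log short-time energy; frames has shape (n_frames, frame_len)."""
    return np.log((frames ** 2).sum(axis=1) + 1e-10)

def leds(target_logE, other_logEs):
    """Log-Energy Differences: log of the target/non-target energy
    ratio, i.e. the difference of log energies, per non-target channel."""
    return [target_logE - logE for logE in other_logEs]

def nleds(target_logE, other_logEs):
    """Normalized LEDs: first subtract each channel's minimum frame
    energy (E_norm,i(n) = E_i(n) - E_min,i), compensating per-channel
    gain differences."""
    return [(target_logE - target_logE.min()) - (logE - logE.min())
            for logE in other_logEs]
```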

Page 13: Features: Cross-channel

● Normalized Maximum Cross-correlation (NMXC)
– Serves as an indicator of crosstalk
– More common cross-channel feature than energy differences

  Γ_ij = max_τ φ_ij(τ) / φ_jj(0)
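The NMXC feature could be computed per frame roughly as follows; this is a sketch with a hypothetical function name, normalizing by the zero-lag autocorrelation (energy) of the non-target channel as in the formula above.

```python
import numpy as np

def nmxc(x_i, x_j):
    """Normalized maximum cross-correlation between a target-channel
    frame x_i and a non-target-channel frame x_j: the maximum over all
    lags of the cross-correlation, divided by the energy of x_j."""
    phi_ij = np.correlate(x_i, x_j, mode="full")  # cross-correlation at all lags
    phi_jj0 = np.dot(x_j, x_j) + 1e-10            # zero-lag autocorrelation of j
    return phi_ij.max() / phi_jj0
```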

Page 14: Features: Cross-channel

● Feature vector length standardization
– For cross-channel features, the number of channels may vary, but the feature vector length must be fixed
– Proposed solution: use order statistics (maximum and minimum) of the feature values generated on the different channels
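The length-standardization idea is simple enough to illustrate directly (hypothetical function name):

```python
def fixed_length(per_channel_values):
    """Collapse a variable-length list of per-channel feature values
    (one per non-target channel) into a fixed-length [max, min] pair,
    making the feature vector length independent of the channel count."""
    return [max(per_channel_values), min(per_channel_values)]
```

The same (max, min) pooling applies to LED, NLED, or NMXC values, so a meeting with 4 channels and one with 10 both contribute exactly two numbers per feature type to the appended feature vector.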

Page 15: Experiments: AMI devtest

● Performance of the features initially investigated on the AMI development set
● Testing
– 12-minute excerpts from 4 meetings
● Training
– First 10 minutes from each of 35 meetings
● "Fast" (two-decoding-pass) version of the recognition system used for quick turnaround

Page 16: Experiments: AMI devtest

● Results

System        Del   Subs  Ins  WER
baseline      17.4  13.0  7.4  37.8
base + LEDs   17.4  12.8  4.5  34.7
base + NLEDs  17.1  12.0  4.4  33.5
base + NMXC   17.4  12.1  4.5  34.1
reference     18.3  10.2  3.4  32.0

[Chart: Recognition Performance, plotting Del, Subs, Ins, and WER for each system above]

● New features give significant improvement over the baseline
– Reduced insertions
● NLEDs give ~1% reduction over LEDs

Page 17: Experiments: Eval04

● Having established the effectiveness of the features, systems were evaluated on the RT-04S set
● Meetings vary in style, number of participants, and room acoustics
● Testing
– 11-minute excerpts from 8 meetings, 2 from each of CMU, ICSI, NIST, and LDC
● Training
– First 10 minutes from each of 15 NIST meetings and 73 ICSI meetings

Page 18: Experiments: Eval04

● Results (WER)

System        ALL   CMU   ICSI  NIST  LDC
baseline      29.6  33.1  23.4  20.0  38.7
intersection  27.9  32.5  21.4  20.2  34.9
base + LEDs   27.3  32.8  20.1  20.0  33.7
base + NLEDs  26.9  32.8  18.5  19.6  34.0
base + NMXC   28.1  31.7  24.9  19.0  33.8
reference     25.1  30.3  18.0  17.0  31.9

[Chart: Recognition Performance, plotting WER by meeting source for each system above]

● Features give improvement over the baseline and the previous system
● NMXC features not as robust
– Removed from consideration for the final SAD system

Page 19: System Validation: Eval05 (and 06)

● Finalized system: HMM segmenter with baseline and NLED features*
● Training
– Union of previous training sets
  ● AMI (35 mtgs), NIST (15 mtgs), ICSI (73 mtgs)
– Baseline and intersection systems used two models (ICSI+NIST and AMI)
– New systems used a single model with pooled data

*Eval06 official submission used LEDs

Page 20: System Validation: Eval05 (and 06)

● Results (WER by meeting source)

Segmenter Method  ALL   AMI   CMU   ICSI  NIST  VT
LEDs              25.6  –     –     –     37.3  –
LEDs + SDM        24.7  22.0  23.5  20.9  33.0  23.8
NLEDs + SDM       22.7  21.9  23.1  20.6  25.2  22.9
Reference         19.5  19.2  19.9  16.8  21.4  20.6

● Using the SDM signal
– Eval05 included a meeting with an unmiked participant
– The SDM served as a "stand-in" mic for that participant
– Including the SDM signal (and energy normalization) improved results by >12% on NIST meetings!
– The SDM signal was not used for eval06 since there were no unmiked speakers

Page 21: System Validation: Eval05 (and 06)

● Results

System        Sub   Del   Ins  WER (eval05)  WER (eval06)
baseline      11.0  10.3  8.0  29.3          –
intersection  11.0  11.5  3.4  25.9          –
LEDs          11.1  10.2  3.3  24.7          24.0
NLEDs         10.9  10.2  1.6  22.7          22.8
Reference     11.2  6.7   1.6  19.5          20.2

● 1.2% gain over last year's segmenter on eval05
● Energy normalization gave an extra 1.2% gain on eval06 and 2.0% on eval05 (due to the unmiked speaker in a NIST meeting)

Page 22: Additional Experiments: MLP Features

● Use the features as inputs to a Multi-Layer Perceptron (MLP) to see if additional gains can be made
● Training
– Inputs consist of baseline and either LED or NLED features (41 components)
– Input context window of 11 frames and 400 hidden units
– 90/10 split for cross-validation
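The 11-frame context windowing can be sketched as follows (hypothetical names; with the 41-component features above, an 11-frame window yields 451-dimensional MLP input vectors):

```python
import numpy as np

def context_windows(feats, context=11):
    """Stack each frame with its +/- (context // 2) neighbors to form
    MLP inputs: (n_frames, dim) -> (n_frames, context * dim).  Edge
    frames are handled by repeating the first/last frame."""
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + context].ravel()
                     for t in range(feats.shape[0])])
```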

Page 23: Additional Experiments: MLP Features

● AMI devtest results

System              WER
base + LEDs         34.7
base + NLEDs        33.5
MLP + LEDs          33.9
MLP + NLEDs         34.4
base + MLP + LEDs   34.7
base + MLP + NLEDs  35.0

[Chart: Recognition Performance, plotting WER for each system above]

● MLP with LEDs better than with NLEDs
● Addition of baseline features degrades performance
● No combination outperforms the NLED features

Page 24: Conclusions

● Integrating cross-channel analysis with speech activity modeling yields large WER reductions
● Simple cross-channel energy-based features perform very well and are more robust than cross-correlation-based features
● Minimum-energy subtraction produces still further gains
● Inclusion of an omnidirectional mic allows crosstalk suppression even for speakers without dedicated microphones
● Still room for improvement, as a significant gap (>2%) exists between automatic and ideal segmentation

Page 25: Fin