Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition

Sebastien Coquoz
Swiss Federal Institute of Technology (EPFL)
International Computer Science Institute (ICSI)
March 16, 2004



Outline

• General idea

• ASR system used

• Exploratory work

• Strategies

• Results

• Conclusion


General idea

Use Metadata (SUs) and Speech-To-Text (STT) information to improve later STT passes (feedback loop)


Why segmentation?

Why segment the audio stream?

• Important to give « linguistically coherent » pieces to the language model

• Remove « non-speech » (e.g. long silences, laughter, music, other noises, …)

Why use MDE (metadata extraction)?

• MDE gives information about sentence and speaker breaks

• Speaker labels improve the efficiency of the acoustic model and sentences improve the efficiency of the language model

• BBN’s error analysis of Broadcast News recognition revealed a higher error rate at segment boundaries; this may be caused by missing the true sentence boundaries


Metadata and STT information

MDE object used:

• Sentence-like units (SUs): each SU expresses a thought or idea and generally corresponds to a sentence. Each SU has a confidence measure, timing information (starting point and duration) and a cluster label.

STT object used:

• Lexemes: describe the words that were assumed to be uttered. Each word has timing information (beginning and duration).
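The two objects described above can be sketched as small record types. This is only an illustration: the field names, the use of seconds as the time unit, and the helper `lexemes_in` are assumptions made for the example, not the format actually produced by the MDE or STT systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceUnit:
    """A sentence-like unit (SU) from metadata extraction (hypothetical layout)."""
    start: float        # starting point, assumed to be in seconds
    duration: float     # duration, assumed to be in seconds
    confidence: float   # confidence measure attached to the SU
    cluster: str        # cluster label

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass(frozen=True)
class Lexeme:
    """A word hypothesised by the speech-to-text pass (hypothetical layout)."""
    word: str
    start: float        # beginning, assumed to be in seconds
    duration: float     # duration, assumed to be in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def lexemes_in(su, lexemes):
    """Return the lexemes whose midpoint falls inside the given SU."""
    return [w for w in lexemes if su.start <= w.start + w.duration / 2 < su.end]


su = SentenceUnit(start=0.0, duration=2.0, confidence=0.9, cluster="spk1")
words = [Lexeme("hello", 0.1, 0.4), Lexeme("world", 0.6, 0.5), Lexeme("next", 2.1, 0.3)]
print([w.word for w in lexemes_in(su, words)])
```

Assigning words by midpoint is one simple way to attach lexemes to an SU when word and SU boundaries do not line up exactly.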


ASR system used

The system used is a simplified SRI BN evaluation system.

Recognition steps:

1. Segment the waveforms

2. Cluster the segments into « pseudo-speakers »

3. Compute and normalize features (Mel cepstrum)

4. Do first pass recognition with non-crossword acoustic models and bigram language model

5. Generate lattices

6. Expand lattices using 5-gram language model

7. Adapt acoustic models for each « pseudo-speaker »

8. Generate new lattices using the adapted acoustic models

9. Expand new lattices using 5-gram language model

10. Score the resulting hypotheses


Types of segmentation

Baseline

• Classifies frames into « speech » and « non-speech » using a 2-state HMM

• Uses inter-word silences and speaker turns to segment the BN shows

MDE-based

• Uses sentence and speaker breaks to define an initial segmentation

• Further processes the segments using different strategies presented later

Baseline vs. MDE-based segmentation
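As a rough illustration of the speech / non-speech classification used by the baseline, the sketch below decodes a 2-state HMM with the Viterbi algorithm. The single log-energy feature, the Gaussian emission parameters and the self-transition probability are all assumptions chosen for the example; the features and models of the actual system are not described in these slides, and the inter-word silence and speaker-turn logic is not covered here.

```python
import math


def viterbi_speech_nonspeech(log_energy, stay_prob=0.95, means=(-2.0, 2.0), std=1.0):
    """Label each frame 0 (non-speech) or 1 (speech) with a 2-state HMM.

    Emissions are Gaussians over one log-energy value per frame.
    All parameter values are illustrative assumptions.
    """
    if not log_energy:
        return []

    def log_gauss(x, mu):
        return -0.5 * ((x - mu) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    # score[s] = best log-probability of any state sequence ending in state s
    score = [math.log(0.5) + log_gauss(log_energy[0], means[s]) for s in (0, 1)]
    backptr = []
    for x in log_energy[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + log_gauss(x, means[s]))
        score = new_score
        backptr.append(ptr)

    # Trace back the best path from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def speech_runs(path):
    """Turn frame labels into (first_frame, end_frame_exclusive) speech runs."""
    runs, start = [], None
    for i, label in enumerate(path + [0]):   # sentinel closes a trailing run
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            runs.append((start, i))
            start = None
    return runs


frames = [-2.0, -2.0, -2.0, 2.0, 2.0, 2.0, 2.0, -2.0, -2.0]
print(speech_runs(viterbi_speech_nonspeech(frames)))
```

The high self-transition probability is what makes the HMM preferable to a per-frame threshold: an isolated frame that looks slightly speech-like is not enough to open a new segment.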


Baseline experiments

Comments:

• The baseline segmentation is the one presented above

• The results obtained (shown later) are:

• the current best results

• the baselines that ultimately have to be improved upon

• No additional processing step is applied to modify the segments


« Cheating » experiments (1)

Why?

• See if there is room for improvement when using MDE-based segmentation

How?

• Use transcripts written by humans to segment the Broadcast News audio stream and apply processing strategies to improve recognition (i.e. use true information)


« Cheating » experiments (2)

Results: Baseline vs. « Cheating » experiments

Segmentation                 WER (%), wtd avg on 6 shows
Baseline seg                 14.0
Cheating seg (using SU)      14.2
Cheating seg (SU + proc)     13.0

There is room for improvement!


Overview of the processing steps

Broadcast News Shows

0. Segmentation using SUs

1. First strategy: splitting of long segments

2. Second strategy: concatenation of short segments

3. Third strategy: addition of time pads

Final segmentation


First strategy: splitting of long segments

Why?

• Overly long segments may cover more than one sentence, which is confusing for the language model

How?

• Use automatically generated transcripts and MDE

• Segments that are already too short must not be processed: this would be bad for the efficiency of the language model

• Take two features into account for the decision tree:

• The duration of segments

• The pause between words
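A minimal sketch of how the two features above could drive a splitting rule. The slides use a decision tree whose structure and thresholds are not given, so the simple threshold cascade, the function name `split_long_segment` and every numeric value below are hypothetical.

```python
def split_long_segment(words, max_duration=10.0, min_pause=0.3, min_duration=2.0):
    """Recursively split a segment at its longest inter-word pause.

    `words` is a time-ordered list of (word, start, duration) tuples.
    All thresholds (in seconds) are hypothetical, not tuned values.
    """
    seg_start = words[0][1]
    seg_end = words[-1][1] + words[-1][2]
    # Feature 1: duration of the segment. Short enough segments are left alone.
    if seg_end - seg_start <= max_duration or len(words) < 2:
        return [words]

    # Feature 2: pause between words. Keep the longest pause that leaves
    # both halves at least `min_duration` long.
    best_i, best_pause = None, min_pause
    for i in range(1, len(words)):
        prev_end = words[i - 1][1] + words[i - 1][2]
        pause = words[i][1] - prev_end
        left = prev_end - seg_start
        right = seg_end - words[i][1]
        if pause >= best_pause and left >= min_duration and right >= min_duration:
            best_i, best_pause = i, pause
    if best_i is None:
        return [words]          # no acceptable split point
    return (split_long_segment(words[:best_i], max_duration, min_pause, min_duration)
            + split_long_segment(words[best_i:], max_duration, min_pause, min_duration))


# Twelve 1-second words separated by 0.1 s, with one 0.8 s pause in the middle.
words = [(f"w{i}", i * 1.1 + (0.7 if i >= 6 else 0.0), 1.0) for i in range(12)]
print([len(part) for part in split_long_segment(words)])
```

Requiring both halves to stay above a minimum duration reflects the constraint that splitting must not create segments that are too short for the language model.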


Second strategy: concatenation of short segments

Why?

• Short segments are not optimal for the language model

• Short segments increase the WER because all their words are close to the boundaries (cf. BBN’s error analysis)

How?

• Take three features into account for the decision tree:

• Pause between segments

• Sum of the durations of two neighboring segments

• Cluster label
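The three features above could be combined as in the sketch below. It is a greedy left-to-right rule standing in for the decision tree, whose actual structure is not given in the slides; the function name and both thresholds are hypothetical.

```python
def merge_short_segments(segments, max_pause=0.5, max_total=8.0):
    """Greedily merge neighbouring segments, left to right.

    `segments` is a time-ordered list of (start, end, cluster) tuples.
    Two neighbours are merged when they share a cluster label, the pause
    between them is short, and their summed duration stays small.
    Thresholds (in seconds) are hypothetical.
    """
    if not segments:
        return []
    merged = [segments[0]]
    for start, end, cluster in segments[1:]:
        prev_start, prev_end, prev_cluster = merged[-1]
        pause = start - prev_end
        total = (prev_end - prev_start) + (end - start)
        if cluster == prev_cluster and pause <= max_pause and total <= max_total:
            merged[-1] = (prev_start, end, cluster)   # extend the previous segment
        else:
            merged.append((start, end, cluster))
    return merged


segments = [(0.0, 1.0, "A"), (1.2, 2.0, "A"), (2.1, 3.0, "B"), (5.0, 6.0, "B")]
print(merge_short_segments(segments))
```

In this example only the first two segments are merged: the third has a different cluster label, and the fourth is separated from it by a long pause.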


Third strategy: Addition of time pads

Why?

• Prevent words from only being partially included

• The windowing in the front end has a scope of up to 8 frames (4 on each side), so it is better to have enough padding

How?

• Take one feature into account for the decision tree:

• The pause between segments
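One way to use the pause between segments when padding is to cap each pad at half of the pause, so that neighbouring segments never overlap. This is a sketch under that assumption; the pad size and the half-pause rule are hypothetical and are not taken from the slides.

```python
def add_time_pads(segments, max_pad=0.2):
    """Extend each (start, end) segment on both sides without overlapping.

    Each side gets at most `max_pad` seconds, and never more than half of
    the pause separating the segment from its neighbour.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        if i > 0:
            left_pause = max(start - segments[i - 1][1], 0.0)
            left = min(max_pad, left_pause / 2)
        else:
            left = min(max_pad, start)            # do not go before time 0
        if i + 1 < len(segments):
            right_pause = max(segments[i + 1][0] - end, 0.0)
            right = min(max_pad, right_pause / 2)
        else:
            right = max_pad                       # end of show not known here
        padded.append((start - left, end + right))
    return padded


segments = [(1.0, 2.0), (2.2, 3.0), (4.0, 5.0)]
for start, end in add_time_pads(segments):
    print(f"{start:.2f} - {end:.2f}")
```

With these inputs the 0.2 s pause between the first two segments is shared equally (0.1 s each), while the longer pauses receive the full pad.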


Examples of improvements (1)

1) Real sentence:

… and strictly limits state authority over how and when water is used …

Recognized sentence with baseline segmentation (cuts in the middle of the sentence):

… and stricter limits data arty over how and when watery hues …

Recognized sentence with MDE-based segmentation:

… and strict_ limits state authority over how and when water issues …

[Figure: each sentence is drawn along a time axis with the segmentation points marked; errors are shown in red]
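To make the difference in this example concrete, the sketch below counts word errors (substitutions, deletions and insertions) with a standard word-level edit distance. It is applied only to the fragment quoted on this slide, so the percentages say nothing about the WER of a whole show; writing « strict_ » as the plain word "strict" is an assumption about what the underscore stands for.

```python
def word_errors(reference, hypothesis):
    """Return (minimum number of word errors, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1], len(ref)


ref = "and strictly limits state authority over how and when water is used"
baseline = "and stricter limits data arty over how and when watery hues"
mde = "and strict limits state authority over how and when water issues"

for name, hyp in (("baseline", baseline), ("MDE-based", mde)):
    errors, n = word_errors(ref, hyp)
    print(f"{name}: {errors} errors / {n} words = {100 * errors / n:.0f}% WER")
# baseline: 6 errors / 12 words = 50% WER
# MDE-based: 3 errors / 12 words = 25% WER
```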


Examples of improvements (2)

2) Real sentence:

… I didn’t know if we would pull off the games. I didn’t know if this community would ever rally around the Olympics again. …

Recognized sentence with baseline segmentation (doesn’t cut at the end of the sentence):

… pull off the games that had not this community would ever rally around …

Recognized sentence with MDE-based segmentation:

… pull off the game_ I didn’t know _ this community would ever rally around …


Results for the development set

Segmentation                 WER (%), wtd avg on 6 shows
Baseline seg                 14.0
Step 0: SU seg               14.4
SU seg + step 1              14.2
SU seg + steps 1 & 2         14.0
SU seg + steps 1 & 2 & 3     13.3

The improvement is 0.7% absolute and 5% relative!
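The absolute and relative figures follow directly from the development-set WERs reported on this slide. A short worked check, using only those numbers (the helper name `improvement` is just for illustration):

```python
def improvement(before, after):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = before - after
    return absolute, 100.0 * absolute / before


for label, before, after in [
    ("baseline seg -> final seg", 14.0, 13.3),
    ("step 0 (SU seg) -> final seg", 14.4, 13.3),
]:
    absolute, relative = improvement(before, after)
    print(f"{label}: {absolute:.1f}% absolute, {relative:.1f}% relative")
# baseline seg -> final seg: 0.7% absolute, 5.0% relative
# step 0 (SU seg) -> final seg: 1.1% absolute, 7.6% relative
```

The relative improvement is the absolute reduction divided by the starting WER, which is why the same absolute gain counts for more on a system that already has a low error rate.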


Results for the evaluation set

Segmentation                 WER (%), wtd avg on 6 shows
Baseline seg                 18.7
Step 0: SU seg               19.8
SU seg + step 1              19.7
SU seg + steps 1 & 2         19.6
SU seg + steps 1 & 2 & 3     18.4

The improvement is 0.3% absolute and 1.6% relative!


Dev results vs. Eval results

Observations:

• No « cheating » information is available for the eval set, so it is unclear how well the SU detection is working

• Improvements from step 0 (SU segmentation) to the final segmentation are similar for the dev set and the eval set: 1.1% absolute (7.6% relative) for the dev set and 1.3% absolute (6.6% relative) for the eval set; the SU information is not optimized for the eval set

• The improvements are quite uneven across shows, which suggests that the strategies are show dependent, not channel dependent


Future work

• Further optimize the thresholds for the three strategies

• Find a representation to choose a specific value of the thresholds for each show individually (i.e. fully adapt the decision trees to each show)

• Use Metadata objects such as the confidence measure of each SU and diarization to further improve the strategies


Conclusion

• Development of a new segmentation method based on Metadata and Speech-To-Text information

• Use features given by MDE and STT information in decision trees for each processing step

• Results indicate the promise of this approach

• There still seems to be room for improvement through further development


Acknowledgments

I would like to thank:

• Prof. Bourlard & Prof. Morgan

• Barbara & Andreas

• Yang

• IM2 for supporting my experience