Speech recognition in MUMIS Judith Kessens, Mirjam Wester & Helmer Strik


Mar 31, 2015


Reid Bonfield

Speech recognition in MUMIS

Judith Kessens, Mirjam Wester & Helmer Strik


Manual transcriptions

• Transcriptions made by SPEX:
  – orthographic transcriptions
  – transcriptions on chunk level (2-3 sec.)

• Formats:
  – *.Textgrid (Praat)
  – XML derivatives:
    • *.pri – no time information
    • *.skp – time information


Manual transcriptions

Total number of transcribed matches on the FTP site (including the demo matches):

• Dutch: 6 matches

• German: 21 matches

• English: 3 matches

Extensions:

Dutch (_N), German (_G), English (_E)


Automatic speech recognition

1. Acoustic preprocessing

• Acoustic signal features

2. Speech recognition

• Acoustic models

• Language models

• Lexicon


Automatic transcriptions

• Problem of recorded data:

  – Commentaries and stadium noise are mixed
  – Very high noise levels

Recognition of such extremely noisy data is very difficult


Examples of data

Yug-Ned match

• Dutch: “op _t ogenblik wordt in dit stadion de opstelling voorgelezen”

• English: “and they wanna make the change before the corner”

• German: “und die beiden Tore die die Hollaender bekommen hat haben”


Examples of data

Eng-Dld match

• Dutch: “geeft nu een vrije trap in _t voordeel van Ince”

• English: “and phil neville had to really make about three yards to stop <dreisler*u> pulling it down and playing it”

• German: “wurde von allen englischen Zeitungen aus der Mannschaft”


Evaluation of automatic transcriptions

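$$\mathrm{WER}(\%) = \frac{\text{insertions} + \text{deletions} + \text{substitutions}}{\text{number of words}} \times 100$$

Here "number of words" is the number of words in the reference (manual) transcription.

WER can be larger than 100%, because insertions count as errors and the recognizer can output more words than the reference contains.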
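The WER definition can be sketched in a few lines of Python. This is only an illustration, not the scoring tool used in the project: the function name `word_error_rate` is hypothetical, and the sketch assumes whitespace-separated words and a standard word-level edit distance in which insertions, deletions and substitutions each cost 1.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent, based on word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum number of edits needed to turn the reference
    # words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Four inserted words against a two-word reference give a WER above 100%.
print(word_error_rate("vrije trap", "een vrije trap voor Ince nu"))
```

The last call shows why the WER is not bounded by 100%: the errors are divided by the length of the reference, not by the length of the recognizer output.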


WERs (all words)

WER (%)   Dutch   English   German
Yug-Ned   84.5    84.5      77.4
Eng-Dld   83.2    83.3      90.8


WERs (player names)

WER (%)               Dutch   English   German
Yug-Ned   all words   84.5    84.5      77.4
          names       53.0    48.2      40.9
Eng-Dld   all words   83.2    83.3      90.8
          names       55.0    56.2      77.4


WERs versus SNR

                      Dutch   English   German
Yug-Ned   WER (%)     84.5    84.5      77.4
          SNR (dB)    9       12        19
Eng-Dld   WER (%)     83.2    83.3      90.8
          SNR (dB)    8       11        7
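As a rough check of the trend in these six WER/SNR pairs, the sketch below computes a Pearson correlation coefficient. The helper name `pearson` is illustrative, and six data points are far too few for firm conclusions; the sketch only shows how the relation could be quantified.

```python
from math import sqrt

# (SNR in dB, WER in %) pairs copied from the table above.
measurements = {
    ("Yug-Ned", "Dutch"): (9, 84.5),
    ("Yug-Ned", "English"): (12, 84.5),
    ("Yug-Ned", "German"): (19, 77.4),
    ("Eng-Dld", "Dutch"): (8, 83.2),
    ("Eng-Dld", "English"): (11, 83.3),
    ("Eng-Dld", "German"): (7, 90.8),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


snrs = [snr for snr, _ in measurements.values()]
wers = [wer for _, wer in measurements.values()]
r = pearson(snrs, wers)
print(f"correlation between SNR and WER: {r:.2f}")
```

The coefficient comes out clearly negative: in this data, higher SNR goes together with lower WER.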


Automatic transcriptions

The language model (LM) and lexicon (lex) are adapted to a specific match

• Start with a general LM and lex

• Add player names of the specific match

• Expand the general LM and lex when more data is available
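The lexicon part of this adaptation can be illustrated with a small out-of-vocabulary (OOV) calculation. This is a minimal sketch under stated assumptions: the function `oov_rate` and the toy word lists are hypothetical, the lexicon is reduced to a plain set of lower-cased word forms, and the language model update itself is not shown.

```python
def oov_rate(words, lexicon):
    """Percentage of word tokens that are not covered by the lexicon."""
    if not words:
        raise ValueError("need at least one word")
    missing = sum(1 for word in words if word.lower() not in lexicon)
    return 100.0 * missing / len(words)


# Toy general lexicon (illustrative only, not the project lexicon).
general_lexicon = {"geeft", "nu", "een", "vrije", "trap",
                   "in", "_t", "voordeel", "van"}

# Player names of the specific match are added on top of it.
player_names = {"ince"}
match_lexicon = general_lexicon | player_names

utterance = "geeft nu een vrije trap in _t voordeel van Ince".split()
print(oov_rate(utterance, general_lexicon))  # one of ten tokens is missing
print(oov_rate(utterance, match_lexicon))    # all tokens are covered
```

A word that is missing from the lexicon cannot be recognized at all, which is why adding the names of the players in the match is a cheap first adaptation step.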


WERs for various amounts of data

[Figure: WER (%) as a function of the number of words used to train the language model (0 to 250,000). Curves: Yug-Ned (Dutch), lex: 1 CD; Eng-Dld (Dutch), lex: 1 CD; Yug-Ned (German), lex: 1 CD; Yug-Ned (German), lex: 7 CDs; Yug-Ned (German), lex: 19 CDs; Eng-Dld (German), lex: 7 CDs]


Oracle experiments - ICSLP’02

Due to the limited amount of material, we started off with oracle experiments:

• Language models are trained on the target match

• Acoustic models are trained on part of the target match or on another match

Result: much lower WERs


Summary of results

Acoustic model training:

• Leaving out non-speech chunks does not hurt recognition performance

• Using more training data is beneficial, but more important:

• The SNRs of the training and test data should be matched
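To match the SNRs of training and test data, the SNR of each recording has to be estimated first. The slides do not say how the SNRs were measured, so the following is only a sketch under stated assumptions: noise power is taken from a chunk without speech, speech power is estimated as the power of a speech chunk minus that noise power, and both function names are hypothetical.

```python
from math import log10


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_chunk, noise_chunk):
    """Estimate the SNR in dB from a speech chunk and a non-speech chunk.

    Assumes that speech and noise are uncorrelated, so that the power of
    the speech chunk is the sum of speech power and noise power.
    """
    noise_power = mean_power(noise_chunk)
    speech_power = mean_power(speech_chunk) - noise_power
    if noise_power <= 0 or speech_power <= 0:
        raise ValueError("cannot estimate SNR from these chunks")
    return 10.0 * log10(speech_power / noise_power)


# Toy samples: noise power 1, speech-plus-noise power 11.
noise = [1.0, -1.0, 1.0, -1.0]
speech_plus_noise = [6.0, -2.0, 2.0, 0.0]
print(snr_db(speech_plus_noise, noise))
```

With such an estimate per match, training material could be selected so that its SNR lies close to that of the test match.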


Summary of results

• WERs are SNR-dependent

[Figure: WER (%) versus SNR (dB, 0 to 20) for Dutch, English and German; tested on the Yug-Ned match]


Summary of results

Split words into categories, i.e. function words, content words and football players’ names:

WER function words > WER content words > WER names

[Figure: WER (%) per word category (function, content, names, all) for Dutch, English and German; tested on the Yug-Ned match]


Summary of results

• Noise reduction tool (FTNR): small improvement

[Figure: WERs (%) with and without FTNR for Dutch (NL), English (Eng) and German (Dld)]


Ongoing work

Techniques to lower WERs

• Tuning of the generic language model
  – Defining different classes
  – Reduction of OOV words in the lexicon and in the language model (using more material)

• Speaker adaptation in HTK (note: all other experiments are being carried out using Phicos)


Ongoing work

Noise robustness

• Extension of the acoustic models by using double deltas.

• Histogram Normalization and FTNR.

• SNR-dependent acoustic models.
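The first item, double deltas, refers to appending second-order time derivatives to the acoustic feature vectors. The sketch below uses the common regression formula for delta coefficients and applies it twice. The window size of 2, the edge handling (repeating the first and last frame) and the function names are assumptions, not details taken from the slides.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients for a list of feature vectors.

    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    Frames beyond the edges are replaced by the first or last frame.
    """
    n_frames = len(frames)
    denominator = 2 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(n_frames):
        vector = []
        for k in range(len(frames[t])):
            numerator = 0.0
            for n in range(1, window + 1):
                after = frames[min(t + n, n_frames - 1)][k]
                before = frames[max(t - n, 0)][k]
                numerator += n * (after - before)
            vector.append(numerator / denominator)
        result.append(vector)
    return result


def add_dynamic_features(frames, window=2):
    """Append deltas and double deltas to every static feature vector."""
    first_order = deltas(frames, window)
    second_order = deltas(first_order, window)  # deltas of the deltas
    return [static + d + dd
            for static, d, dd in zip(frames, first_order, second_order)]


# A one-dimensional feature that rises by 1 per frame.
ramp = [[float(t)] for t in range(12)]
features = add_dynamic_features(ramp)
print(features[5])  # static value, delta, double delta for frame 5
```

For the linearly rising toy feature, the delta in the middle of the sequence equals the slope and the double delta is zero, which is the expected behaviour of first- and second-order derivatives.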


Recommendations

Acoustic modeling

• Record commentaries and stadium noise separately

• Speaker adaptation:

- Transcribe characteristics of commentator

- Collect more speech data of commentator


Recommendations

Lexicon and language modeling

• Collect orthographic transcriptions of spoken material, instead of written material

- Subtitles

- Closed captions