Speech recognition in MUMIS Judith Kessens, Mirjam Wester & Helmer Strik.

Post on 31-Mar-2015



Manual transcriptions

• Transcriptions made by SPEX:
  – orthographic transcriptions
  – transcriptions on chunk level (2-3 sec.)

• Formats:
  – *.TextGrid (Praat)
  – XML derivatives:
    • *.pri – no time information
    • *.skp – time information

Manual transcriptions

Total amount of transcribed matches on ftp-site (including the demo matches):

• Dutch: 6 matches

• German: 21 matches

• English: 3 matches

Extensions:

Dutch (_N), German (_G), English (_E)

Automatic speech recognition

1. Acoustic preprocessing

• Acoustic signal features

2. Speech recognition

• Acoustic models

• Language models

• Lexicon

Automatic transcriptions

• Problem of recorded data:
  – Commentaries and stadium noise are mixed
  – Very high noise levels

Recognition of such extremely noisy data is very difficult

Examples of data

Yug-Ned match

• Dutch: “op _t ogenblik wordt in dit stadion de opstelling voorgelezen”

• English: “and they wanna make the change before the corner”

• German: “und die beiden Tore die die Hollaender bekommen hat haben”

Examples of data

Eng-Dld match

• Dutch: “geeft nu een vrije trap in _t voordeel van Ince”

• English: “and phil neville had to really make about three yards to stop <dreisler*u> pulling it down and playing it”

• German: “wurde von allen englischen Zeitungen aus der Mannschaft”

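Evaluation of automatic transcriptions

$$\mathrm{WER}(\%) = \frac{\text{insertions} + \text{deletions} + \text{substitutions}}{\text{number of words}} \times 100$$

WER can be larger than 100%, because insertions count as errors while the denominator is the number of words in the reference transcription.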
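The WER definition above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring tool used in MUMIS (the slides do not name one): the function names `edit_counts` and `wer` are made up for this example, and the recognizer output in the usage lines is hypothetical.

```python
def edit_counts(ref_words, hyp_words):
    """Count (substitutions, deletions, insertions) in a minimum-error word alignment."""
    # Each cell holds (total errors, substitutions, deletions, insertions)
    # for aligning the first i reference words with the first j hypothesis words.
    prev = [(j, 0, 0, j) for j in range(len(hyp_words) + 1)]
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [(i, 0, i, 0)]
        for j, hyp_word in enumerate(hyp_words, start=1):
            total, subs, dels, ins = prev[j - 1]
            if ref_word == hyp_word:
                diagonal = (total, subs, dels, ins)          # correct word
            else:
                diagonal = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = prev[j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = curr[j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare on the total error count first.
            curr.append(min(diagonal, deletion, insertion))
        prev = curr
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcription is empty")
    subs, dels, ins = edit_counts(ref_words, hyp_words)
    return 100.0 * (subs + dels + ins) / len(ref_words)


# Reference taken from the Eng-Dld example; the hypothesis is invented.
reference = "geeft nu een vrije trap in _t voordeel van Ince"
hypothesis = "geeft nu een vrije trap voordeel van Ince"
print(wer(reference, hypothesis))            # two deletions out of ten words
print(wer("goal", "what a great goal"))      # three insertions, one reference word
```

The second call shows how a WER above 100% arises: three inserted words are scored against a reference of a single word.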

WERs in % (all words)

| Match | Dutch | English | German |
|---|---|---|---|
| Yug-Ned | 84.5 | 84.5 | 77.4 |
| Eng-Dld | 83.2 | 83.3 | 90.8 |

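WERs in % (player names)

| Match | Words scored | Dutch | English | German |
|---|---|---|---|---|
| Yug-Ned | all words | 84.5 | 84.5 | 77.4 |
| Yug-Ned | names | 53.0 | 48.2 | 40.9 |
| Eng-Dld | all words | 83.2 | 83.3 | 90.8 |
| Eng-Dld | names | 55.0 | 56.2 | 77.4 |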

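WERs versus SNR

| Match | Measure | Dutch | English | German |
|---|---|---|---|---|
| Yug-Ned | WER (%) | 84.5 | 84.5 | 77.4 |
| Yug-Ned | SNR (dB) | 9 | 12 | 19 |
| Eng-Dld | WER (%) | 83.2 | 83.3 | 90.8 |
| Eng-Dld | SNR (dB) | 8 | 11 | 7 |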
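The slides do not say how the SNR values were measured. One common approach, shown here purely as an assumption, is to compare the power of chunks containing commentary (speech plus stadium noise) with the power of non-speech chunks (stadium noise only). The function names and the toy sample values are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    if not samples:
        raise ValueError("no samples given")
    return sum(x * x for x in samples) / len(samples)


def snr_db(noisy_speech_samples, noise_samples):
    """Estimate the SNR in dB from noisy speech and noise-only samples.

    Assumes speech and noise are uncorrelated, so the speech power is the
    power of the noisy speech minus the power of the noise.
    """
    noise_power = mean_power(noise_samples)
    speech_power = mean_power(noisy_speech_samples) - noise_power
    if noise_power <= 0 or speech_power <= 0:
        raise ValueError("cannot estimate SNR from these samples")
    return 10.0 * math.log10(speech_power / noise_power)


# Toy values: noisy speech power 11, noise power 1, so speech power 10.
print(snr_db([6, -2, 2, 0], [1, -1, 1, -1]))
```

On this scale, the values between 7 and 19 dB in the table correspond to speech that is roughly 5 to 80 times more powerful than the noise.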

Automatic transcriptions

The language model (LM) and lexicon (lex) are adapted to a specific match

• Start with a general LM and lex

• Add player names of the specific match

• Expand the general LM and lex when more data is available
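The lexicon side of these adaptation steps can be illustrated with a small sketch. It is a simplification under stated assumptions: the lexicon is reduced to a set of word forms (a real recognition lexicon also holds pronunciations), the language model is not modelled at all, and `adapt_lexicon`, `oov_rate` and the tiny word list are hypothetical.

```python
def adapt_lexicon(general_lexicon, player_names):
    """Return a match-specific lexicon: the general word list plus the player names."""
    return {w.lower() for w in general_lexicon} | {n.lower() for n in player_names}


def oov_rate(words, lexicon):
    """Percentage of running words that are out of vocabulary (not in the lexicon)."""
    if not words:
        raise ValueError("no words given")
    missing = sum(1 for w in words if w.lower() not in lexicon)
    return 100.0 * missing / len(words)


# Hypothetical general lexicon; the transcript is the Dutch Eng-Dld example.
general = ["geeft", "nu", "een", "vrije", "trap", "in", "_t", "voordeel", "van"]
transcript = "geeft nu een vrije trap in _t voordeel van Ince".split()

print(oov_rate(transcript, adapt_lexicon(general, [])))        # before adding names
print(oov_rate(transcript, adapt_lexicon(general, ["Ince"])))  # after adding names
```

A word that is missing from the lexicon can never be recognized correctly, which is why adding the player names of the specific match matters.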

WERs for various amounts of data

[Figure: WER (%), on an axis from 76 to 96, plotted against the number of words used to train the language model (0 to 250,000). Curves: Yug-Ned (Dutch), lex: 1 CD; Eng-Dld (Dutch), lex: 1 CD; Yug-Ned (German), lex: 1 CD; Yug-Ned (German), lex: 7 CDs; Yug-Ned (German), lex: 19 CDs; Eng-Dld (German), lex: 7 CDs]

Oracle experiments - ICSLP’02

Due to the limited amount of material, we started off with oracle experiments:

• Language models are trained on target match

• Acoustic models are trained on part of target match or other match

Much lower WERs

Summary of results

Acoustic model training:

• Leaving out non-speech chunks does not hurt recognition performance

• Using more training data is beneficial, but it is more important that the SNRs of the training and test data are matched

Summary of results

• WERs are SNR-dependent

[Figure: WER (%) versus SNR (dB, 0 to 20) for Dutch, English and German, tested on the Yug-Ned match]

Summary of results

[Figure: WER (%) per word category (function, content, names, all) for Dutch, English and German, tested on the Yug-Ned match]

Split words into categories, i.e. function words, content words and football players' names:

WER function words > WER content words > WER names
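A per-category breakdown like the one above needs a word-level alignment, so that each reference word can be marked as correct or wrong. The slides do not describe how this scoring was done; the sketch below is one possible approach and makes a simplifying assumption: only substitutions and deletions are attributed to a category, and insertions are ignored because they have no reference word. The function names, the category sets and the recognizer output are hypothetical.

```python
def align(ref, hyp):
    """Minimum edit distance alignment as (ref_word or None, hyp_word or None) pairs."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(diagonal, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the end to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))        # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))        # insertion
            j -= 1
    pairs.reverse()
    return pairs


def category_error_rates(ref, hyp, category_of):
    """Percentage of reference words per category that were substituted or deleted."""
    totals, errors = {}, {}
    for ref_word, hyp_word in align(ref, hyp):
        if ref_word is None:
            continue  # insertion: no reference word to attribute it to
        category = category_of(ref_word)
        totals[category] = totals.get(category, 0) + 1
        if ref_word != hyp_word:
            errors[category] = errors.get(category, 0) + 1
    return {c: 100.0 * errors.get(c, 0) / totals[c] for c in totals}


NAMES = {"phil", "neville"}            # hypothetical name list
FUNCTION_WORDS = {"and", "had", "to"}  # hypothetical function-word list


def category_of(word):
    if word in NAMES:
        return "names"
    return "function" if word in FUNCTION_WORDS else "content"


ref = "and phil neville had to make three yards".split()  # shortened example
hyp = "and phil neville has make three yards".split()     # invented output
print(category_error_rates(ref, hyp, category_of))
```

Because insertions are left out, the numbers from this sketch are per-category error rates over reference words rather than full WERs.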

Summary of results

• Noise reduction tool (FTNR): small improvement

[Figure: WERs (%) with and without FTNR for NL, Eng and Dld]

Ongoing work

Techniques to lower WERs

• Tuning of the generic language model
  – Defining different classes
  – Reduction of OOV words in lexicon and in the language model (using more material)

• Speaker adaptation in HTK (note: all other experiments are being carried out using Phicos)

Ongoing work

Noise robustness

• Extension of the acoustic models by using double deltas.

• Histogram Normalization and FTNR.

• SNR dependent acoustic models.
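Double deltas are second-order time derivatives of the acoustic features, appended to each feature vector together with the first-order deltas. The sketch below uses the widely used regression formula for delta coefficients; the window of 2 frames and the repetition of the first and last frame at the edges are assumptions, since the slides give no settings, and the function names are made up for this example.

```python
def deltas(frames, window=2):
    """First-order regression deltas for a list of feature vectors.

    d_t = sum_{n=1..window} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..window} n^2)
    Frames beyond the edges are replaced by the first or last frame.
    """
    denominator = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        vector = []
        for k in range(len(frames[t])):
            numerator = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                numerator += n * (ahead - behind)
            vector.append(numerator / denominator)
        result.append(vector)
    return result


def add_double_deltas(frames, window=2):
    """Append deltas and double deltas (deltas of the deltas) to each frame."""
    first_order = deltas(frames, window)
    second_order = deltas(first_order, window)
    return [f + d + dd for f, d, dd in zip(frames, first_order, second_order)]


# Toy one-dimensional feature that rises linearly over seven frames.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
extended = add_double_deltas(frames)
print(extended[3])  # middle frame: static value, slope, and curvature
```

For a linearly rising feature, the middle frame gets a delta of 1 and a double delta of 0, which matches the intuition of slope and curvature.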

Recommendations

Acoustic modeling

• Record commentaries and stadium noise separately

• Speaker adaptation:

- Transcribe characteristics of commentator

- Collect more speech data of commentator

Recommendations

Lexicon and language modeling

• Collect orthographic transcriptions of spoken material, instead of written material

- Subtitles

- Closed captions
