TT Centre for Speech Technology Early error detection on word level Gabriel Skantze and Jens Edlund {gabriel,edlund}@speech.kth.se Centre for Speech Technology Department of Speech, Music and Hearing KTH, Sweden

Jan 03, 2016

Transcript
Page 1: Early error detection on word level

Centre for Speech Technology

Early error detection on word level

Gabriel Skantze and Jens Edlund
{gabriel,edlund}@speech.kth.se

Centre for Speech Technology
Department of Speech, Music and Hearing

KTH, Sweden

Page 2: Early error detection on word level

Overview

• How do we handle errors in conversational human-computer dialogue?

• Which features are useful for error detection in ASR results?

• Two studies on selected features:
  – Machine learning
  – Human subjects’ judgement

Page 3: Early error detection on word level

Error detection

• Early error detection
  – Detect if a given recognition result contains errors
  – e.g. Litman, D. J., Hirschberg, J., & Swerts, M. (2000).

• Late error detection
  – Feed back the interpretation of the utterance to the user (grounding)
  – Based on the user’s reaction to that feedback, detect errors in the original utterance
  – e.g. Krahmer, E., Swerts, M., Theune, T. & Weegels, M. E. (2001).

• Error prediction
  – Detect that errors may occur later on in the dialogue
  – e.g. Walker, M. A., Langkilde-Geary, I., Wright Hastie, H., Wright, J., & Gorin, A. (2002).

Page 4: Early error detection on word level

Why early error detection?

• ASR errors reflect errors in acoustic and language models. Why not fix them there?
  – Post-processing may consider systematic errors in the models, due to mismatched training and usage conditions.
  – Post-processing may help to pinpoint the actual problems in the models.
  – Post-processing can include factors not considered by the ASR, such as:
    • Prosody
    • Semantics
    • Dialogue history

Page 5: Early error detection on word level

Corpus collection

[Figure: corpus collection setup. Diagram labels: User, Operator, Vocoder, ASR, Listens, Speaks, Reads.]

Spoken utterance: I have the lawn on my right and a house with number two on my left

ASR output: i have the lawn on right is and a house with from two on left

Page 6: Early error detection on word level

Study I: Machine learning

• 4470 words
• 73.2% correct (baseline)
• 4/5 training data, 1/5 test data
• Two ML algorithms tested
  – Transformation-based learning (µ-TBL)
    • Learn a cascade of rules that transforms the classification
  – Memory-based learning (TiMBL)
    • Simply store each training instance in memory
    • Compare the test instance to the stored instances and find the closest match
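The memory-based idea described above (store every training instance, then label a test instance by its closest stored match) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unweighted overlap distance, the toy feature tuples and the names `overlap_distance` and `classify` are hypothetical, and TiMBL's own metrics and feature weighting are not reproduced here.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(memory, instance, k=1):
    """Label an instance by majority vote among its k nearest stored instances."""
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training instances (not from the corpus):
# (word, POS, is content word) -> was the word recognised correctly?
memory = [
    (("lawn", "Noun", True), True),
    (("is", "Verb", False), False),
    (("from", "Prep", False), False),
    (("house", "Noun", True), True),
]

# An unseen content noun is closest to the stored content nouns.
print(classify(memory, ("building", "Noun", True)))
```

"Training" here amounts to filling `memory`; all the work happens at classification time, which is why the approach is also called lazy learning.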

Page 7: Early error detection on word level

Features

Group        Feature          Explanation
Confidence   Confidence       Speech recognition word confidence score
Lexical      Word             The word
             POS              The part-of-speech for the word
             Length           The number of syllables in the word
             Content          Is it a content word?
Contextual   PrevPOS          The part-of-speech for the previous word
             NextPOS          The part-of-speech for the next word
             PrevWord         The previous word
Discourse    PrevDialogueAct  The dialogue act of the previous operator utterance
             Mentioned        Is it a content word that has been mentioned previously by the operator in the discourse?
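As a rough illustration of how the features in the table could be assembled for each recognised word, the sketch below builds one feature row per token. Everything beyond the feature names is an assumption: the input format, the `CONTENT_POS` definition of a content word, the `START`/`END` padding symbols and the function name `extract_features` are hypothetical and not taken from the slides.

```python
# Assumption: which parts of speech count as content words.
CONTENT_POS = {"Noun", "Verb", "Adjective", "Adverb"}

def extract_features(tokens, operator_words, prev_dialogue_act):
    """Build one feature row per recognised word.

    tokens: list of dicts with keys word, pos, confidence, syllables.
    operator_words: words the operator has used so far in the dialogue.
    prev_dialogue_act: dialogue act label of the previous operator utterance.
    """
    rows = []
    for i, tok in enumerate(tokens):
        content = tok["pos"] in CONTENT_POS
        rows.append({
            "Confidence": tok["confidence"],
            "Word": tok["word"],
            "POS": tok["pos"],
            "Length": tok["syllables"],
            "Content": content,
            "PrevPOS": tokens[i - 1]["pos"] if i > 0 else "START",
            "NextPOS": tokens[i + 1]["pos"] if i + 1 < len(tokens) else "END",
            "PrevWord": tokens[i - 1]["word"] if i > 0 else "START",
            "PrevDialogueAct": prev_dialogue_act,
            "Mentioned": content and tok["word"] in operator_words,
        })
    return rows

# Hypothetical two-word recognition result.
tokens = [
    {"word": "the", "pos": "Det", "confidence": 80, "syllables": 1},
    {"word": "house", "pos": "Noun", "confidence": 35, "syllables": 1},
]
rows = extract_features(tokens, operator_words={"house"}, prev_dialogue_act="Question")
print(rows[1]["Mentioned"], rows[1]["PrevWord"])
```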

Page 8: Early error detection on word level

Results

Feature set                                     µ-TBL   TiMBL
Confidence                                      77.3%   76.0%
Lexical                                         77.5%   78.0%
Lexical + Contextual                            81.4%   82.8%
Lexical + Confidence                            81.3%   81.0%
Lexical + Confidence + Contextual               83.9%   83.2%
Lexical + Confidence + Contextual + Discourse   85.1%   84.1%

• Content words:
  – Baseline: 69.8%, µ-TBL: 87.7%, TiMBL: 87.0%
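One way to read these accuracies against their baselines is as a relative error reduction, that is, the share of the baseline's errors that the classifier removes. The slides do not report this figure; the sketch below only derives it from the µ-TBL numbers above and the 73.2% all-words baseline given for Study I, and the function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, model_acc):
    """Fraction of the baseline's errors removed by the model (accuracies in percent)."""
    return (model_acc - baseline_acc) / (100.0 - baseline_acc)

# µ-TBL with all feature groups, all words: baseline 73.2%, accuracy 85.1%.
all_words = relative_error_reduction(73.2, 85.1)

# µ-TBL on content words only: baseline 69.8%, accuracy 87.7%.
content_words = relative_error_reduction(69.8, 87.7)

print(f"all words: {all_words:.1%}, content words: {content_words:.1%}")
# prints "all words: 44.4%, content words: 59.3%"
```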

Page 9: Early error detection on word level

Rules learned by µ-TBL

Transformation   Rule
TRUE > FALSE     Confidence < 50 & Content = TRUE
TRUE > FALSE     Confidence < 60 & POS = Verb & Length = 2
TRUE > FALSE     Confidence < 40 & POS = Adverb & Length = 1
TRUE > FALSE     Confidence < 50 & POS = Adverb & Length = 2
TRUE > FALSE     Confidence < 40 & POS = Verb & Length = 1
FALSE > TRUE     Confidence > 40 & Mentioned = TRUE & POS = Noun & Length = 2
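To show how such a cascade operates, the sketch below applies the six rules in order to a single word. It rests on several assumptions that the slide does not state: that TRUE means "recognised correctly", that every word starts out labelled TRUE, that the rules fire in the order listed, and that the word is represented as a dictionary with the hypothetical keys used here. It illustrates rule application only, not how µ-TBL learns the rules.

```python
# Each rule: (label it applies to, label it produces, condition on the word).
RULES = [
    (True, False, lambda w: w["confidence"] < 50 and w["content"]),
    (True, False, lambda w: w["confidence"] < 60 and w["pos"] == "Verb" and w["length"] == 2),
    (True, False, lambda w: w["confidence"] < 40 and w["pos"] == "Adverb" and w["length"] == 1),
    (True, False, lambda w: w["confidence"] < 50 and w["pos"] == "Adverb" and w["length"] == 2),
    (True, False, lambda w: w["confidence"] < 40 and w["pos"] == "Verb" and w["length"] == 1),
    (False, True, lambda w: w["confidence"] > 40 and w["mentioned"]
                            and w["pos"] == "Noun" and w["length"] == 2),
]

def apply_rules(word, label=True):
    """Run the cascade once; a later rule can overturn an earlier one."""
    for source, target, condition in RULES:
        if label == source and condition(word):
            label = target
    return label

# Hypothetical word: a low-confidence, two-syllable noun the operator has already mentioned.
word = {"confidence": 45, "content": True, "pos": "Noun", "length": 2, "mentioned": True}
print(apply_rules(word))
```

In this example the first rule flips the label to FALSE, and the last rule flips it back to TRUE because the noun was mentioned earlier, which is the kind of interaction a rule cascade makes easy to inspect.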

Page 10: Early error detection on word level

Study II: Human error detection

• First 15 user utterances from 4 dialogues with high WER
• 50% of the words correct (baseline)
• 8 judges
• Features were varied for each utterance:
  – ASR information
  – Context information

Page 11: Early error detection on word level

Features

NoContext        No context. ASR output only.
PreviousContext  Previous utterance visible.
FullContext      The dialogue history is given incrementally.
MapContext       As FullContext, with the addition of the map.

NoConfidence     Recognised string only.
Confidence       Recognised string, colour coded for word confidence.
NBestList        As Confidence, but the 5-best ASR result was given.

Page 12: Early error detection on word level

The judges’ interface

[Screenshot: the judges’ interface. Labelled parts: utterance confidence; grey scale reflecting word confidence; 5-best list; dialogue so far; correction field.]

Page 13: Early error detection on word level

Results

[Figure: results chart. Vertical axis from 0.40 to 1.00. Labels: NOCONTEXT, CONTEXT; NBESTLIST, CONFIDENCE, NOCONFIDENCE; All, Worst half, Best half.]

Page 14: Early error detection on word level

Conclusions & Discussion

• ML can be used for early error detection on word level, especially for content words.

• Word confidence scores have some use.

• Utterance context and lexical information improve the ML performance.

• A rule-learning algorithm such as transformation-based learning can be used to pinpoint the specific problems.

• N-best lists are useful for human subjects. How do we operationalise them for ML?

Page 15: Early error detection on word level

Conclusions & Discussion

• ML performance improved only slightly when discourse context was added.

– Further work in operationalising context for ML should focus on the previous utterance.

• The classifier should be tested together with a parser or keyword spotter to see if it can improve performance.

• Other features should be investigated, such as prosody. These may improve performance further.

Page 16: Early error detection on word level

The End

Thank you for your attention!

Questions?