misunderstandings, corrections and beliefs in spoken language interfaces

Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus
[email protected]
Dec 22, 2015
problem
spoken language interfaces lack robustness when faced with understanding errors
stems mostly from speech recognition
spans most domains and interaction types
exacerbated by operating conditions
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
some statistics …
corrections [Krahmer, Swerts, Litman, Levow]
30% of utterances correct system mistakes
corrections are 2-3 times more likely to be misrecognized
semantic error rates: ~25-35%
SpeechActs [SRI] 25%
CU Communicator [CU] 27%
Jupiter [MIT] 28%
CMU Communicator [CMU] 32%
How May I Help You? [AT&T] 36%
two types of understanding errors
NON-understanding: the system cannot extract any meaningful information from the user's turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user's turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
misunderstandings
MIS-understanding: the system extracts incorrect information from the user's turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
detect potential misunderstandings; do something about them
fix recognition
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
detecting misunderstandings
recognition confidence scores
S: What city are you leaving from?
U: Birmingham [BERLIN PM]   conf=0.63

traditionally [Bansal, Chase, Cox, Kemp, many others]
  speech recognition confidence scores
  use acoustic, language model and search info
  frame, phoneme, word-level
“semantic” confidence scores
we're interested in semantics, not words
  YES = YEAH, NO = NO WAY

use machine learning to build confidence annotators
  in-domain, manually labeled data
  utterance: [BERLIN PM] Birmingham
  labels: correct / misunderstood
  features from different knowledge sources
  binary classification problem
  probability of misunderstanding: regression problem (see the sketch below)
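To make the classification setup concrete, here is a minimal sketch of a semantic confidence annotator trained as a binary classifier that outputs a probability of correctness. This is an illustration only: the feature set, data, and use of scikit-learn are assumptions, not the annotator actually used in these systems.

```python
# Minimal sketch of a "semantic" confidence annotator: a binary classifier
# trained on in-domain, manually labeled turns (correct vs. misunderstood).
# Feature names and data below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# one row per concept hypothesis: [ASR confidence, #words, parse coverage, barge-in]
X = np.array([
    [0.82, 3, 1.0, 0],   # clean turn, full parse
    [0.35, 2, 0.5, 1],   # noisy turn, partial parse
    [0.63, 2, 1.0, 0],
    [0.20, 5, 0.3, 1],
])
y = np.array([1, 0, 0, 0])   # 1 = concept value was correct, 0 = misunderstood

annotator = LogisticRegression()
annotator.fit(X, y)

# at run time: probability that the decoded concept value is correct
p_correct = annotator.predict_proba([[0.63, 2, 1.0, 0]])[0, 1]
print(f"P(correct) = {p_correct:.2f}")
```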
a typical result
Identifying User Corrections Automatically in a Spoken Dialog System [Walker, Wright, Langkilde]
How May I Help You? corpus: call routing for phone services
11,787 turns

features
  ASR: recog, numwords, duration, dtmf, rg-grammar, tempo, …
  understanding: confidence, context-shift, top-task, diff-conf, …
  dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …

binary classification task
  majority baseline (error): 36.5%
  RIPPER (error): 14%
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
detect user corrections
is the user trying to correct the system?
S: Where would you like to go?
U: Huntsville [SEOUL]   (misunderstanding)
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]   (misunderstanding; user correction)
same story: use machine learning
  in-domain, manually labeled data
  features from different knowledge sources
  binary classification problem
  probability of correction: regression problem
typical result
Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]
TOOT corpus: access to train information
2328 turns, 152 dialogs

features
  prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo, …
  ASR: gram, str, conf, ynstr, …
  dialog position: diadist
  dialog history: preturn, prepreturn, pmeanf

binary classification task
  majority baseline: 29%
  RIPPER: 15.7%
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
belief updating problem: an easy case
S: on which day would you like to travel?
U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}     → departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no [NO] {CONF=0.88}                                  → departure_date = {Ø}
belief updating problem: a trickier case
S: Where would you like to go?
U: Huntsville [SEOUL] {CONF=0.65}                       → destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
                                                        → destination = {?}
belief updating problem formalized

given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)

S: traveling to Seoul. What day did you need to travel?   destination = {seoul/0.65}
U: [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}   destination = {?}
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
belief updating: current solutions
most systems only track values, not beliefs
new values overwrite old values
explicit confirm + yes → trust hypothesis
explicit confirm + no → kill hypothesis
explicit confirm + "other" → non-understanding
implicit confirm: not much
(a sketch of these rules follows below)

"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al., 2002]
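For contrast with the learned approach described later, here is a minimal sketch of the heuristic update rules listed above. The function and field names are assumptions for illustration, not any particular system's API.

```python
# Sketch of the heuristic belief update most systems use: values, not beliefs.
# Names are illustrative only.
def heuristic_update(value, confirm_type, user_response):
    """Return the new stored value for a concept after a confirmation action."""
    if confirm_type == "explicit":
        if user_response == "yes":
            return value        # trust the hypothesis
        if user_response == "no":
            return None         # kill the hypothesis
        return value            # "other": treat the turn as a non-understanding, keep the pending value
    # implicit confirmation: not much is done; new values simply overwrite old ones
    return value

# example: explicit confirmation of "seoul", user says "no"
print(heuristic_update("seoul", "explicit", "no"))   # -> None
```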
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
belief updating: general form
given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)
restricted version: 2 simplifications
1. compact belief
   system unlikely to "hear" more than 3 or 4 values (single vs. multiple recognition results)
   in our data: max = 3 values, only 6.9% have >1 value
   belief compressed to the confidence score of the top hypothesis

2. updates after confirmation actions

reduced problem: ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)   (interface sketched below)
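A minimal sketch of what the reduced problem's interface looks like in code. The class, field names, and the toy update function are assumptions for illustration; the real f is the learned model described later.

```python
# Sketch of the restricted belief update: the belief over a concept is compressed
# to the confidence score of its top hypothesis, and updates are computed only
# after confirmation actions.  Names are illustrative, not a real system API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ConfirmationEvent:
    conf_top_initial: float               # ConfTop_initial(C)
    system_action: str                    # e.g. "explicit_confirm" or "implicit_confirm"
    response_features: Dict[str, float]   # features of the user response R

# ConfTop_updated(C) <- f(ConfTop_initial(C), SA, R)
def update_conf_top(event: ConfirmationEvent,
                    f: Callable[[ConfirmationEvent], float]) -> float:
    return f(event)

# a placeholder f: an explicit "yes" confirms the value, "no" wipes it out
def toy_f(event: ConfirmationEvent) -> float:
    if event.system_action == "explicit_confirm":
        return 0.95 if event.response_features.get("said_yes", 0) else 0.05
    return event.conf_top_initial

ev = ConfirmationEvent(0.65, "explicit_confirm", {"said_yes": 0})
print(update_conf_top(ev, toy_f))   # -> 0.05
```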
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
data

I found 10 rooms for Friday between 1 and 3 p.m. [implicit confirmation] Would you like a small room or a large one? [task]

collected with RoomLine
  a phone-based mixed-initiative spoken dialog system
  conference room reservation: search and negotiation
  explicit and implicit confirmations
  confidence threshold model (+ some exploration)
user study
  46 participants, first-time users
  10 scenarios, fixed order, presented graphically (explained during briefing)
  compensated per task success
corpus statistics
449 sessions, 8848 user turns
orthographically transcribed
manually annotated:
  misunderstandings (concept-level)
  non-understandings
  user corrections
  correct concept values
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
user response types
following the Krahmer and Swerts study on a Dutch train timetable information system

3 user response types
  YES: yes, right, that's right, correct, etc.
  NO: no, wrong, etc.
  OTHER

cross-tabulated against correctness of confirmations
user responses to explicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

              YES          NO           Other
CORRECT       94% [93%]    0% [0%]      5% [7%]
INCORRECT     1% [6%]      72% [57%]    27% [37%]
~10%

from decoded:

              YES    NO     Other
CORRECT       87%    1%     12%
INCORRECT     1%     61%    38%
other responses to explicit confirmations
~70% of users repeat the correct value
~15% of users don't address the question
  attempt to shift conversation focus

              User does not correct     User corrects
CORRECT       1159                      0
INCORRECT     29 [10% of incorrect]     250 [90% of incorrect]
user responses to implicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

              YES         NO          Other
CORRECT       30% [0%]    7% [0%]     63% [100%]
INCORRECT     6% [0%]     33% [15%]   61% [85%]

from decoded:

              YES    NO     Other
CORRECT       28%    5%     67%
INCORRECT     7%     27%    66%
ignoring errors in implicit confirmations
              User does not correct     User corrects
CORRECT       552                       2
INCORRECT     118 [51% of incorrect]    111 [49% of incorrect]

users correct later (40% of 118)
users interact strategically: correct only if essential

              ~correct later    correct later
~critical     55                2
critical      14                47
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
machine learning approach
need good probability outputs
  low cross-entropy between model predictions and reality
  cross-entropy = negative average log posterior   (see the sketch below)

logistic regression
  sample efficient
  stepwise approach → feature selection

logistic model tree for each action
  root splits on response-type
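A minimal sketch of the two evaluation metrics used in the result slides: soft error, the cross-entropy (negative average log posterior) between the predicted confidences and the true labels, and hard error, which is assumed here to be 0/1 error after thresholding the confidence at 0.5. The data is invented.

```python
# Sketch of the two metrics from the results slides (data is invented).
# Soft error = cross-entropy = negative average log posterior of the true labels.
# Hard error is assumed here to be 0/1 error after thresholding at 0.5.
import numpy as np

def soft_error(p_correct, labels):
    p = np.clip(p_correct, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def hard_error(p_correct, labels):
    return np.mean((p_correct >= 0.5) != labels)

p = np.array([0.9, 0.7, 0.2, 0.4])   # updated confidences for four concepts
y = np.array([1, 1, 0, 1])           # whether each concept value was actually correct
print(f"hard error = {hard_error(p, y):.2%}, soft error = {soft_error(p, y):.3f}")
```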
features. target.
initial situation
  initial confidence score
  concept identity, dialog state, turn number

system action
  other actions performed in parallel

features of the user response
  acoustic / prosodic features
  lexical features
  grammatical features
  dialog-level features

target: was the value correct?
baselines
initial baseline: accuracy of system beliefs before the update
heuristic baseline: accuracy of the heuristic rule currently used in the system
oracle baseline: accuracy if we knew exactly when the user is correcting the system
results: explicit confirmation
[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       31.15             0.51
Heuristic     8.41              0.19
LMT           3.57              0.12
Oracle        2.71              —
results: implicit confirmation

[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       30.40             0.61
Heuristic     23.37             0.67
LMT           16.15             0.43
Oracle        15.33             —
results: unplanned implicit confirmation

[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       15.40             0.43
Heuristic     14.36             0.46
LMT           12.64             0.34
Oracle        10.37             —
informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept id
priors on concept values [not included in these results]
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
discussion
evaluation: does it make sense? what would be a better evaluation?

current limitation: belief compression
  extend the models to N hypotheses + other

current limitation: system actions
  extend the models to cover all system actions
thank you!
a more subtle caveat

distribution of training data: confidence annotator + heuristic update rules
distribution of run-time data: confidence annotator + learned model
always a problem when interacting with the world!

hopefully, the distribution shift will not cause a large degradation in performance
  remains to be validated empirically
  maybe a bootstrap approach?
KL-divergence & cross-entropy

KL divergence:     D(p||q) = Σ_x p(x) log( p(x) / q(x) )
Cross-entropy:     CH(p, q) = H(p) + D(p||q) = -Σ_x p(x) log q(x)
Log-likelihood:    LL(q) = Σ_x log q(x)
(with p the empirical distribution of the data, the cross-entropy is the negative average log-likelihood; see the numeric check below)
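A tiny numeric check of the identity CH(p, q) = H(p) + D(p||q) for two made-up discrete distributions; the distributions are assumptions, only the identity comes from the slide.

```python
# Numeric check of CH(p, q) = H(p) + D(p||q) for two discrete distributions
# over three outcomes (values are invented).
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))         # entropy H(p)
D_pq =  np.sum(p * np.log(p / q))     # KL divergence D(p||q)
CH   = -np.sum(p * np.log(q))         # cross-entropy CH(p, q)

print(np.isclose(CH, H_p + D_pq))     # True
```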
logistic regression

regression model for binomial (binary) dependent variables

  P(x=1 | f) = 1 / (1 + e^(-w·f))
  log( p(x=1) / p(x=0) ) = w·f

fit the model using maximum likelihood (average log-likelihood); any stats package will do it for you
no R² measure; test fit using a "likelihood ratio" test

stepwise logistic regression (sketched below)
  keep adding variables while the data likelihood increases significantly
  use the Bayesian information criterion to avoid overfitting
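A minimal sketch of the stepwise procedure: greedily add the feature that most improves the fit and stop when the Bayesian information criterion no longer improves. The data, feature count, and use of scikit-learn are assumptions for illustration.

```python
# Sketch of forward stepwise logistic regression with a BIC stopping rule.
# Data and feature names are invented; this illustrates the procedure only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                                   # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n) > 0).astype(int)

def log_likelihood(features, X, y):
    if not features:
        p = np.full(len(y), y.mean())                         # intercept-only model
    else:
        m = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, features], y)
        p = m.predict_proba(X[:, features])[:, 1]
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bic(features, X, y):
    k = len(features) + 1                                     # weights + intercept
    return k * np.log(len(y)) - 2 * log_likelihood(features, X, y)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = min(remaining, key=lambda j: bic(selected + [j], X, y))
    if bic(selected + [best], X, y) >= bic(selected, X, y):
        break                                                 # BIC no longer improves -> stop
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)                         # expect roughly [0, 2]
```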
logistic regression
[figure: example logistic regression fit, P(Task Success = 1) as a function of % non-understandings (FNON); x-axis 0-50%, y-axis 0-1]
logistic model tree
regression tree, but with logistic models on the leaves

[figure: a tree that splits on feature f at the root (f=0 vs. f=1) and on feature g below (g<=10 vs. g>10); each leaf holds a fitted logistic regression of P(Task Success = 1) vs. % non-understandings (FNON); a code sketch follows]
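A minimal sketch of the idea in the figure: a fixed tree that splits on features f and g, with a separate logistic regression fitted at each leaf. The tree structure, features, and data are all invented for illustration; a real logistic model tree also learns the splits from data.

```python
# Sketch of a logistic model tree matching the figure: the root splits on a
# binary feature f, one branch splits again on g <= 10, and each leaf holds
# its own logistic regression.  Structure, features, and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

class Leaf:
    def __init__(self):
        self.model = LogisticRegression()
    def fit(self, X, y):
        self.model.fit(X, y)
    def predict_proba(self, X):
        return self.model.predict_proba(X)[:, 1]

def route(f, g):
    """Return a leaf index by following the tree: split on f, then on g."""
    if f == 0:
        return 0
    return 1 if g <= 10 else 2

# toy data: columns are [f (0/1), g, fnon (% non-understandings)]
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 2, 300),
                     rng.uniform(0, 20, 300),
                     rng.uniform(0, 50, 300)])
y = (rng.uniform(size=300) > X[:, 2] / 60).astype(int)   # success drops as fnon grows

leaves = [Leaf(), Leaf(), Leaf()]
idx = np.array([route(f, g) for f, g, _ in X])
for i, leaf in enumerate(leaves):
    leaf.fit(X[idx == i][:, [2]], y[idx == i])            # logistic model on fnon only

# predict P(task success) for a new point with f=1, g=5, fnon=30%
print(leaves[route(1, 5)].predict_proba(np.array([[30.0]])))
```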