misunderstandings, corrections and beliefs in spoken language interfaces

Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus
[email protected]
Dec 22, 2015
problem
spoken language interfaces lack robustness when faced with understanding errors
stems mostly from speech recognition
spans most domains and interaction types
exacerbated by operating conditions
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
some statistics …
corrections [Krahmer, Swerts, Litman, Levow]
30% of utterances correct system mistakes
corrections are 2-3 times more likely to be misrecognized
semantic error rates: ~25-35%
SpeechActs [SRI] 25%
CU Communicator [CU] 27%
Jupiter [MIT] 28%
CMU Communicator [CMU] 32%
How May I Help You? [AT&T] 36%
two types of understanding errors
NON-understanding: the system cannot extract any meaningful information from the user's turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user's turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
misunderstandings
MIS-understanding: the system extracts incorrect information from the user's turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
detect potential misunderstandings; do something about them
fix recognition
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
detecting misunderstandings
recognition confidence scores
S: What city are you leaving from?
U: Birmingham [BERLIN PM]   conf=0.63

traditionally [Bansal, Chase, Cox, Kemp, many others]
  speech recognition confidence scores
  use acoustic, language model and search info
  frame, phoneme, word-level
“semantic” confidence scores
we're interested in semantics, not words
  YES = YEAH, NO = NO WAY

use machine learning to build confidence annotators
  in-domain, manually labeled data
  utterance: [BERLIN PM] Birmingham
  labels: correct / misunderstood
  features from different knowledge sources
  binary classification problem
  probability of misunderstanding: regression problem (see the sketch below)
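To make the classification setup concrete, here is a minimal sketch of a semantic confidence annotator trained as a binary classifier that outputs a probability of correctness. This is an illustration only: the feature set, data, and use of scikit-learn are assumptions, not the annotator actually used in these systems.

```python
# Minimal sketch of a "semantic" confidence annotator: a binary classifier
# trained on in-domain, manually labeled turns (correct vs. misunderstood).
# Feature names and data below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# one row per concept hypothesis: [ASR confidence, #words, parse coverage, barge-in]
X = np.array([
    [0.82, 3, 1.0, 0],   # clean turn, full parse
    [0.35, 2, 0.5, 1],   # noisy turn, partial parse
    [0.63, 2, 1.0, 0],
    [0.20, 5, 0.3, 1],
])
y = np.array([1, 0, 0, 0])   # 1 = concept value was correct, 0 = misunderstood

annotator = LogisticRegression()
annotator.fit(X, y)

# at run time: probability that the decoded concept value is correct
p_correct = annotator.predict_proba([[0.63, 2, 1.0, 0]])[0, 1]
print(f"P(correct) = {p_correct:.2f}")
```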
a typical result
Identifying User Corrections Automatically in a Spoken Dialog System [Walker, Wright, Langkilde]
How May I Help You? corpus: call routing for phone services
11,787 turns

features
  ASR: recog, numwords, duration, dtmf, rg-grammar, tempo, …
  understanding: confidence, context-shift, top-task, diff-conf, …
  dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …

binary classification task
  majority baseline (error): 36.5%
  RIPPER (error): 14%
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
detect user corrections
is the user trying to correct the system?
S: Where would you like to go?
U: Huntsville [SEOUL]   (misunderstanding)
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]   (misunderstanding; user correction)
same story: use machine learning
  in-domain, manually labeled data
  features from different knowledge sources
  binary classification problem
  probability of correction: regression problem
typical result
Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]
TOOT corpus: access to train information
2328 turns, 152 dialogs

features
  prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo, …
  ASR: gram, str, conf, ynstr, …
  dialog position: diadist
  dialog history: preturn, prepreturn, pmeanf

binary classification task
  majority baseline: 29%
  RIPPER: 15.7%
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
belief updating problem: an easy case
S: on which day would you like to travel?
U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}     → departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no [NO] {CONF=0.88}                                  → departure_date = {Ø}
belief updating problem: a trickier case
S: Where would you like to go?
U: Huntsville [SEOUL] {CONF=0.65}                       → destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
                                                        → destination = {?}
belief updating problem formalized

given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)

S: traveling to Seoul. What day did you need to travel?   destination = {seoul/0.65}
U: [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}   destination = {?}
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
belief updating: current solutions
most systems only track values, not beliefs
new values overwrite old values
explicit confirm + yes → trust hypothesis
explicit confirm + no → kill hypothesis
explicit confirm + "other" → non-understanding
implicit confirm: not much
(a sketch of these rules follows below)

"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al., 2002]
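For contrast with the learned approach described later, here is a minimal sketch of the heuristic update rules listed above. The function and field names are assumptions for illustration, not any particular system's API.

```python
# Sketch of the heuristic belief update most systems use: values, not beliefs.
# Names are illustrative only.
def heuristic_update(value, confirm_type, user_response):
    """Return the new stored value for a concept after a confirmation action."""
    if confirm_type == "explicit":
        if user_response == "yes":
            return value        # trust the hypothesis
        if user_response == "no":
            return None         # kill the hypothesis
        return value            # "other": treat the turn as a non-understanding, keep the pending value
    # implicit confirmation: not much is done; new values simply overwrite old ones
    return value

# example: explicit confirmation of "seoul", user says "no"
print(heuristic_update("seoul", "explicit", "no"))   # -> None
```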
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
belief updating: general form
given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)
restricted version: 2 simplifications
1. compact belief
   system unlikely to "hear" more than 3 or 4 values (single vs. multiple recognition results)
   in our data: max = 3 values, only 6.9% have >1 value
   belief compressed to the confidence score of the top hypothesis

2. updates after confirmation actions

reduced problem: ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)   (interface sketched below)
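A minimal sketch of what the reduced problem's interface looks like in code. The class, field names, and the toy update function are assumptions for illustration; the real f is the learned model described later.

```python
# Sketch of the restricted belief update: the belief over a concept is compressed
# to the confidence score of its top hypothesis, and updates are computed only
# after confirmation actions.  Names are illustrative, not a real system API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ConfirmationEvent:
    conf_top_initial: float               # ConfTop_initial(C)
    system_action: str                    # e.g. "explicit_confirm" or "implicit_confirm"
    response_features: Dict[str, float]   # features of the user response R

# ConfTop_updated(C) <- f(ConfTop_initial(C), SA, R)
def update_conf_top(event: ConfirmationEvent,
                    f: Callable[[ConfirmationEvent], float]) -> float:
    return f(event)

# a placeholder f: an explicit "yes" confirms the value, "no" wipes it out
def toy_f(event: ConfirmationEvent) -> float:
    if event.system_action == "explicit_confirm":
        return 0.95 if event.response_features.get("said_yes", 0) else 0.05
    return event.conf_top_initial

ev = ConfirmationEvent(0.65, "explicit_confirm", {"said_yes": 0})
print(update_conf_top(ev, toy_f))   # -> 0.05
```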
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
data

I found 10 rooms for Friday between 1 and 3 p.m. [implicit confirmation] Would you like a small room or a large one? [task]

collected with RoomLine
  a phone-based mixed-initiative spoken dialog system
  conference room reservation: search and negotiation
  explicit and implicit confirmations
  confidence threshold model (+ some exploration)
user study
  46 participants, first-time users
  10 scenarios, fixed order, presented graphically (explained during briefing)
  compensated per task success
corpus statistics
449 sessions, 8848 user turns
orthographically transcribed
manually annotated:
  misunderstandings (concept-level)
  non-understandings
  user corrections
  correct concept values
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
user response types
following the Krahmer and Swerts study on a Dutch train timetable information system

3 user response types
  YES: yes, right, that's right, correct, etc.
  NO: no, wrong, etc.
  OTHER

cross-tabulated against correctness of confirmations
user responses to explicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

              YES          NO           Other
CORRECT       94% [93%]    0% [0%]      5% [7%]
INCORRECT     1% [6%]      72% [57%]    27% [37%]
~10%

from decoded:

              YES    NO     Other
CORRECT       87%    1%     12%
INCORRECT     1%     61%    38%
other responses to explicit confirmations
~70% of users repeat the correct value
~15% of users don't address the question
  attempt to shift conversation focus

              User does not correct     User corrects
CORRECT       1159                      0
INCORRECT     29 [10% of incorrect]     250 [90% of incorrect]
user responses to implicit confirmations
from transcripts [numbers in brackets from Krahmer & Swerts]:

              YES         NO          Other
CORRECT       30% [0%]    7% [0%]     63% [100%]
INCORRECT     6% [0%]     33% [15%]   61% [85%]

from decoded:

              YES    NO     Other
CORRECT       28%    5%     67%
INCORRECT     7%     27%    66%
ignoring errors in implicit confirmations
              User does not correct     User corrects
CORRECT       552                       2
INCORRECT     118 [51% of incorrect]    111 [49% of incorrect]

users correct later (40% of 118)
users interact strategically: correct only if essential

              ~correct later    correct later
~critical     55                2
critical      14                47
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
machine learning approach
need good probability outputs
  low cross-entropy between model predictions and reality
  cross-entropy = negative average log posterior   (see the sketch below)

logistic regression
  sample efficient
  stepwise approach → feature selection

logistic model tree for each action
  root splits on response-type
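A minimal sketch of the two evaluation metrics used in the result slides: soft error, the cross-entropy (negative average log posterior) between the predicted confidences and the true labels, and hard error, which is assumed here to be 0/1 error after thresholding the confidence at 0.5. The data is invented.

```python
# Sketch of the two metrics from the results slides (data is invented).
# Soft error = cross-entropy = negative average log posterior of the true labels.
# Hard error is assumed here to be 0/1 error after thresholding at 0.5.
import numpy as np

def soft_error(p_correct, labels):
    p = np.clip(p_correct, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def hard_error(p_correct, labels):
    return np.mean((p_correct >= 0.5) != labels)

p = np.array([0.9, 0.7, 0.2, 0.4])   # updated confidences for four concepts
y = np.array([1, 1, 0, 1])           # whether each concept value was actually correct
print(f"hard error = {hard_error(p, y):.2%}, soft error = {soft_error(p, y):.3f}")
```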
features. target.
initial situation
  initial confidence score
  concept identity, dialog state, turn number

system action
  other actions performed in parallel

features of the user response
  acoustic / prosodic features
  lexical features
  grammatical features
  dialog-level features

target: was the value correct?
baselines
initial baseline: accuracy of system beliefs before the update
heuristic baseline: accuracy of the heuristic rule currently used in the system
oracle baseline: accuracy if we knew exactly when the user is correcting the system
results: explicit confirmation
[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       31.15             0.51
Heuristic     8.41              0.19
LMT           3.57              0.12
Oracle        2.71              —
results: implicit confirmation

[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       30.40             0.61
Heuristic     23.37             0.67
LMT           16.15             0.43
Oracle        15.33             —
results: unplanned implicit confirmation

[bar chart, shown here as a table]
              Hard error (%)    Soft error
Initial       15.40             0.43
Heuristic     14.36             0.46
LMT           12.64             0.34
Oracle        10.37             —
informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept id
priors on concept values [not included in these results]
outline
detecting misunderstandings
detecting user corrections [late-detection of misunderstandings]
belief updating [construct accurate beliefs by integrating information from multiple turns]
  current solutions
  a restricted version
  data
  user response analysis
  experiments and results
  discussion. caveats. future work
discussion
evaluation: does it make sense? what would be a better evaluation?

current limitation: belief compression
  extend the models to N hypotheses + other

current limitation: system actions
  extend the models to cover all system actions
thank you!
a more subtle caveat

distribution of training data: confidence annotator + heuristic update rules
distribution of run-time data: confidence annotator + learned model
always a problem when interacting with the world!

hopefully, the distribution shift will not cause a large degradation in performance
  remains to be validated empirically
  maybe a bootstrap approach?
KL-divergence & cross-entropy

KL divergence:     D(p||q) = Σ_x p(x) log( p(x) / q(x) )
Cross-entropy:     CH(p, q) = H(p) + D(p||q) = -Σ_x p(x) log q(x)
Log-likelihood:    LL(q) = Σ_x log q(x)
(with p the empirical distribution of the data, the cross-entropy is the negative average log-likelihood; see the numeric check below)
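A tiny numeric check of the identity CH(p, q) = H(p) + D(p||q) for two made-up discrete distributions; the distributions are assumptions, only the identity comes from the slide.

```python
# Numeric check of CH(p, q) = H(p) + D(p||q) for two discrete distributions
# over three outcomes (values are invented).
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))         # entropy H(p)
D_pq =  np.sum(p * np.log(p / q))     # KL divergence D(p||q)
CH   = -np.sum(p * np.log(q))         # cross-entropy CH(p, q)

print(np.isclose(CH, H_p + D_pq))     # True
```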
logistic regression

regression model for binomial (binary) dependent variables

  P(x=1 | f) = 1 / (1 + e^(-w·f))
  log( p(x=1) / p(x=0) ) = w·f

fit the model using maximum likelihood (average log-likelihood); any stats package will do it for you
no R² measure; test fit using a "likelihood ratio" test

stepwise logistic regression (sketched below)
  keep adding variables while the data likelihood increases significantly
  use the Bayesian information criterion to avoid overfitting
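A minimal sketch of the stepwise procedure: greedily add the feature that most improves the fit and stop when the Bayesian information criterion no longer improves. The data, feature count, and use of scikit-learn are assumptions for illustration.

```python
# Sketch of forward stepwise logistic regression with a BIC stopping rule.
# Data and feature names are invented; this illustrates the procedure only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                                   # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n) > 0).astype(int)

def log_likelihood(features, X, y):
    if not features:
        p = np.full(len(y), y.mean())                         # intercept-only model
    else:
        m = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, features], y)
        p = m.predict_proba(X[:, features])[:, 1]
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bic(features, X, y):
    k = len(features) + 1                                     # weights + intercept
    return k * np.log(len(y)) - 2 * log_likelihood(features, X, y)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = min(remaining, key=lambda j: bic(selected + [j], X, y))
    if bic(selected + [best], X, y) >= bic(selected, X, y):
        break                                                 # BIC no longer improves -> stop
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)                         # expect roughly [0, 2]
```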
logistic regression
[figure: example logistic regression fit, P(Task Success = 1) as a function of % non-understandings (FNON); x-axis 0-50%, y-axis 0-1]
logistic model tree
regression tree, but with logistic models on the leaves

[figure: a tree that splits on feature f at the root (f=0 vs. f=1) and on feature g below (g<=10 vs. g>10); each leaf holds a fitted logistic regression of P(Task Success = 1) vs. % non-understandings (FNON); a code sketch follows]
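A minimal sketch of the idea in the figure: a fixed tree that splits on features f and g, with a separate logistic regression fitted at each leaf. The tree structure, features, and data are all invented for illustration; a real logistic model tree also learns the splits from data.

```python
# Sketch of a logistic model tree matching the figure: the root splits on a
# binary feature f, one branch splits again on g <= 10, and each leaf holds
# its own logistic regression.  Structure, features, and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

class Leaf:
    def __init__(self):
        self.model = LogisticRegression()
    def fit(self, X, y):
        self.model.fit(X, y)
    def predict_proba(self, X):
        return self.model.predict_proba(X)[:, 1]

def route(f, g):
    """Return a leaf index by following the tree: split on f, then on g."""
    if f == 0:
        return 0
    return 1 if g <= 10 else 2

# toy data: columns are [f (0/1), g, fnon (% non-understandings)]
rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(0, 2, 300),
                     rng.uniform(0, 20, 300),
                     rng.uniform(0, 50, 300)])
y = (rng.uniform(size=300) > X[:, 2] / 60).astype(int)   # success drops as fnon grows

leaves = [Leaf(), Leaf(), Leaf()]
idx = np.array([route(f, g) for f, g, _ in X])
for i, leaf in enumerate(leaves):
    leaf.fit(X[idx == i][:, [2]], y[idx == i])            # logistic model on fnon only

# predict P(task success) for a new point with f=1, g=5, fnon=30%
print(leaves[route(1, 5)].predict_proba(np.array([[30.0]])))
```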