Flexible Turn-Taking for Spoken Dialog Systems PhD Thesis Defense Antoine Raux Language Technologies Institute, CMU December 12, 2008 Thesis Committee Maxine Eskenazi (chair) Alan W Black Reid Simmons Diane J. Litman
Transcript
Slide 1
Slide 2
Flexible Turn-Taking for Spoken Dialog Systems PhD Thesis
Defense Antoine Raux Language Technologies Institute, CMU December
12, 2008 Thesis Committee Maxine Eskenazi (chair) Alan W Black Reid
Simmons Diane J. Litman
Slide 3
Spoken Dialog Systems
Spoken dialog systems have long promised to improve human-machine interaction: speech is a natural means of communication, and recent improvements in the underlying technologies have made such systems a reality.
Slide 4
Sometimes they work
S: What can I do for you?
U: I'd like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood, Children's Hospital, at 5:16 PM.
Slide 5
but not always
S: What can I do for you?
U: kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61... If you want... Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving... Leaving from McKeesport. Is...
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?
Slide 6
Key Definitions
(Conversational) Floor: "The right to address an assembly" (Merriam-Webster); the interactional state that describes which participant in a dialog has the right to provide or request information at any point.
Turn-Taking: The process by which participants in a conversation alternately own the conversational floor.
Slide 7
Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems. Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
Slide 8
Floor, Intentions and Beliefs
The floor is not an observable state. Rather, participants have:
- intentions to claim the floor or not
- beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps and overlaps. [Sacks et al 1974, Clark 1996]
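The intention/belief view above can be made concrete in a tiny floor-state sketch. This is an illustrative toy, not the thesis's actual model: the state names, the confidence threshold, and the `floor_state` helper are all assumptions for exposition.

```python
from enum import Enum

class Floor(Enum):
    """Hypothesized floor states, from the system's point of view."""
    SYSTEM = "system holds the floor"
    USER = "user holds the floor"
    FREE = "gap: neither participant is claiming the floor"
    BOTH = "overlap: both participants are claiming the floor"

def floor_state(system_claims: bool, user_claiming_belief: float,
                threshold: float = 0.5) -> Floor:
    """Combine the system's own intention with its (uncertain) belief
    that the user is currently claiming the floor."""
    user_claims = user_claiming_belief >= threshold
    if system_claims and user_claims:
        return Floor.BOTH    # overlap -> risk of a cut-in
    if system_claims:
        return Floor.SYSTEM
    if user_claims:
        return Floor.USER
    return Floor.FREE        # gap -> risk of latency

# Example: system silent, voice activity suggests the user is speaking.
assert floor_state(False, 0.9) is Floor.USER
```

Because the belief is uncertain, a misestimated `user_claiming_belief` is exactly what produces the breakdowns (cut-ins, latency) discussed on the next slides.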
Slide 9
Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns in turn-taking: cut-ins, latency, barge-in latency, and self-interruptions.
Slide 10
Turn-Taking Errors by System
Cut-ins: the system grabs the floor before the user releases it.
U: kay. 51C Carrick from Century Square (...)
S: The 61...
Latency: the system waits after the user has released the floor.
S: (...) Is this correct?
U: Yeah.
S: Alright (...)
Slide 11
Turn-Taking Errors by System
Barge-in latency: the system keeps the floor while the user is claiming it.
S: What can I do for you? For example, you can say "when is..." Where would you li... Let's proceed step by step. Which neighb...
U: 61A.
S: Leaving from North Side. Is this correct?
Self-interruptions: the system releases the floor while the user is not claiming it.
S: For example, you can say "When is the next 28X from downtown to the airport?" or "I'd like to go from McKee..."
U: When is the next 54...
S: Leaving from Atwood. Is this correct?
Slide 12
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
Conclusion
Slide 13
Pipeline Architectures
Speech Recognition → Natural Language Understanding → Dialog Management (+ Backend) → Natural Language Generation → Speech Synthesis
Turn-taking is imposed by full-utterance-based processing:
- sequential processing
- lack of reactivity
- no sharing of information across modules
- hard to extend to multimodal/asynchronous events
Slide 14
Multi-layer Architectures
Separate reactive from deliberative behavior: turn-taking vs dialog act planning. Different layers work asynchronously. [Thorisson 1996, Allen et al 2001, Lemon et al 2003]
But no previous work:
- addressed how the conversational floor interacts with dialog management
- successfully deployed a multi-layer architecture in a broadly used system
Slide 16
Olympus 2 Architecture
Speech Recognition and other Sensors feed Natural Language Understanding; Dialog Management (with the Backend) drives Natural Language Generation, Speech Synthesis, and other Actuators; an Interaction Management layer mediates between them. This layer:
- explicitly models turn-taking
- integrates dialog features from both low and high levels
- operates on generalized events and actions
- uses floor state to control planning of conversational acts
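The event-driven layering above can be sketched minimally: an interaction manager that consumes generalized events as they arrive, rather than waiting for complete utterances. All names below (`Event`, `InteractionManager`, the `vad` event source) are hypothetical illustrations, not Olympus 2 APIs.

```python
import queue
from dataclasses import dataclass

@dataclass
class Event:
    """A generalized event: speech activity, partial ASR results,
    synthesis progress, backend replies, etc. (illustrative only)."""
    source: str
    payload: object

class InteractionManager:
    """Reactive layer: updates floor state event by event, so the
    deliberative dialog manager can plan asynchronously on top of it."""
    def __init__(self):
        self.events = queue.Queue()
        self.floor = "free"

    def post(self, event: Event):
        self.events.put(event)

    def run_once(self) -> str:
        event = self.events.get()
        if event.source == "vad" and event.payload == "speech_start":
            self.floor = "user"      # user is claiming the floor
        elif event.source == "vad" and event.payload == "speech_end":
            self.floor = "free"      # dialog manager may now take a turn
        return self.floor

im = InteractionManager()
im.post(Event("vad", "speech_start"))
assert im.run_once() == "user"
```

The design point, per the slide, is that floor state lives in this reactive layer and gates when the deliberative layer's planned conversational acts are actually executed.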
Slide 17
Olympus 2 Deployment
Ported Let's Go to Olympus 2:
- a publicly deployed telephone bus information system, originally built using Olympus 1
- the new version has processed about 30,000 dialogs since deployment, with no performance degradation
- allows research on turn-taking models to be guided by real users' behavior
Slide 18
Outline
Introduction
An event-driven architecture for spoken dialog systems
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 19
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport.
Detecting when the user releases the floor. Potential problems: cut-ins, latency.
Slide 20
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport. ← end of turn
Slide 21
Latency / Cut-in Tradeoff
Long threshold → few cut-ins, long latency
Slide 22
Latency / Cut-in Tradeoff
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Can we exploit dialog information to get the best of both worlds?
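The tradeoff above can be illustrated numerically. This is a sketch under invented pause durations: with a fixed endpointing threshold, the cut-in rate is the fraction of within-turn pauses longer than the threshold, while the response latency after a genuine end of turn is at least the threshold itself.

```python
# Hypothetical durations (ms) of pauses *within* user turns; real
# distributions would come from corpus data, not these made-up values.
internal_pauses = [120, 250, 400, 650, 900, 1200]

def tradeoff(threshold_ms: float):
    """For a fixed silence threshold:
    - cut-in: an internal pause outlasts the threshold, so the system
      wrongly takes the floor mid-turn;
    - latency: after a real end of turn, the system always waits the
      full threshold before responding."""
    cut_ins = sum(p > threshold_ms for p in internal_pauses)
    cut_in_rate = cut_ins / len(internal_pauses)
    latency_ms = threshold_ms
    return cut_in_rate, latency_ms

# Short threshold: fast but interrupts often; long threshold: polite but slow.
assert tradeoff(200) == (5 / 6, 200)
assert tradeoff(1500) == (0.0, 1500)
```

No single fixed threshold wins on both axes, which motivates the variable, dialog-informed thresholds developed next.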
Slide 23
End-of-Turn Detection as Classification
Classify pauses as internal/final based on words, syntax, prosody [Sato et al, 2002]. Repeat the classification every n milliseconds until the pause ends or end-of-turn is detected [Ferrer et al, 2003, Takeuchi et al, 2004].
But no previous work:
- successfully combined a wide range of features
- tested the model in a real dialog system
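The repeated-classification scheme cited above can be sketched as a loop: while the silence lasts, re-score the pause every n milliseconds and commit to end-of-turn once the classifier is confident. The step size, confidence cutoff, and toy classifier below are assumptions for illustration, not values from the cited papers.

```python
def detect_end_of_turn(pause_ms, classify, step_ms=100):
    """Re-run an internal/final pause classifier every step_ms while
    the pause lasts. `classify(elapsed_ms)` returns P(turn is final)
    given the silence observed so far. Returns the elapsed silence at
    which end-of-turn was declared, or None if the user resumed first."""
    elapsed = 0
    while elapsed < pause_ms:          # pause is still ongoing
        elapsed += step_ms
        if classify(elapsed) >= 0.8:   # confident the turn is over
            return elapsed             # respond after `elapsed` ms
    return None                        # internal pause: user spoke again

# Toy classifier whose confidence grows with silence duration.
assert detect_end_of_turn(1000, lambda t: t / 1000) == 800
assert detect_end_of_turn(300, lambda t: t / 1000) is None
```

Unlike a fixed timeout, the wait adapts per pause: a confident classifier ends the turn early, while ambiguous pauses get more time.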
Slide 24
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 25
Using Variable Thresholds
S: What can I do for you?
U: I'd like to go to the airport.
Features for setting the pause threshold:
- Discourse (dialog state: open question, specific question, confirmation)
- Semantics (partial ASR: does the partial hypothesis match current expectations?)
- Prosody (F0, duration)
- Timing (pause start)
- Speaker (average number of pauses)
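A hand-written sketch of how such features could drive a variable threshold follows. The splits and millisecond values are invented for illustration; in the thesis the thresholds are learned from data by a decision tree.

```python
def pause_threshold_ms(dialog_state: str,
                       partial_matches_expectation: bool,
                       avg_pause_ms: float) -> int:
    """Pick a per-pause endpointing threshold from dialog features.
    Feature names mirror the slide; all split points and returned
    values are made up, not the learned tree."""
    if partial_matches_expectation:
        # The partial hypothesis already answers the current prompt:
        # the user has probably finished, so wait only briefly.
        return 250 if avg_pause_ms < 200 else 500
    if dialog_state == "open_question":
        # Open prompts invite long, hesitant answers: wait longer.
        return 1200
    return 700  # fall back near a typical fixed threshold

assert pause_threshold_ms("confirmation", True, 150) == 250
assert pause_threshold_ms("open_question", False, 400) == 1200
```

The point of the structure is the same as the learned tree's: confident, expectation-matching partials earn short waits, while open questions and hesitant speakers earn long ones.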
Slide 26
Example Decision Tree
[Decision tree figure: internal nodes test features such as utterance duration < 2000 ms, whether the partial ASR matches expectations or contains "YES", whether it has fewer than 3 words or is available at all, average pause duration (< 200 ms, < 300 ms), whether the dialog state is an open question, average non-understanding ratio < 15%, and consecutive user turns without a system prompt; leaves assign pause thresholds ranging from 200 ms to 1440 ms.]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 27
Example Decision Tree
[Same decision tree figure, here evaluated on the partial hypothesis "I'd like to go to".]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 28
Example Decision Tree
[Same decision tree figure, here evaluated on the complete utterance "I'd like to go to the airport."]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 29
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 30
Performance per Feature Set
22% latency reduction; 38% cut-in rate reduction.
Slide 31
Performance per Feature Set
Semantics is the most useful feature type. [Chart compares each individual feature set against all features combined.]
Slide 32
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 33
Live Evaluation
Implemented the decision tree in the Let's Go IM. Operating point: 3% cut-in rate, 635 ms average latency.
1061 dialogs collected in May 2008:
- 548 control dialogs (fixed threshold = 700 ms)
- 513 treatment dialogs (decision tree)
Slide 34
Cut-in Rate per Dialog State
Largest improvement: after open requests. Fewer cut-ins overall (p