Flexible Turn-Taking for Spoken Dialog Systems PhD Thesis Defense Antoine Raux Language Technologies Institute, CMU December 12, 2008 Thesis Committee Maxine Eskenazi (chair) Alan W Black Reid Simmons Diane J. Litman
Transcript
Slide 1
Slide 2
Flexible Turn-Taking for Spoken Dialog Systems PhD Thesis
Defense Antoine Raux Language Technologies Institute, CMU December
12, 2008 Thesis Committee Maxine Eskenazi (chair) Alan W Black Reid
Simmons Diane J. Litman
Slide 3
Spoken Dialog Systems
Spoken dialog systems have long promised to improve human-machine interaction: speech is a natural means of communication, and recent improvements in the underlying technologies have made such systems a reality.
Slide 4
Sometimes they work
S: What can I do for you?
U: I'd like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood, Children's Hospital, at 5:16 PM.
Slide 5
but not always
S: What can I do for you?
U: kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61... If you want... Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving... Leaving from McKeesport. Is...
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?
Slide 6
Key Definitions
(Conversational) Floor: "The right to address an assembly" (Merriam-Webster); the interactional state that describes which participant in a dialog has the right to provide or request information at any point.
Turn-Taking: The process by which participants in a conversation alternately own the conversational floor.
Slide 7
Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems. Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
Slide 8
Floor, Intentions and Beliefs
The floor is not an observable state. Rather, participants have:
- intentions to claim the floor or not
- beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps and overlaps. [Sacks et al 1974, Clark 1996]
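The intention/belief view above can be made concrete in a tiny floor-state sketch. This is an illustrative toy, not the thesis's actual model: the state names, the confidence threshold, and the `floor_state` helper are all assumptions for exposition.

```python
from enum import Enum

class Floor(Enum):
    """Hypothesized floor states, from the system's point of view."""
    SYSTEM = "system holds the floor"
    USER = "user holds the floor"
    FREE = "gap: neither participant is claiming the floor"
    BOTH = "overlap: both participants are claiming the floor"

def floor_state(system_claims: bool, user_claiming_belief: float,
                threshold: float = 0.5) -> Floor:
    """Combine the system's own intention with its (uncertain) belief
    that the user is currently claiming the floor."""
    user_claims = user_claiming_belief >= threshold
    if system_claims and user_claims:
        return Floor.BOTH    # overlap -> risk of a cut-in
    if system_claims:
        return Floor.SYSTEM
    if user_claims:
        return Floor.USER
    return Floor.FREE        # gap -> risk of latency

# Example: system silent, voice activity suggests the user is speaking.
assert floor_state(False, 0.9) is Floor.USER
```

Because the belief is uncertain, a misestimated `user_claiming_belief` is exactly what produces the breakdowns (cut-ins, latency) discussed on the next slides.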
Slide 9
Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns in turn-taking: cut-ins, latency, barge-in latency, and self-interruptions.
Slide 10
Turn-Taking Errors by System
Cut-ins: the system grabs the floor before the user releases it.
U: kay. 51C Carrick from Century Square (...)
S: The 61...
Latency: the system waits after the user has released the floor.
S: (...) Is this correct?
U: Yeah.
S: Alright (...)
Slide 11
Turn-Taking Errors by System
Barge-in latency: the system keeps the floor while the user is claiming it.
S: What can I do for you? For example, you can say "when is..." Where would you li... Let's proceed step by step. Which neighb...
U: 61A.
S: Leaving from North Side. Is this correct?
Self-interruptions: the system releases the floor while the user is not claiming it.
S: For example, you can say "When is the next 28X from downtown to the airport?" or "I'd like to go from McKee..."
U: When is the next 54...
S: Leaving from Atwood. Is this correct?
Slide 12
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
A domain-independent data-driven turn-taking model
Conclusion
Slide 13
Pipeline Architectures
Speech Recognition → Natural Language Understanding → Dialog Management (+ Backend) → Natural Language Generation → Speech Synthesis
Turn-taking is imposed by full-utterance-based processing:
- sequential processing
- lack of reactivity
- no sharing of information across modules
- hard to extend to multimodal/asynchronous events
Slide 14
Multi-layer Architectures
Separate reactive from deliberative behavior: turn-taking vs dialog act planning. Different layers work asynchronously. [Thorisson 1996, Allen et al 2001, Lemon et al 2003]
But no previous work:
- addressed how the conversational floor interacts with dialog management
- successfully deployed a multi-layer architecture in a broadly used system
Slide 16
Olympus 2 Architecture
Speech Recognition and other Sensors feed Natural Language Understanding; Dialog Management (with the Backend) drives Natural Language Generation, Speech Synthesis, and other Actuators; an Interaction Management layer mediates between them. This layer:
- explicitly models turn-taking
- integrates dialog features from both low and high levels
- operates on generalized events and actions
- uses floor state to control planning of conversational acts
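The event-driven layering above can be sketched minimally: an interaction manager that consumes generalized events as they arrive, rather than waiting for complete utterances. All names below (`Event`, `InteractionManager`, the `vad` event source) are hypothetical illustrations, not Olympus 2 APIs.

```python
import queue
from dataclasses import dataclass

@dataclass
class Event:
    """A generalized event: speech activity, partial ASR results,
    synthesis progress, backend replies, etc. (illustrative only)."""
    source: str
    payload: object

class InteractionManager:
    """Reactive layer: updates floor state event by event, so the
    deliberative dialog manager can plan asynchronously on top of it."""
    def __init__(self):
        self.events = queue.Queue()
        self.floor = "free"

    def post(self, event: Event):
        self.events.put(event)

    def run_once(self) -> str:
        event = self.events.get()
        if event.source == "vad" and event.payload == "speech_start":
            self.floor = "user"      # user is claiming the floor
        elif event.source == "vad" and event.payload == "speech_end":
            self.floor = "free"      # dialog manager may now take a turn
        return self.floor

im = InteractionManager()
im.post(Event("vad", "speech_start"))
assert im.run_once() == "user"
```

The design point, per the slide, is that floor state lives in this reactive layer and gates when the deliberative layer's planned conversational acts are actually executed.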
Slide 17
Olympus 2 Deployment
Ported Let's Go to Olympus 2:
- a publicly deployed telephone bus information system, originally built using Olympus 1
- the new version has processed about 30,000 dialogs since deployment, with no performance degradation
- allows research on turn-taking models to be guided by real users' behavior
Slide 18
Outline
Introduction
An event-driven architecture for spoken dialog systems
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 19
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport.
Detecting when the user releases the floor. Potential problems: cut-ins, latency.
Slide 20
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport. ← end of turn
Slide 21
Latency / Cut-in Tradeoff
Long threshold → few cut-ins, long latency
Slide 22
Latency / Cut-in Tradeoff
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Can we exploit dialog information to get the best of both worlds?
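The tradeoff above can be illustrated numerically. This is a sketch under invented pause durations: with a fixed endpointing threshold, the cut-in rate is the fraction of within-turn pauses longer than the threshold, while the response latency after a genuine end of turn is at least the threshold itself.

```python
# Hypothetical durations (ms) of pauses *within* user turns; real
# distributions would come from corpus data, not these made-up values.
internal_pauses = [120, 250, 400, 650, 900, 1200]

def tradeoff(threshold_ms: float):
    """For a fixed silence threshold:
    - cut-in: an internal pause outlasts the threshold, so the system
      wrongly takes the floor mid-turn;
    - latency: after a real end of turn, the system always waits the
      full threshold before responding."""
    cut_ins = sum(p > threshold_ms for p in internal_pauses)
    cut_in_rate = cut_ins / len(internal_pauses)
    latency_ms = threshold_ms
    return cut_in_rate, latency_ms

# Short threshold: fast but interrupts often; long threshold: polite but slow.
assert tradeoff(200) == (5 / 6, 200)
assert tradeoff(1500) == (0.0, 1500)
```

No single fixed threshold wins on both axes, which motivates the variable, dialog-informed thresholds developed next.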
Slide 23
End-of-Turn Detection as Classification
Classify pauses as internal/final based on words, syntax, prosody [Sato et al, 2002]. Repeat the classification every n milliseconds until the pause ends or end-of-turn is detected [Ferrer et al, 2003, Takeuchi et al, 2004].
But no previous work:
- successfully combined a wide range of features
- tested the model in a real dialog system
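The repeated-classification scheme cited above can be sketched as a loop: while the silence lasts, re-score the pause every n milliseconds and commit to end-of-turn once the classifier is confident. The step size, confidence cutoff, and toy classifier below are assumptions for illustration, not values from the cited papers.

```python
def detect_end_of_turn(pause_ms, classify, step_ms=100):
    """Re-run an internal/final pause classifier every step_ms while
    the pause lasts. `classify(elapsed_ms)` returns P(turn is final)
    given the silence observed so far. Returns the elapsed silence at
    which end-of-turn was declared, or None if the user resumed first."""
    elapsed = 0
    while elapsed < pause_ms:          # pause is still ongoing
        elapsed += step_ms
        if classify(elapsed) >= 0.8:   # confident the turn is over
            return elapsed             # respond after `elapsed` ms
    return None                        # internal pause: user spoke again

# Toy classifier whose confidence grows with silence duration.
assert detect_end_of_turn(1000, lambda t: t / 1000) == 800
assert detect_end_of_turn(300, lambda t: t / 1000) is None
```

Unlike a fixed timeout, the wait adapts per pause: a confident classifier ends the turn early, while ambiguous pauses get more time.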
Slide 24
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 25
Using Variable Thresholds
S: What can I do for you?
U: I'd like to go to the airport.
Features for setting the pause threshold:
- Discourse (dialog state: open question, specific question, confirmation)
- Semantics (partial ASR: does the partial hypothesis match current expectations?)
- Prosody (F0, duration)
- Timing (pause start)
- Speaker (average number of pauses)
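A hand-written sketch of how such features could drive a variable threshold follows. The splits and millisecond values are invented for illustration; in the thesis the thresholds are learned from data by a decision tree.

```python
def pause_threshold_ms(dialog_state: str,
                       partial_matches_expectation: bool,
                       avg_pause_ms: float) -> int:
    """Pick a per-pause endpointing threshold from dialog features.
    Feature names mirror the slide; all split points and returned
    values are made up, not the learned tree."""
    if partial_matches_expectation:
        # The partial hypothesis already answers the current prompt:
        # the user has probably finished, so wait only briefly.
        return 250 if avg_pause_ms < 200 else 500
    if dialog_state == "open_question":
        # Open prompts invite long, hesitant answers: wait longer.
        return 1200
    return 700  # fall back near a typical fixed threshold

assert pause_threshold_ms("confirmation", True, 150) == 250
assert pause_threshold_ms("open_question", False, 400) == 1200
```

The point of the structure is the same as the learned tree's: confident, expectation-matching partials earn short waits, while open questions and hesitant speakers earn long ones.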
Slide 26
Example Decision Tree
[Decision tree figure: internal nodes test features such as utterance duration < 2000 ms, whether the partial ASR matches expectations or contains "YES", whether it has fewer than 3 words or is available at all, average pause duration (< 200 ms, < 300 ms), whether the dialog state is an open question, average non-understanding ratio < 15%, and consecutive user turns without a system prompt; leaves assign pause thresholds ranging from 200 ms to 1440 ms.]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 27
Example Decision Tree
[Same decision tree figure, here evaluated on the partial hypothesis "I'd like to go to".]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 28
Example Decision Tree
[Same decision tree figure, here evaluated on the complete utterance "I'd like to go to the airport."]
Trained on 1326 dialogs with the Let's Go public dialog system.
Slide 29
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 30
Performance per Feature Set
22% latency reduction; 38% cut-in rate reduction.
Slide 31
Performance per Feature Set
Semantics is the most useful feature type. [Chart compares each individual feature set against all features combined.]
Slide 32
Outline
Introduction
An architecture for dialog and interaction management
Using dialog features to inform turn-taking
- End-of-turn detection
- Decision tree-based thresholds
- Batch evaluation
- Live evaluation
Slide 33
Live Evaluation
Implemented the decision tree in the Let's Go IM. Operating point: 3% cut-in rate, 635 ms average latency.
1061 dialogs collected in May 2008:
- 548 control dialogs (fixed threshold = 700 ms)
- 513 treatment dialogs (decision tree)
Slide 34
Cut-in Rate per Dialog State
Largest improvement: after open requests. Fewer cut-ins overall (p