Page 1

Evaluating Human-Machine Conversation for Appropriateness

David Benyon, Preben Hansen, Oli Mival and Nick Webb

Page 2

Overview

• www.companions-project.org

• Companions are intended as persistent, collaborative, conversational partners

• Rather than a single task, Companions address a range of tasks

• Completion of tasks is important

• So is conversational performance

Page 3

Metrics

• Objective measures (WER sketch below)
– WER, CER, Turn Duration, Vocabulary…

• Subjective user measures
– User satisfaction surveys

• Appropriateness
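As a concrete illustration of the first objective measure above, here is a minimal sketch of word error rate (WER): word-level edit distance between a reference transcript and the recogniser's hypothesis, normalised by reference length. The function name and example strings are illustrative, not taken from the Companions toolkit.

# Word error rate: word-level Levenshtein distance divided by reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("how about ordering lunch", "how about ordering launch") == 0.25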

Page 4

Appropriateness

• D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction, in LREC, 2004.

• Alongside traditional measures, introduces the concept of “response appropriateness”

• Created for the ICT/ISI Mission Rehearsal Exercise system

Page 5

Initial Companion Evaluation

• 2 Companion prototypes
– Health & Fitness
– Senior Companion

• 8 users completed entire protocol

• All participants were native English speakers without strong accents

• Ages from 27 to 61

• 2 were female, 6 were male

Page 6

Initial Companion Evaluation

• New version (2.0) of Senior Companion
– 12 new participants
– 9 male, 3 female (ages 21-38)

• Key changes
– Facebook photographs (pre-tagged)
– Loquendo TTS elements (cough, laugh)
– Additional “chat” ability from a chatbot

• Improved metric results
– Avg. words / utterance: 4.27 (v1) to 6.1 (v2)

Page 7

Chart: “I found the Companion engaging” (v1.0 SC vs v2.0)

Page 8

Chart: “The Companion demonstrated emotion at times” (v1.0 SC vs v2.0)

Page 9

Appropriateness

• Traum et al. devised an “appropriateness” coding scheme.

• Separate tag sets for system and user utterances

• For users:
– Response To System [RTS]
– Gets RESponse [RES]
– No Response: Appropriate [NRA]
– No Response: Not appropriate [NRN]

Page 10

3rd Phase - Appropriateness

• For agents (tag sets sketched after this list):

– Filled Pause [FP]

– Request for Repair [RR]

– Appropriate Response [AR]

– Appropriate Question [AQ]

– Appropriate new INItiative [INI]

– Appropriate CONtinuation [CON]

– iNAPpropriate response, initiative or continuation [NAP]

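A minimal sketch of how the two tag sets above could be represented in an annotation or scoring tool; the tag codes and descriptions come from the slides, while the structure and the helper function are illustrative assumptions.

# Appropriateness tag sets from Traum et al., split by speaker side.
USER_TAGS = {
    "RTS": "Response To System",
    "RES": "Gets RESponse",
    "NRA": "No Response: Appropriate",
    "NRN": "No Response: Not appropriate",
}

AGENT_TAGS = {
    "FP":  "Filled Pause",
    "RR":  "Request for Repair",
    "AR":  "Appropriate Response",
    "AQ":  "Appropriate Question",
    "INI": "Appropriate new INItiative",
    "CON": "Appropriate CONtinuation",
    "NAP": "iNAPpropriate response, initiative or continuation",
}

def is_legal_tag(tag: str, speaker: str) -> bool:
    """True if the tag belongs to the right tag set ("S" = system/agent, "U" = user)."""
    return tag in (AGENT_TAGS if speaker == "S" else USER_TAGS)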

Page 11

Scoring Intuitions (scoring sketch after this list)

• Filled pauses generally human-like and good for virtual agents to perform, but don’t add a lot (0)

• Appropriate responses and questions very good (+2), but initiatives that push the interaction back on track are better (+3)

• Extended contributions on topic somewhat good (+0.5)

• Repairs and clarifications bad (-0.5), but their use can still gain points by enabling a subsequent appropriate response

• Inappropriate response bad (-1), no response worse (-2)
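A minimal sketch turning the intuitions above into a score: the weights are the ones on this slide, applied to the agent-side tags from the previous slides. The table and function names are illustrative, not the project’s actual scoring code.

# Weights from the scoring intuitions above, keyed by agent-side tag.
TAG_SCORES = {
    "FP":  0.0,   # filled pause: human-like but adds little
    "AR":  2.0,   # appropriate response
    "AQ":  2.0,   # appropriate question
    "INI": 3.0,   # initiative that pushes the interaction back on track
    "CON": 0.5,   # extended contribution on topic
    "RR": -0.5,   # repair / clarification request
    "NAP": -1.0,  # inappropriate response, initiative or continuation
}
NO_RESPONSE_SCORE = -2.0  # system fails to respond at all

def dialogue_score(agent_tags: list) -> float:
    """Total appropriateness score over one dialogue's system turns.
    None marks a system turn with no response at all."""
    return sum(NO_RESPONSE_SCORE if tag is None else TAG_SCORES[tag]
               for tag in agent_tags)

# e.g. dialogue_score(["AQ", "RR", "AQ"]) == 3.5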

Page 12

Appropriateness Evaluation

• 7 Health & Fitness Companion (HFC) and 13 Senior Companion (SC) dialogues

• 4 pre-chatbot and 9 post-chatbot SC dialogues

• All dialogues annotated by a single annotator

• To check consistency, one dialogue of each type annotated by at least 3 annotators

• Inter-coder reliability above 0.8 (Krippendorff’s alpha) in all cases (agreement sketch below)
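A from-scratch sketch of the agreement check mentioned above: Krippendorff’s alpha for nominal data, computed from the tags each annotator assigned to the same utterances. This is a generic implementation of the standard coincidence-matrix formulation, not the evaluation scripts used in the project.

from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units: list) -> float:
    """units holds, per annotated utterance, the list of tags assigned by the coders;
    utterances coded by fewer than two coders are ignored."""
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):   # ordered pairs across coders
            coincidences[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), weight in coincidences.items():
        marginals[a] += weight
    n = sum(marginals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(marginals[a] * marginals[b]
                   for a, b in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 - observed / expected   # undefined (zero division) if coders never disagree anywhere

# e.g. krippendorff_alpha_nominal([["AQ", "AQ", "AQ"], ["AR", "AR", "NAP"]]) ≈ 0.55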

Page 13

Sample HFC Dialogue

<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:06 PM" turn_no="16">Alright. How about ordering lunch from a takeaway?</System> [Appropriate Question]

<User time="02:51:13 PM" turn_no="17" /> [Response To System]

<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:25 PM" turn_no="18">Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway?</System> [Request Repair]

Page 14

Sample SC Dialogue (parsing sketch below)

<turn end_time="03:49:01 PM" mode="photo” speaker="S" start_time="03:48:59 PM" turn_no="65">What is your relationship to Libby.</turn>

[Appropriate Question]

<turn end_time="03:49:04 PM" mode="photo” speaker="S" start_time="03:49:01 PM" turn_no="66">I'm sorry I didn't understand your relationship to Libby.</turn> [Inappropriate Response]

<turn end_time="03:49:19 PM" mode="photo” speaker="U" turn_no="70"> could be as my friend</turn> [Response To System]

Page 15

Average Score

Page 16

Per Utterance Score

Page 17

Tag Distribution
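The three chart slides above report summary statistics over the annotated dialogues. The slides do not spell out the exact definitions, so the following is only a plausible sketch: the average total appropriateness score per dialogue, the total divided by the number of tagged system turns, and the overall tag distribution, reusing dialogue_score from the scoring sketch earlier.

from collections import Counter
from statistics import mean

def summarise(dialogues: dict) -> dict:
    """dialogues maps a dialogue id to its system-side tag sequence
    (same format as dialogue_score above expects)."""
    totals = {d: dialogue_score(tags) for d, tags in dialogues.items()}
    return {
        "average_score": mean(totals.values()),
        "per_utterance_score": {d: totals[d] / len(tags)
                                for d, tags in dialogues.items() if tags},
        "tag_distribution": Counter(tag for tags in dialogues.values()
                                    for tag in tags if tag is not None),
    }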

Page 18

Initial Conclusions

• Appropriateness scores seem to correlate with improvement in user responses (needs further investigation)

• Reliably encoded by annotators

• Indicates problem areas in dialogue

Page 19

Tools and Resources

• XML encoded dialogue corpus

• Corpus collection tool

• Appropriateness annotation guidelines

• Appropriateness annotation tool

Page 20

Next Steps

• Refine appropriateness measures
– Add new tags: confirmation, politeness, emotion
– Modify existing tags: specific inappropriate tags

• No upper bounds on performance yet – these require WoZ models

• Need to monitor users’ behaviour over time

• Use the scoring system to inform reinforcement learning (see the sketch below)
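A minimal sketch of the last point, assuming the per-turn appropriateness weights are fed in as an immediate reward to a standard tabular Q-learning update over dialogue states and system actions. The state and action names are placeholders; the slides do not describe how reinforcement learning would actually be wired into the Companions dialogue manager.

import random
from collections import defaultdict

Q = defaultdict(float)                       # Q-values over (state, action) pairs
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # learning rate, discount, exploration
ACTIONS = ["respond", "ask_question", "take_initiative", "request_repair"]

def choose_action(state: str) -> str:
    """Epsilon-greedy selection of the next system action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state: str, action: str, appropriateness_score: float, next_state: str) -> None:
    """One Q-learning step; the reward is the appropriateness weight of the
    resulting system turn (e.g. +2 for an appropriate response, -1 for NAP)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (appropriateness_score + GAMMA * best_next
                                   - Q[(state, action)])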