Page 1

Evaluating Human-Machine Conversation for Appropriateness

David Benyon, Preben Hansen, Oli Mival and Nick Webb

Page 2

Overview

• www.companions-project.org

• Companions are intended as persistent, collaborative, conversational partners

• Rather than a single task, Companions address a range of tasks

• Completion of tasks is important

• So is conversational performance

Page 3

Metrics

• Objective measures (WER sketch below)
– WER, CER, Turn Duration, Vocabulary…

• Subjective user measures
– User satisfaction surveys

• Appropriateness
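As a concrete illustration of the first objective measure above, here is a minimal sketch of word error rate (WER): word-level edit distance between a reference transcript and the recogniser's hypothesis, normalised by reference length. The function name and example strings are illustrative, not taken from the Companions toolkit.

# Word error rate: word-level Levenshtein distance divided by reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("how about ordering lunch", "how about ordering launch") == 0.25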

Page 4

Appropriateness

• D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction, in LREC, 2004.

• Alongside traditional measures, introduces the concept of “response appropriateness”

• Created for the ICT/ISI Mission Rehearsal Exercise system

Page 5

Initial Companion Evaluation

• 2 Companion prototypes
– Health & Fitness
– Senior Companion

• 8 users completed entire protocol

• All participants were native English speakers without strong accents

• Ages from 27 to 61

• 2 were female, 6 were male

Page 6

Initial Companion Evaluation

• New version (2.0) of Senior Companion
– 12 new participants
– 9 male, 3 female (ages 21-38)

• Key changes
– Facebook photographs (pre-tagged)
– Loquendo TTS elements (cough, laugh)
– Additional “chat” ability from a chatbot

• Improved metric results
– Avg. words / utterance: 4.27 (v1) to 6.1 (v2)

Page 7

Chart: “I found the Companion engaging” (v1.0 SC vs v2.0)

Page 8

Chart: “The Companion demonstrated emotion at times” (v1.0 SC vs v2.0)

Page 9

Appropriateness

• Traum et al. devised an “appropriateness” coding scheme.

• Separate tag sets for system and user utterances

• For users:
– Response To System [RTS]
– Gets RESponse [RES]
– No Response: Appropriate [NRA]
– No Response: Not appropriate [NRN]

Page 10

3rd Phase - Appropriateness

• For agents (tag sets sketched after this list):

– Filled Pause [FP]

– Request for Repair [RR]

– Appropriate Response [AR]

– Appropriate Question [AQ]

– Appropriate new INItiative [INI]

– Appropriate CONtinuation [CON]

– iNAPpropriate response, initiative or continuation [NAP]

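A minimal sketch of how the two tag sets above could be represented in an annotation or scoring tool; the tag codes and descriptions come from the slides, while the structure and the helper function are illustrative assumptions.

# Appropriateness tag sets from Traum et al., split by speaker side.
USER_TAGS = {
    "RTS": "Response To System",
    "RES": "Gets RESponse",
    "NRA": "No Response: Appropriate",
    "NRN": "No Response: Not appropriate",
}

AGENT_TAGS = {
    "FP":  "Filled Pause",
    "RR":  "Request for Repair",
    "AR":  "Appropriate Response",
    "AQ":  "Appropriate Question",
    "INI": "Appropriate new INItiative",
    "CON": "Appropriate CONtinuation",
    "NAP": "iNAPpropriate response, initiative or continuation",
}

def is_legal_tag(tag: str, speaker: str) -> bool:
    """True if the tag belongs to the right tag set ("S" = system/agent, "U" = user)."""
    return tag in (AGENT_TAGS if speaker == "S" else USER_TAGS)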

Page 11

Scoring Intuitions (scoring sketch after this list)

• Filled pauses generally human-like and good for virtual agents to perform, but don’t add a lot (0)

• Appropriate responses and questions very good (+2), but initiatives that push the interaction back on track are better (+3)

• Extended contributions on topic somewhat good (+0.5)

• Repairs and clarifications bad (-0.5), but their use can still gain points by enabling a subsequent appropriate response

• Inappropriate response bad (-1), no response worse (-2)
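A minimal sketch turning the intuitions above into a score: the weights are the ones on this slide, applied to the agent-side tags from the previous slides. The table and function names are illustrative, not the project’s actual scoring code.

# Weights from the scoring intuitions above, keyed by agent-side tag.
TAG_SCORES = {
    "FP":  0.0,   # filled pause: human-like but adds little
    "AR":  2.0,   # appropriate response
    "AQ":  2.0,   # appropriate question
    "INI": 3.0,   # initiative that pushes the interaction back on track
    "CON": 0.5,   # extended contribution on topic
    "RR": -0.5,   # repair / clarification request
    "NAP": -1.0,  # inappropriate response, initiative or continuation
}
NO_RESPONSE_SCORE = -2.0  # system fails to respond at all

def dialogue_score(agent_tags: list) -> float:
    """Total appropriateness score over one dialogue's system turns.
    None marks a system turn with no response at all."""
    return sum(NO_RESPONSE_SCORE if tag is None else TAG_SCORES[tag]
               for tag in agent_tags)

# e.g. dialogue_score(["AQ", "RR", "AQ"]) == 3.5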

Page 12

Appropriateness Evaluation

• 7 Health & Fitness Companion (HFC) and 13 Senior Companion (SC) dialogues

• 4 pre-chatbot and 9 post-chatbot SC dialogues

• All dialogues annotated by a single annotator

• To check consistency, one dialogue of each type annotated by at least 3 annotators

• Inter-coder reliability above 0.8 (Krippendorff’s alpha) in all cases (agreement sketch below)
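A from-scratch sketch of the agreement check mentioned above: Krippendorff’s alpha for nominal data, computed from the tags each annotator assigned to the same utterances. This is a generic implementation of the standard coincidence-matrix formulation, not the evaluation scripts used in the project.

from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units: list) -> float:
    """units holds, per annotated utterance, the list of tags assigned by the coders;
    utterances coded by fewer than two coders are ignored."""
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):   # ordered pairs across coders
            coincidences[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), weight in coincidences.items():
        marginals[a] += weight
    n = sum(marginals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(marginals[a] * marginals[b]
                   for a, b in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 - observed / expected   # undefined (zero division) if coders never disagree anywhere

# e.g. krippendorff_alpha_nominal([["AQ", "AQ", "AQ"], ["AR", "AR", "NAP"]]) ≈ 0.55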

Page 13

Sample HFC Dialogue

<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:06 PM" turn_no="16">Alright. How about ordering lunch from a takeaway?</System> [Appropriate Question]

<User time="02:51:13 PM" turn_no="17" /> [Response To System]

<System CS="SUGGEST-MEAL-SOURCE ORDER-TAKEAWAY LUNCH" time="02:51:25 PM" turn_no="18">Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway?</System> [Request Repair]

Page 14

Sample SC Dialogue (parsing sketch below)

<turn end_time="03:49:01 PM" mode="photo” speaker="S" start_time="03:48:59 PM" turn_no="65">What is your relationship to Libby.</turn>

[Appropriate Question]

<turn end_time="03:49:04 PM" mode="photo” speaker="S" start_time="03:49:01 PM" turn_no="66">I'm sorry I didn't understand your relationship to Libby.</turn> [Inappropriate Response]

<turn end_time="03:49:19 PM" mode="photo” speaker="U" turn_no="70"> could be as my friend</turn> [Response To System]

Page 15

Average Score

Page 16

Per Utterance Score

Page 17

Tag Distribution
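The three chart slides above report summary statistics over the annotated dialogues. The slides do not spell out the exact definitions, so the following is only a plausible sketch: the average total appropriateness score per dialogue, the total divided by the number of tagged system turns, and the overall tag distribution, reusing dialogue_score from the scoring sketch earlier.

from collections import Counter
from statistics import mean

def summarise(dialogues: dict) -> dict:
    """dialogues maps a dialogue id to its system-side tag sequence
    (same format as dialogue_score above expects)."""
    totals = {d: dialogue_score(tags) for d, tags in dialogues.items()}
    return {
        "average_score": mean(totals.values()),
        "per_utterance_score": {d: totals[d] / len(tags)
                                for d, tags in dialogues.items() if tags},
        "tag_distribution": Counter(tag for tags in dialogues.values()
                                    for tag in tags if tag is not None),
    }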

Page 18

Initial Conclusions

• Appropriateness scores seem to correlate with improvement in user responses (needs further investigation)

• Reliably encoded by annotators

• Indicates problem areas in dialogue

Page 19

Tools and Resources

• XML encoded dialogue corpus

• Corpus collection tool

• Appropriateness annotation guidelines

• Appropriateness annotation tool

Page 20

Next Steps

• Refine appropriateness measures
– Add new tags: confirmation, politeness, emotion
– Modify existing tags: specific inappropriate tags

• No upper bounds on performance yet – these require WoZ models

• Need to monitor users’ behaviour over time

• Use the scoring system to inform reinforcement learning (see the sketch below)
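A minimal sketch of the last point, assuming the per-turn appropriateness weights are fed in as an immediate reward to a standard tabular Q-learning update over dialogue states and system actions. The state and action names are placeholders; the slides do not describe how reinforcement learning would actually be wired into the Companions dialogue manager.

import random
from collections import defaultdict

Q = defaultdict(float)                       # Q-values over (state, action) pairs
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # learning rate, discount, exploration
ACTIONS = ["respond", "ask_question", "take_initiative", "request_repair"]

def choose_action(state: str) -> str:
    """Epsilon-greedy selection of the next system action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state: str, action: str, appropriateness_score: float, next_state: str) -> None:
    """One Q-learning step; the reward is the appropriateness weight of the
    resulting system turn (e.g. +2 for an appropriate response, -1 for NAP)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (appropriateness_score + GAMMA * best_next
                                   - Q[(state, action)])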