Facial expression as an input annotation modality for affective speech-to-speech translation
Éva Székely, Zeeshan Ahmed, Ingmar Steiner, Julie Carson-Berndsen
University College Dublin
Transcript
Slide 1
Slide 2
Facial expression as an input annotation modality for affective speech-to-speech translation
Éva Székely, Zeeshan Ahmed, Ingmar Steiner, Julie Carson-Berndsen
University College Dublin
Slide 3
Introduction
- Expressive speech synthesis in human interaction
- Speech-to-speech translation has audiovisual input, so the affective state does not need to be predicted from the text
Slide 4
Introduction
- Goal: transfer paralinguistic information from the source to the target language by means of an intermediate, symbolic representation: facial expression as an input annotation modality
- FEAST: Facial Expression-based Affective Speech Translation
Slide 5
System architecture of FEAST
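The slide shows the architecture as a diagram. Below is a minimal Python sketch of the same pipeline, with the stage callables standing for the components described on the following slides (SHORE-based face analysis, utterance-level SVM classification, style selection, MARY TTS synthesis). Since speech recognition and machine translation are listed as future work on slide 21, the translated text is assumed to be given.

```python
from typing import Callable, Iterable

# Mapping from classified emotion to MARY TTS voice style (slide 8).
EMOTION_TO_STYLE = {"happy": "cheerful", "sad": "depressed",
                    "angry": "aggressive", "neutral": "neutral"}

def feast_pipeline(frames: Iterable,        # decoded video frames
                   translated_text: str,    # MT output; future work, assumed given
                   analyse_face: Callable,  # per-frame analysis (SHORE, slide 6)
                   classify: Callable,      # utterance-level SVM (slides 7 and 11)
                   synthesise: Callable) -> bytes:
    """One utterance through the FEAST stages:
    analysis -> classification -> style selection -> synthesis."""
    scores = [analyse_face(f) for f in frames]
    emotion = classify(scores)
    return synthesise(translated_text, EMOTION_TO_STYLE[emotion])
```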
Slide 6
Face detection and analysis
- SHORE library for real-time face detection and analysis
- http://www.iis.fraunhofer.de/en/bf/bsy/produkte/shore/
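SHORE is a closed-source Fraunhofer C++ library, so the analysis call below is a hypothetical stand-in, not its real API; only the kind of output (a detected face plus per-frame expression scores) follows Fraunhofer's description. The frame extraction uses OpenCV for illustration.

```python
import cv2  # OpenCV, used here only to decode the input video

def extract_frames(video_path: str):
    """Yield video frames for per-frame face analysis."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame
    cap.release()

def analyse_face(frame):
    """Hypothetical stand-in for a SHORE call: a real binding would return a
    face box plus expression scores (e.g. happy/sad/angry/surprised)."""
    raise NotImplementedError("replace with a binding to the SHORE engine")
```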
Slide 7
Emotion classification and style selection
- Aim of the facial expression analysis in the FEAST system: a single decision regarding the emotional state of the speaker over each utterance
- Visual emotion classifier, trained on segments of the SEMAINE database, with input features from SHORE
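The slides do not specify how per-frame SHORE scores become one decision per utterance; a plausible sketch, assuming the frame-level expression scores are pooled with simple statistics before a single SVM prediction (the pooling choice is an assumption, only the one-decision-per-utterance aim is from the slide):

```python
import numpy as np

def utterance_features(frame_scores):
    """Pool per-frame expression scores (n_frames x n_expressions) into one
    fixed-length vector; mean/std/max pooling is an assumption here."""
    s = np.asarray(frame_scores, dtype=float)
    return np.concatenate([s.mean(axis=0), s.std(axis=0), s.max(axis=0)])

def classify_utterance(frame_scores, svm):
    """Single emotion label for the whole utterance, using the trained SVM
    from Experiment 1 (slide 11)."""
    return svm.predict([utterance_features(frame_scores)])[0]
```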
Slide 8
Expressive speech synthesis
- Expressive unit-selection synthesis using the open-source synthesis platform MARY TTS
- German male voice dfki-pavoque-styles with four styles: cheerful, depressed, aggressive, neutral
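A minimal sketch of requesting one of the pavoque styles from a running MARY TTS server over its HTTP interface. The host/port and the STYLE request parameter are assumptions based on the generic MARY 5 interface, not taken from the slides; the voice name and the four styles are from the slide.

```python
import requests  # assumes a local MARY TTS server on its default port

def synthesise(text: str, style: str,
               host: str = "http://localhost:59125") -> bytes:
    """Request expressive German synthesis from MARY TTS and return WAV bytes.

    STYLE as the parameter name is an assumption based on MARY's
    generic HTTP interface.
    """
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "de",
        "VOICE": "dfki-pavoque-styles",
        "STYLE": style,  # cheerful | depressed | aggressive | neutral
    }
    resp = requests.get(host + "/process", params=params)
    resp.raise_for_status()
    return resp.content
```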
Slide 9
The SEMAINE database (semaine-db.eu)
- Audiovisual database collected to study natural social signals occurring in English conversations
- Conversations with four emotionally stereotyped characters:
  - Poppy (happy, outgoing)
  - Obadiah (sad, depressive)
  - Spike (angry, confrontational)
  - Prudence (even-tempered, sensible)
Slide 10
Evaluation experiments
1. Does the system accurately classify emotion at the utterance level, based on the facial expression in the video input?
2. Do the synthetic voice styles succeed in conveying the target emotion category?
3. Do listeners agree with the cross-lingual transfer of paralinguistic information from the multimodal stimuli to the expressive synthetic output?
Slide 11
Experiment 1: Classification of facial expressions
- Support Vector Machine (SVM) classifier trained on utterances of the male operators from the SEMAINE database
- 535 utterances used for training, 107 for testing
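The slide gives only the classifier type and the split sizes; a sketch of the corresponding training and evaluation step with scikit-learn, where the feature matrices stand for the pooled SHORE features (slide 7) and the RBF kernel is an assumption:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Train and evaluate an utterance-level emotion SVM.

    X_* are utterance feature vectors, y_* the emotion labels; in the
    paper's setup there are 535 training and 107 test utterances from
    the SEMAINE male operators. The kernel choice is an assumption,
    the slide only states that an SVM is used.
    """
    svm = SVC(kernel="rbf")
    svm.fit(X_train, y_train)
    pred = svm.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    return svm
```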
Slide 12
Experiment 2: Perception of expressive synthesis
- Perception experiment with 20 subjects
- Subjects listened to natural and synthesised stimuli and chose which voice style describes the utterance best: cheerful, depressed, aggressive, neutral
Slide 13
Experiment 2: Results
Slide 14
Experiment 3: Adequacy for speech-to-speech translation
- Perceptual experiment with 14 bilingual participants
- 24 utterances from the SEMAINE operator data and their corresponding translations in each voice style
- Listeners were asked to choose which German translation matches the original video best
Slide 15
Examples - Poppy (happy)
Slide 16
Examples - Prudence (neutral)
Slide 17
Examples - Spike (angry)
Slide 18
Examples - Obadiah (sad)
Slide 19
Experiment 3: Results
Slide 20
Conclusion
- Preserving the paralinguistic content of a message across languages is possible with significantly greater-than-chance accuracy
- The visual emotion classifier performed with an overall accuracy of 63.5%
- Cheerful/happy is often mistaken for neutral (conditioned by the voice)
Slide 21
Future Work
- Extend the classifier to predict the affective state of the user from acoustic and prosodic analysis as well as facial expressions
- Demonstrate the prototype system with live input through a webcam and microphone
- Integrate a speech recogniser and a machine translation component