L-Devillers - Plenary, 5 June 2007
Expression of emotions in speech synthesis
Marc Schröder, DFKI, [email protected]
Overview
Challenge: a real-time system for "real-life" emotional speech detection, in order to build an affectively competent agent.
Emotion is considered in the broad sense.
Real-life emotions are often shaded, blended, or masked because of social aspects.
State of the art
• Static emotion detection systems (emotional unit level: word, chunk, sentence)
• Statistical approaches (such as SVMs) that use large amounts of data to train models
• 4-6 emotions detected, rarely more
[Figure: components of an automatic emotion recognition system. Features are extracted from the observation O; emotion detection is based on P(E_i | O), computed with the emotion models E.]
Performance on realistic data (CEICES): above 80% for 2 emotions, above 60% for 4 emotions.
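The decision rule in the scheme, picking the emotion E_i with the highest P(E_i | O), can be sketched as follows. This is only an illustration under stated assumptions: the slides name SVMs as a typical statistical approach, whereas this sketch swaps in diagonal Gaussian class models because they give a posterior directly. The feature values, model parameters and priors are all hypothetical.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(observation, models, priors):
    """Return P(E_i | O) for every emotion E_i using Bayes' rule."""
    log_joint = {
        emotion: math.log(priors[emotion])
        + gaussian_log_likelihood(observation, mean, var)
        for emotion, (mean, var) in models.items()
    }
    # Normalize in the log domain (log-sum-exp) for numerical stability.
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {emotion: math.exp(v - log_norm) for emotion, v in log_joint.items()}

def detect(observation, models, priors):
    """Return the emotion with the highest posterior probability."""
    post = posteriors(observation, models, priors)
    return max(post, key=post.get)

# Hypothetical toy models: (mean, variance) over [mean F0 in Hz, energy in dB].
MODELS = {
    "neutral": ([120.0, 60.0], [400.0, 25.0]),
    "anger": ([180.0, 72.0], [900.0, 36.0]),
}
PRIORS = {"neutral": 0.7, "anger": 0.3}

if __name__ == "__main__":
    print(detect([185.0, 71.0], MODELS, PRIORS))
    print(detect([118.0, 61.0], MODELS, PRIORS))
```

In a real system the observation would be the extracted feature vector for one emotional unit (word, chunk or sentence), and the models would be trained on annotated data.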
The difficulty of the detection task increases with the variability of emotional speech expression.
4 dimensions:
• Speaker (dependent/independent, age, gender, health)
• Environment (transmission channel, noise environment)
• Number and type of emotions (primary, secondary)
• Acted/real-life data and application context
Automatic emotion detection: research evolution
[Figure: timeline from 2003 to 2007. Surviving labels: more than 4 acted emotions; more than 5 real emotions; emotion in interaction. Data sources: WoZ, call center data, HMI. Voice superposition.]
Challenge with spontaneous emotions
• Authenticity is present, but there is no control over the emotion
• Need to find appropriate labels and measures for annotation validation
• Blended emotions (Scherer: Geneva Airport Lost Luggage Study)
Annotation and validation of annotation
• Expert annotation phase by several coders (10 coders; CEICES: 5 coders; often only two)
• Control of the quality of annotations:
  • Validate the annotation scheme and the annotations
    Perception of emotion mixtures (40 subjects), NEG/POS valence
    Importance of the context
  • Give a measure for comparing human perception with automatic detection
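One common way to quantify agreement between two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which measure was used, so kappa is an assumption here, and the label sequences are hypothetical. A minimal sketch:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of segments where both coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both coders labelled independently
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)

if __name__ == "__main__":
    # Hypothetical annotations of four segments by two coders.
    coder_1 = ["anger", "fear", "neutral", "neutral"]
    coder_2 = ["anger", "neutral", "neutral", "neutral"]
    print(round(cohen_kappa(coder_1, coder_2), 3))
```

The same function can compare a human reference annotation with the output of an automatic detector, which gives one possible measure of the kind the last bullet asks for. With more than two coders, a multi-rater statistic would be needed instead.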
[Corpus table, partially recovered. Movies: 7 h, 400 speakers; labels: Fear, other negative, positive. Modalities listed: Audio, Audiovisual.]
Context-dependent emotion labels
Do the labels represent the emotion of a given task or context?
Example: real-life emotion studies (call center). The Fear label covers different expressions of fear arising in different contexts: callers' fear of losing money, callers' fear for their lives, agents' fear of making a mistake.
The difference is not just a question of intensity/activation:
-> Primary/secondary fear?
-> Degree of urgency/reality of the threat?
Fear in fiction (movies): study of many different contexts.
How to generalize? Should we define labels as a function of the type of context? We defined only the social role (agent/caller) as a context.
See the poster by C. Clavel.
Emotional labels
• The majority of detection systems use a discrete representation of emotion
• A sufficient amount of data is needed. To that end, we use a hierarchical organization of labels (LIMSI example)
LIMSI: results with paralinguistic cues (SVMs), from 2 to 5 emotion classes (% of correct detection)
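A hierarchical organization of labels lets sparse fine-grained annotations be pooled into broader classes until each class has enough training data. The sketch below illustrates the idea only: the five classes come from the slides, but the fine-grained labels and the two-way grouping are hypothetical and are not the actual LIMSI hierarchy.

```python
from collections import Counter

# HYPOTHETICAL fine-grained labels mapped to the five classes named in the slides.
FINE_TO_CLASS = {
    "irritation": "Anger",
    "hot anger": "Anger",
    "anxiety": "Fear",
    "stress": "Fear",
    "disappointment": "Sadness",
    "relief": "Relief",
    "neutral": "Neutral",
}

# HYPOTHETICAL two-way grouping of the five classes.
CLASS_TO_COARSE = {
    "Anger": "Negative",
    "Fear": "Negative",
    "Sadness": "Negative",
    "Relief": "Non-negative",
    "Neutral": "Non-negative",
}

def relabel(fine_labels, level):
    """Map fine labels to the requested level: 'fine', 'class' or 'coarse'."""
    if level == "fine":
        return list(fine_labels)
    classes = [FINE_TO_CLASS[label] for label in fine_labels]
    if level == "class":
        return classes
    if level == "coarse":
        return [CLASS_TO_COARSE[c] for c in classes]
    raise ValueError(f"unknown level: {level}")

if __name__ == "__main__":
    annotations = ["irritation", "anxiety", "neutral", "stress",
                   "relief", "neutral", "disappointment"]
    for level in ("fine", "class", "coarse"):
        print(level, dict(Counter(relabel(annotations, level))))
```

Moving up one level trades label precision for more examples per class, which is one way to read a results series that runs from 2 to 5 classes.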
25 best features for 5-emotion detection

Feature type                          # of cues in the 25 best
F0-related                            4
Energy                                5
Microprosody                          4
Formants                              2
Duration from phonetic alignment      4
Other cues from transcription         6
Differences in the media channel (phone/microphone), the type of data (adult vs. children, realistic vs. naturalistic) and the emotion classes all affect which feature set is most relevant.
Of our 5 classes, Sadness is the least well recognized when the cues are not mixed.
Features from all the classes were selected (they differ from one class to another).
Anger, Fear, Sadness, Relief, Neutral state
Real-life emotional system
Systems based on acted data are inadequate for detection on real-life data (Batliner).
GEMEP/CEMO comparison: different emotions. First experiments show an acceptable detection score only for Anger.
Real-life emotion studies are necessary.
Detection results on call center data: state of the art for "realistic emotions".
Challenges ahead
Short-term:
Acceptable solutions for targeted applications are within reach.
• Use a dynamic model of emotion for real-time emotion detection (history memory)
• New features: automatically extracted information on voice quality, affect bursts and disfluencies from the signal, which does not require exact speech recognition
• Detect relaxed/tense voice (Scherer)
• Add contextual knowledge to the blind statistical model: social role, type of action, regulation (adapting emotional expression to strategic interaction goals; face theory, Goffman)
Long-term:
• Emotion as a dynamic process based on an appraisal model
• Combining information at several levels: acoustic/linguistic, multimodal cues, plus contextual information (social role)
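The "history memory" idea can be illustrated with a very simple dynamic model: instead of deciding on each speaker turn in isolation, the detector keeps a running state that blends the previous state with the new per-turn posterior. The slides do not specify the model, so exponential smoothing, the `memory` weight and the posterior values below are all assumptions chosen for illustration.

```python
def smooth_posteriors(turn_posteriors, memory=0.6):
    """Blend each turn's posterior with the running state.

    memory is the weight given to the history (0 = no memory,
    values close to 1 = slow reaction to new evidence).
    """
    state = None
    smoothed = []
    for post in turn_posteriors:
        if state is None:
            state = dict(post)  # first turn: no history yet
        else:
            state = {
                emotion: memory * state[emotion] + (1.0 - memory) * post[emotion]
                for emotion in post
            }
        smoothed.append(state)
    return smoothed

def decisions(posterior_sequence):
    """Pick the most probable emotion for every turn."""
    return [max(post, key=post.get) for post in posterior_sequence]

if __name__ == "__main__":
    # Hypothetical per-turn posteriors from a static detector.
    turns = [
        {"neutral": 0.9, "anger": 0.1},
        {"neutral": 0.4, "anger": 0.6},
        {"neutral": 0.3, "anger": 0.7},
        {"neutral": 0.2, "anger": 0.8},
    ]
    print("static :", decisions(turns))
    print("dynamic:", decisions(smooth_posteriors(turns)))
```

With these numbers the static decisions switch to anger at the second turn, while the smoothed decisions switch only at the fourth, once the evidence has persisted. An appraisal-based dynamic model, as named under the long-term goals, would replace this fixed blending rule with one driven by the interaction context.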
Demo (coffee break…)