Music Affect Recognition:
The State-of-the-Art and Lessons Learned

Xiao Hu, Ph.D., The University of Hong Kong
Yi-Hsuan Eric Yang, Ph.D., Academia Sinica, Taiwan

ISMIR 2012 Tutorial
Speaker

Speaker
The Audience
• Do you believe that music is powerful?
• Why do you think so?
• Have you searched for music by affect?
• Have you searched for other things (photos, video) by affect?
• Have you questioned the difference between emotion and mood?
• Is your research related to affect?
Music Affect:
(Four slides of examples, images only)
Agenda
• Grand challenges on music affect
• Music affect taxonomy and annotation
• Automatic music affect analysis
  - Categorical approach
  - Multimodal approach
  - Dimensional approach
  - Temporal approach
• Beyond music
• Conclusion
Agenda: Grand challenges on music affect
Emotion or Mood?
• Mood: "relatively permanent and stable"
• Emotion: "temporary and evanescent"
• "Most of the supposed [psychological] studies of emotion in music are actually concerned with mood and association."

Meyer, Leonard B. (1956). Emotion and Meaning in Music. Chicago: University of Chicago Press.
Expressed or Induced
• Expressed: designated/indicated by a music piece
• Induced: evoked/felt by a listener
• Both are studied in MIR
• They mainly differ in how labels are collected:
  - "Indicate how you feel when listening to the music" (induced)
  - "Indicate the mood conveyed by the music" (expressed)
Which Moods? 1/2
• Different websites / studies use different terms
(Figures: Thayer's stress-energy model, which gives 4 clusters; Farnsworth's 10 adjective groups; the Tellegen-Watson-Clark model)
Which Moods? 2/2
• Lack of a general theory of emotions
  - Ekman's 6 basic emotions: anger, joy, surprise, disgust, sadness, fear
• Verbalization of emotional states is often a "distortion" (Meyer, 1956)
  - "Unspeakable feelings"
  - "A restful feeling throughout ... like one of going downstream while swimming"
Sources of Music Emotion
• Intrinsic (structural characteristics of the music)
  - e.g., modality -> happy vs. sad
  - What about melody?
• Extrinsic emotion (semantic context related to, but outside, the music)
• Lee et al. (2012) identified a range of factors in people's assessments of music mood
  - Lyrics, tempo, instrumentation, genre, delivery, and even cultural context
• Little is known about how these factors map to music mood

Lee, J. H., Hill, T., & Work, L. (2012). What does music mood mean for real users? In Proceedings of the iConference.
Let's ask the users… (Lee et al., 2012)
(Survey results shown on slide)
Data, data, data!
• Mood data is an extremely scarce resource
  - Annotation is time consuming
  - Consistency is low across annotators
• Existing public datasets on mood:
  - MoodSwings Turk dataset: 240 30-sec clips; arousal-valence scores
  - MIREX mood classification task: 600 30-sec clips in 5 mood clusters
  - MIREX tag classification task (mood sub-task): 3,469 30-sec clips in 18 mood-related tag groups
  - Yang's emotion regression dataset: 193 25-sec clips on an 11-level arousal-valence scale
Suboptimal Performance
• MIREX Mood Classification (2012): accuracy 46%-68%
• MIREX Tag Classification, mood subtask (2011)
Newer Challenges
• Cross-cultural applicability
  - Existing efforts focus on Western music
  - OS1 @ ISMIR 2012 (tomorrow): Yang & Hu, "Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs"
• Personalization
  - The ultimate solution to the subjectivity problem
• Contextualization
  - Even the same person's emotional responses change across times, locations, and occasions
  - PS1 @ ISMIR 2012 (tomorrow): Watson & Mandryk, "Modeling Musical Mood From Audio Features and Listening Context on an In-Situ Data Set"

• Commonly used on websites
  - Pick lists; browsable directories, etc.
Taxonomy vs. Folksonomy
• Taxonomy
  - Controlled, structured vocabulary
  - Often requires expert knowledge
  - Top-down and bottom-up approaches
  - e.g., (example shown on slide)
• Folksonomy
  - Uncontrolled, unstructured vocabulary
  - Social tags freely applied by users
  - Commonality exists across large numbers of tags
  - e.g., (example shown on slide)
Models in Music Psychology 1/2
• Categorical: Hevner's adjective circle (1936)

Hevner, K. (1936). Experimental studies of the elements of expression in music. American Journal of Psychology, 48.
Models in Music Psychology 2/2
• Dimensional: Russell's circumplex model

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.
Borrowing from Psychology into MIR
(Figures: Thayer's stress-energy model with 4 clusters; Farnsworth's 10 adjective groups; the Tellegen-Watson-Clark model)
• Grounded in music perception research, but lacking the social context of music listening (Juslin & Laukka, 2004)

Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. JNMR.
Taxonomy Built from Editorial Labels
• allmusic.com: "the most comprehensive music reference source on the planet"
• 288 mood labels created and assigned to music works
• Editorial labels:
  - Given by professional editors of online repositories
  - Have a certain level of control
  - Rooted in realistic social contexts
Mood Label Clustering
(Figure: dendrograms of mood labels for albums and for songs, each grouped into clusters C1-C5)

Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata. In Proceedings of ISMIR.
Laurier et al. (2009) Taxonomy from Social Tags 1/2
• Manually compiled 120 mood words from the literature
• Crawled 6.8M social tags from last.fm
  - 107 unique tags matched mood words
  - 80 tags with more than 100 occurrences

  Most used    Least used
  sad          rollicking
  fun          solemn
  melancholy   rowdy
  happy        tense

Laurier et al. (2009). Music mood representations from social tags. ISMIR.
Laurier et al. (2009) Taxonomy from Social Tags 2/2
• Used LSA to project the tag-track matrix into a 100-dimensional space
• Clustering trials with varying numbers of clusters

  cluster 1    cluster 2    cluster 3   cluster 4
  angry        sad          tender      happy
  aggressive   bittersweet  soothing    joyous
  visceral     sentimental  sleepy      bright
  rousing      tragic       tranquil    cheerful
  intense      depressing   quiet       humorous
  confident    sadness      calm        gay
  anger        spooky       serene      amiable
  +A -V        -A -V        -A +V       +A +V

Laurier et al. (2009). Music mood representations from social tags. ISMIR.
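As an illustration of the pipeline above, here is a minimal sketch using scikit-learn: LSA (truncated SVD) projects a binary tag-track matrix into a latent space, and k-means groups the mood tags. The toy matrix, tag list, and dimensionalities are placeholders, not the actual data or settings of Laurier et al.

```python
# A minimal sketch of an LSA + clustering pipeline over a tag-track matrix.
# The toy matrix and tag list below are placeholders, not the real data.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

tags = ["angry", "aggressive", "sad", "bittersweet", "tender", "happy"]
# rows = tags, columns = tracks; 1 if the tag was applied to the track
tag_track = np.random.default_rng(0).integers(0, 2, size=(len(tags), 500))

# Project each tag into a low-dimensional latent space
# (the paper used 100 dimensions; the toy data here supports fewer)
lsa = TruncatedSVD(n_components=5, random_state=0)
tag_vectors = lsa.fit_transform(tag_track)

# Cluster the tags; Laurier et al. tried several cluster counts
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(tag_vectors)

for cluster in range(4):
    members = [t for t, l in zip(tags, labels) if l == cluster]
    print(f"cluster {cluster + 1}: {members}")
```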
Agreement between Laurier's Clusters and the 5-Cluster Taxonomy
• Based on Laurier's 100-dimensional space

Inter-cluster dissimilarity:
      C1    C2    C3    C4    C5
  C1  0    .74   .13   .20   .11
  C2        0    .86   .82   .88
  C3              0    .32   .27
  C4                    0    .53
  C5                          0

(Figure: intra-cluster similarity also shown on slide)

Laurier et al. (2009). Music mood representations from social tags. ISMIR.
Summary on Taxonomy
• What are taxonomies?
• Taxonomy vs. folksonomy
• Developing music mood taxonomies
  - from editorial labels
  - from social tags
Mood Annotations
• All annotation needs three things: taxonomy, music, people
• People
  - Experts
  - Subjects
  - Crowdsourcing (e.g., MTurk, games)
  - Deriving annotations from online services
Expert Annotation
• The MIREX Audio Mood Classification (AMC) task
  - 5-cluster taxonomy
  - 1,250 tracks selected from the APM libraries
  - A Web-based annotation system called E6K

Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. (2008). The 2007 MIREX Audio Mood Classification Task: Lessons Learned. In ISMIR.
Expert Annotation: MIREX AMC
• 2,468 judgments collected (3,750 planned)
  - Each clip had 2 or 3 judgments
  - Avg. Cohen's kappa: 0.5
• Each expert had 250 clips; 8 of 21 experts finished all assignments

  Agreements     C1   C2   C3   C4   C5   Total
  3 of 3 judges  21   24   56   21   31   153
  2 of 3 judges  41   35   18   26   14   134
  2 of 2 judges  58   61   46   73   75   313
  Total          120  120  120  120  120  600

(Chart: accuracies 0.59, 0.38, 0.54 shown on slide)
• Lessons:
  1. Missed judgments -> low accuracy
  2. Need more motivated annotators
• The dataset was built from agreements among the experts
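Cohen's kappa, the agreement measure cited above, corrects raw annotator agreement for chance agreement. A minimal sketch with fabricated judgments, not the actual MIREX data:

```python
# A minimal sketch of measuring inter-annotator agreement with
# Cohen's kappa over the 5 mood clusters. The two label sequences
# below are fabricated for illustration.
from sklearn.metrics import cohen_kappa_score

clusters = ["C1", "C2", "C3", "C4", "C5"]
judge_a = ["C1", "C2", "C2", "C5", "C3", "C4", "C1", "C3"]
judge_b = ["C1", "C2", "C3", "C5", "C3", "C4", "C2", "C3"]

kappa = cohen_kappa_score(judge_a, judge_b, labels=clusters)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0 = chance-level
```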
Crowdsourcing: Amazon Mechanical Turk
• Lee & Hu (2012): compared expert and MTurk annotations
  - The same 1,250 music clips as in MIREX AMC
  - The same 5 clusters
  - Annotators: "Turkers" who work on human intelligence tasks for very low payment
• Advantage of MTurk: plenty of labor
• Disadvantage of MTurk: quality control

Lee, J. H., & Hu, X. (2012). Generating Ground Truth for Music Mood Classification Using Mechanical Turk. In Proceedings of the Joint Conference on Digital Libraries.
MoodSwings (Kim et al., 2008)
• A 2-player Web-based game to collect annotations of music pieces in the arousal-valence space
• Time-varying annotations are collected at a rate of 1 sample per second
• Players "score" for agreement with their competitor

Kim, Y. E., Schmidt, E., & Emelle, L. (2008). MoodSwings: A collaborative game for music mood label collection. ISMIR.
MoodSwings: Challenges
• Needs a pair of players -> simulated AI player
  - Randomly following the real player -> less challenging
  - Based on a prediction model -> needs training data
• Attracting players (true for all games)
  - Must be challenging and fun
  - Music: more recent and entertaining
  - Game interface: sleek, aesthetic
• Research value: variety of music and mood

Morton, B. G., Speck, J. A., Schmidt, E. M., & Kim, Y. E. (2010). Improving music emotion labeling using human computation. In HCOMP.
MoodSwings: MTurk Version
• Single-person game
• No competition, no scores
• Monetary reward (0.25 USD per 11 pieces)
• Consistency checks:
  - 2 identical pieces whose labels must fall within the experts' decision boundary
  - Must not label all clips the same way

Speck, J. A., Schmidt, E. M., Morton, B. G., & Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation. ISMIR.
MoodSwings: Comparison of the Two Versions
• Label correlation between versions: valence 0.71, arousal 0.85

Speck, J. A., Schmidt, E. M., Morton, B. G., & Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation. ISMIR.
The "Semantic Gap"
• Aucouturier & Pachet (2004): a "semantic gap" between low-level music features and high-level human perception
• MIREX AMC performance (5 classes):

  Year  Top 3 accuracies
  2007  61.50%, 60.50%, 59.67%
  2008  63.67%, 58.20%, 56.00%
  2009  65.67%, 65.50%, 63.67%
  2010  63.83%, 63.50%, 63.17%
  2011  69.50%, 67.17%, 66.67%
  2012  67.83%, 67.67%, 67.17%

Aucouturier, J.-J., & Pachet, F. (2004). Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1).
Multimodal Classification
• Improving classification performance by combining multiple independent sources describing the MUSIC: audio, lyrics, social tags, metadata
• Bischoff et al., 2009; Yang & Lee, 2004; Laurier et al., 2009; Hu & Downie, 2010; Schuller et al., 2011
Lyric Features
• Basic features: content words, part-of-speech, function words
• Lexicon features: words in WordNet-Affect
• Psycholinguistic features:
  - Psychological categories in GI (General Inquirer)
  - Scores in ANEW (Affective Norms for English Words)
• Stylistic features: punctuation marks; interjection words
• Statistics: e.g., number of words per minute

ANEW examples:
  Word    Valence  Arousal  Dominance
  Happy   8.21     6.49     6.63
  Sad     1.61     4.13     3.45
  Thrill  8.05     8.02     6.54
  Kiss    8.26     7.32     6.93
  Dead    1.94     5.73     2.84
  Dream   6.73     4.53     5.53
  Angry   2.85     7.17     5.55
  Fear    2.76     6.96     3.22

Hu, X., & Downie, J. S. (2010). Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio. JCDL.
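To make the ANEW-based features concrete, here is a minimal sketch that averages valence/arousal/dominance scores over the lexicon words found in a lyric. The tiny dictionary copies the example values from the table above; a real system would load the full ANEW list.

```python
# A minimal sketch of one psycholinguistic lyric feature: averaging
# ANEW scores over the words of a lyric. The lexicon below is a tiny
# excerpt from the slide's examples, not the full ANEW word list.
anew = {  # word: (valence, arousal, dominance)
    "happy": (8.21, 6.49, 6.63), "sad":   (1.61, 4.13, 3.45),
    "kiss":  (8.26, 7.32, 6.93), "dead":  (1.94, 5.73, 2.84),
    "dream": (6.73, 4.53, 5.53), "fear":  (2.76, 6.96, 3.22),
}

def anew_features(lyric: str):
    """Mean valence/arousal/dominance over the ANEW words in a lyric."""
    hits = [anew[w] for w in lyric.lower().split() if w in anew]
    if not hits:
        return None  # no lexicon coverage
    n = len(hits)
    return tuple(sum(dim) / n for dim in zip(*hits))

print(anew_features("I dream of a happy kiss"))   # leans positive
print(anew_features("dead inside sad and fear"))  # leans negative
```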
Lyric Feature Example
Top General Inquirer (GI) features in the category "Aggressive":
• WlbPhys: words connoting the physical aspects of well-being, including its absence (blood, dead, drunk, pain)
• Perceiv: words referring to the perceptual process of recognizing or identifying something by means of the senses (dazzle, fantasy, hear, look, make, tell, view)
• Exert: action words (hit, kick, drag, upset)
• TIME: words indicating time (noon, night, midnight)
• COLL: words referring to all human collectivities (people, gang, party)
• WlbLoss: words related to a loss in a state of well-being, including being upset (burn, die, hurt, mad)
Lyric Classification Results
(Chart: no significant difference between the top feature combinations)

(Chart: distribution of the feature "!")

(Chart: distribution of the feature "hey")

(Chart: distribution of the feature "number of words per minute")
Combine with an Audio-Based Classifier
• A leading system in MIREX AMC 2007 and 2008: Marsyas
  - Music Analysis, Retrieval and Synthesis for Audio Signals
  - Led by Prof. Tzanetakis at the University of Victoria
  - Uses audio spectral features
  - marsyas.info
  - Finalist in the SourceForge Community Choice Awards 2009
Hybrid Methods
• Late fusion: separate lyric and audio classifiers each make a prediction, and the predictions are combined into a final prediction
  - Dominant, due to its clarity and its avoidance of the "curse of dimensionality"
• Early fusion: feature concatenation, with a single classifier making the prediction

Hu, X., & Downie, J. S. (2010). When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis. ISMIR.
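A minimal sketch of the two fusion schemes, assuming scikit-learn-style classifiers and placeholder feature matrices (the real systems used richer features and classifiers):

```python
# Early vs. late fusion over hypothetical lyric and audio features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lyric = rng.normal(size=(200, 50))   # e.g., bag-of-words features
X_audio = rng.normal(size=(200, 30))   # e.g., spectral features
y = rng.integers(0, 5, size=200)       # 5 mood clusters

# Early fusion: concatenate features, train one classifier
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_lyric, X_audio]), y)

# Late fusion: one classifier per modality, average the posterior
# probabilities (the mixing weight could also be learned)
clf_l = LogisticRegression(max_iter=1000).fit(X_lyric, y)
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)

def late_fusion_predict(xl, xa, w=0.5):
    proba = w * clf_l.predict_proba(xl) + (1 - w) * clf_a.predict_proba(xa)
    return proba.argmax(axis=1)

print(late_fusion_predict(X_lyric[:5], X_audio[:5]))
```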
Top Lyric Features
(Figure shown on slide)

Top Lyric Features in "Calm"
(Figure shown on slide)

Top Lyric Features vs. Top Affective Words
(Figure shown on slide)
Other Textual Features Used in Music Mood Classification
• Based on SentiWordNet
  - Assigns each WordNet synset three sentiment scores: positivity, negativity, objectivity
• Simple syntactic structures: negation, modifiers
• Lyric rhyme patterns (inspired by poetry)
• Contextual features (beyond lyrics): social tags, blogs, playlists, etc.
Cross-Cultural Mood Classification
• Tomorrow, Oral Session 1
• Cross-cultural model applicability:
  - 23 mood categories based on AllMusic.com
  - Train on songs in one culture and classify songs in the other

Yang & Hu (2012). Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs. ISMIR.
Summary of Categorical and Multimodal Approaches
• Natural-language labels are intuitive to end users
• Based on supervised learning techniques
• Studies mostly focus on feature engineering
• Multimodal approaches improve performance
  - Effectiveness and efficiency
• Cross-cultural mood classification: just started
• Challenges
  - Ambiguity inherent in terms (Meyer's "distortion")
  - Hierarchy of mood categories
  - Connections between features and mood categories
Agenda: Automatic music affect analysis - Dimensional approach
Dimensional Approach
• What the dimensional model is, and why use it
• Computational models for dimensional music emotion recognition
• Issues
  - Difficulty of emotion rating
  - Subjectivity of emotion perception
  - Context of music listening
  - Usability of the UI
Categorical Approach
(Diagram: audio spectrum mapped to the categories of Hevner's model, 1936)

Dimensional Approach
(Diagram: audio spectrum mapped to the circumplex model, Russell 1980)
What Is the Dimensional Model?
• An alternative conceptualization of emotions, based on their placement along broad affective dimensions
• Obtained by analyzing "similarity ratings" of emotion words or facial expressions with factor analysis or multi-dimensional scaling
  - For example, Russell (1980) asked 343 subjects to describe their emotional states using 28 emotion words, and used four different methods to analyze the correlations between the emotion ratings
• Many studies identify similar dimensions
The Valence-Arousal (VA) Emotion Model
• Arousal (activation): energy or neurophysiological stimulation level
• Valence (evaluation): pleasantness; positive vs. negative affective states
(Russell, 1980)
More Dimensions
• The world of emotions is not 2D (Fontaine et al., 2007)
• 3rd dimension: potency-control
  - Feeling of power/weakness; dominance/submission
  - Anger ↔ fear; pride ↔ shame; interest ↔ disappointment
• 4th dimension: predictability
  - Surprise; stress ↔ fear; contempt ↔ disgust
• However, the 2D model seems to work fine for music emotion
Why the Dimensional Model 1/3
• Free of emotion words
• Emotion words are not always precise and consistent
  - We often cannot find proper words to express our feelings
  - Different people understand the same words differently
  - Emotion words are difficult to translate, and may not exist with the exact same meaning in different languages (Russell, 1991)
• Semantic overlap between emotion categories
  - Cheerful, happy, joyous, party/celebratory
  - Melancholy, gloomy, sad, sorrowful
• Difficult to determine how many, and which, categories to use in a mood classification system
No Consensus on Mood Taxonomy in MIR
  Work                      #  Emotion description
  Katayose et al. [icpr98]  4  Gloomy, urbane, pathetic, serious

Mr. Emo: Music Retrieval in the Emotion Plane
• Useful for mobile devices with small display space
(Demo: retrieval in the arousal-valence plane)

Yang, Y.-H., Lin, Y.-C., Cheng, H.-T., & Chen, H.-H. (2008). Mr. Emo: Music retrieval in the emotion plane. Proc. ACM Multimedia.
How to Further Improve the Accuracy
• Larger datasets
• Higher-level features
  - Articulation, pitch contour, melody direction, tonality (Gabrielsson & Lindström, 2010)
  - High-level music concepts (tags)
  - Lyrics
• Better understanding of the human perception of valence
• Consider the correlation between arousal and valence
  - Output-associative relevance vector machine (Nicolaou et al., 2012)
• Exploit temporal information
  - Long short-term memory neural networks (Weninger et al., 2011), or HMMs (hidden Markov models)
Issue 1: Difficulty of Emotion Annotation
• Rating emotion is difficult
  - User fatigue
  - Uniform ratings
  - Low quality of the ground truth
  - Difficult to create a large-scale dataset

(Diagram: training data -> feature extraction -> features, paired with manual annotation -> emotion values, feed regressor training; test data -> feature extraction -> features -> regressor -> automatically predicted emotion value. Illustration: annotations differ between user A and user B.)
AnnoEmo: GUI for Emotion Rating
• Encourages differentiation
  - Click to listen again
  - Drag & drop to modify an annotation
(Demo)

Yang, Y.-H., Su, Y.-F., Lin, Y.-C., & Chen, H.-H. (2007). Music emotion recognition: The role of individuality. Proc. Int. Workshop on Human-Centered Multimedia.
Solution: Ranking Instead of Rating
• Determines the position of a song by its relative ranking with respect to other songs
• Strengths
  - Ranking is easier than rating
  - Encourages differentiation
  - Avoids inconsistency
  - Enhances the quality of the ground truth

(Example of a relative ranking:)
  Oh Happy Day
  I Want to Hold Your Hand by The Beatles
  I Feel Good by James Brown
  What a Wonderful World by Louis Armstrong
  Into the Woods by My Morning Jacket
  The Christmas Song
  C'est La Vie
  Labita by Lisa Ono
  Just the Way You Are by Billy Joel
  Perfect Day by Lou Reed
  When a Man Loves a Woman by Michael Bolton
  Smells Like Teen Spirit by Nirvana
Ranking-Based Emotion Annotation
• Emotion tournament over songs a-h: "Which song is more positive?"
  - Requires only n-1 pairwise comparisons
  - The global ordering can later be approximated by a greedy algorithm [jair99]
• Use machine learning ("learning-to-rank") to train a model that ranks songs according to emotion

Yang, Y.-H., & Chen, H.-H. (2011). Ranking-based emotion recognition for music organization and retrieval. IEEE TASLP, 19(4).
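A minimal sketch of the tournament idea: a single-elimination bracket over n songs collects exactly n-1 pairwise comparisons, which can then feed a learning-to-rank model. The ask() function and the hidden valence scores below are hypothetical stand-ins for human judgments:

```python
# Emotion tournament: n-1 pairwise comparisons for n songs.
valence = {"a": .9, "b": .1, "c": .4, "d": .7,
           "e": .2, "f": .8, "g": .5, "h": .3}  # hidden toy ground truth

comparisons = []  # collected (winner, loser) pairs -> ranking training data

def ask(s1, s2):
    """Simulated annotator answering: which song is more positive?"""
    winner, loser = (s1, s2) if valence[s1] >= valence[s2] else (s2, s1)
    comparisons.append((winner, loser))
    return winner

def tournament(songs):
    """Run rounds of pairwise comparisons until one song remains."""
    while len(songs) > 1:
        songs = [ask(songs[i], songs[i + 1]) for i in range(0, len(songs), 2)]
    return songs[0]

most_positive = tournament(list("abcdefgh"))
print(most_positive, len(comparisons))  # 'a', 7 comparisons for n=8
```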
Issue 2: Emotion Perception Is Subjective
• A song can be perceived differently by two people, especially for emotions in the 3rd and 4th quadrants
• This also explains why valence prediction is more challenging
(Examples: (a) Smells Like Teen Spirit, (b) A Whole New World, (c) The Rose, (d) Tell Laura I Love Her)

Subjectivity of Emotion Perception
(Figure: each circle represents the emotion annotation for a music piece by one subject)
Solution: Personalized MER
• Requires some emotion annotations from the target user to train a personalized model
(Chart: significant improvement over the general method (p < 0.01) as the number of personal annotations |Ф| grows)

Yang, Y.-H., et al. (2009). Personalized music emotion recognition. ACM SIGIR.
Issue 3: Context of Music Listening
• Listening mood/context
• Familiarity / associated memories
• Preference for the singer/performer/song
• Social context (alone, with friends, with strangers)
Issue 4: Usability of the UI 1/2
• A new user may not be familiar with the meaning of valence and arousal => display mood tags

Issue 4: Usability of the UI 2/2
• Issues
  - How to automatically map emotion words to the VA space
  - How to "personalize" this mapping: different people may interpret the words differently
• Let the user determine which emotion word is displayed
  - The emotion word can be used as a shortcut to organize the user's music collection over the VA space
• Some preliminary work has been reported (Wang et al., 2012)

Wang, J.-C., Yang, Y.-H., Chang, K.-C., Wang, H.-M., & Jeng, S.-K. (2012). Exploring the relationship between categorical and dimensional emotion semantics of music. Proc. MIRUM.
Summary of the Dimensional Approach
• The dimensional approach is an alternative to the categorical approach
  - Free of ambiguity and granularity issues
  - A ready canvas for visualization, retrieval, and browsing
  - Easy to track emotion variation within a music piece
• Regression-based computational methods
• Issues
  - Replace the difficult task of emotion rating with ranking
  - Address individual differences via personalization
  - Take the user's context into account
  - Enhance usability by displaying mood labels
Agenda: Automatic music affect analysis - Temporal approach
Temporal Emotion Variation
• Emotion may change as a musical piece unfolds
  - The chorus is usually of greater emotional intensity than the verse
• Failing to account for the dynamic (time-varying) nature of music emotion may limit the performance of MER
• The combination of "anticipation" and "surprise" is important for our enjoyment of music (Huron, 2006)
The Temporal Aspect Is Usually Neglected
• A 10-30 second segment is often used to represent the whole song
  - To reduce the emotion variation within the segment
  - To lessen the annotation burden on the subjects
• The segment is selected
  - Manually, by picking the most representative part
  - By identifying the chorus section automatically
  - By selecting the middle 30 seconds
  - By selecting the [30, 60] second segment
• Short-time features are pooled (e.g., by taking the mean and standard deviation) over the whole segment, yielding a segment-level feature vector (see the sketch below)
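A minimal sketch of that pooling step, with hypothetical frame-level MFCC features:

```python
# Summarize frame-level features over a segment by mean and standard
# deviation, producing one fixed-length segment-level vector.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(1292, 13))  # ~30 s of 13-dim MFCC frames (toy)

segment_vector = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
print(segment_vector.shape)  # (26,) = 13 means + 13 standard deviations
```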
Comparison
• Music emotion recognition
  - Emotion annotation: segment-level (10-30 sec)
  - Feature extraction: segment-level vector or bag-of-frames
  - Model training: segment-level
  - Prediction: a segment-level, static estimate that summarizes the emotion of the whole song
• Music emotion tracking
  - Emotion annotation: second-level (1-3 sec)
  - Feature extraction: second-level vector
  - Model training: second-level; makes use of the temporal relationships
  - Prediction: moment-to-moment "continuous" emotion variation within a song

Post-Performance Response
• In contrast [to continuous response collection], the post-performance response assumes that music emotion can be understood by collecting affective responses after a musical stimulus has been sounded
• Duration neglect: post-performance (remembered) ratings ≠ averaged continuous-response data (Duke & Colprit, 2001)
  - Post-performance ratings were generally higher (closer to the "peak" or "end" experience than to the average)
  - The post-performance experience is information-poor; listeners are forced to "compress" their response into an overall impression
Issues
• The emotion of popular music may not change very much or very often (Schmidt & Kim, 2008)
• Collecting moment-to-moment responses is interruptive and labor-intensive
  - Gathering responses along 2 or more scales in the same pass may place excessive cognitive load on the participant
• Lag structure
  - There is a variable "reaction lag" between events and subject responses (Schubert, 2001)
  - Arousal responses follow the path of loudness with a delay of 2 to 4 seconds (Schubert, 2004)
Annotation Examples
• MoodSwings Turk Dataset (Schmidt & Kim, 2011): 240 15-second clips annotated with Mechanical Turk
(Figures: annotation traces for Live, "Waitress" and Billy Joel, "Captain Jack")
Computational Approaches 1/2
• The VA approach: predict the emotion values for every one or two seconds
• Correspondingly, participants are asked to rate the VA values every one or two seconds
  - It is easier to track continuous changes of emotional expression by rating VA values than by labeling mood classes (Gabrielsson, 2002)
• Algorithms
  - Regression (neglects temporal information)
  - Time-series analysis (makes use of temporal information) (Schubert, 1999; Korhonen et al., 2006; Schmidt & Kim, 2010)
Time-Series Analysis
• Autoregression with extra inputs (ARX): learn the coefficients A_k and B_k of

    y(t) = A_1 y(t-1) + ... + A_na y(t-na) + B_0 u(t) + ... + B_nb u(t-nb) + e(t)

  where y(t) is the valence (arousal) at time t, the y(t-k) are valence (arousal) at previous times, u(t), ..., u(t-nb) are the current and lagged music features, and e(t) is noise
• The A_k terms capture the correlation between successive output values (V & A); the B_k terms capture the relationship between V, A and the input music features u

Korhonen, M. D., Clausi, D. A., & Jernigan, M. E. (2006). Modeling emotional content of music using system identification. IEEE TSMC, 36(3).
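A minimal sketch of fitting an ARX model by ordinary least squares on synthetic data; the model orders and the single loudness-like feature are toy choices, not those of Korhonen et al.:

```python
# Fit ARX coefficients by least squares on a synthetic arousal trace.
import numpy as np

rng = np.random.default_rng(0)
T = 300
u = rng.normal(size=T)   # one music feature over time (e.g., loudness)
y = np.zeros(T)          # synthetic arousal trace
for t in range(2, T):
    y[t] = 0.6 * y[t-1] + 0.2 * y[t-2] + 0.5 * u[t-2] + 0.05 * rng.normal()

na, nb = 2, 3  # number of output lags; number of input terms (lags 0..2)
rows, targets = [], []
for t in range(max(na, nb), T):
    past_y = [y[t - k] for k in range(1, na + 1)]
    past_u = [u[t - k] for k in range(0, nb)]
    rows.append(past_y + past_u)
    targets.append(y[t])

theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
print("A_k:", theta[:na])   # ~[0.6, 0.2]
print("B_k:", theta[na:])   # ~[0, 0, 0.5]: the feature acts with a lag
```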
Block-wise Prediction
• Make a prediction for each short-time segment
  - Fixed-length sliding window (Yang et al., 2006); see the sketch below
  - Adaptive window (Lu et al., 2006): detect emotion-change boundaries, segment, then make a prediction for each segment
    - Each segment is assumed to contain a constant mood
    - The minimum segment length is set to 16 seconds for classical music (Lu et al., 2006)
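A minimal sketch of the fixed-length sliding-window variant; `model` is assumed to be any trained regressor with a scikit-learn-style predict():

```python
# Block-wise prediction with a fixed-length sliding window: pool the
# frame-level features inside each window and apply a trained regressor
# to obtain one emotion estimate per window.
import numpy as np

def sliding_window_predict(frames, model, win=100, hop=50):
    """frames: (n_frames, n_dims) frame-level features."""
    predictions = []
    for start in range(0, len(frames) - win + 1, hop):
        block = frames[start:start + win]
        pooled = np.concatenate([block.mean(axis=0), block.std(axis=0)])
        predictions.append(model.predict(pooled[None, :])[0])
    return np.array(predictions)  # one emotion estimate per window
```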
Subjectivity Issue
• Listener ratings collected continuously in response to music are notoriously diverse (Upham & McAdams, 2009)
(Figure: thirty audience members' ratings, in colour, of experienced emotional intensity during a live performance of Mozart's Overture to Le nozze di Figaro, with the average rating for this population in black; Upham & McAdams, 2009)
Computational Approaches 2/2
• The VA-Gaussian approach: model the emotion expression at each time instance as a Gaussian distribution, to take the subjectivity issue into account
  - Predict the time-dependent distribution parameters N(μ, Σ)
• Algorithms
  - Regression on the five parameters (Yang & Chen, 2011)
  - Kalman filter (Schmidt & Kim, 2010)
  - Acoustic Emotion Gaussians (Wang et al., 2012)
(Figure: time-varying emotion distribution for "American Pie" by Don McLean)

Wang et al. (2012). The Acoustic Emotion Gaussians model for emotion-based music annotation and retrieval. Proc. ACM MM.
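A minimal sketch of the VA-Gaussian idea: estimating the five parameters (two means, two variances, one correlation) from many subjects' annotations at a single time instance; the annotations are simulated here:

```python
# Summarize subjects' (valence, arousal) annotations at one time
# instance by the five parameters of a 2-D Gaussian.
import numpy as np

rng = np.random.default_rng(0)
# toy annotations from 30 subjects for a single second of music
va = rng.multivariate_normal([0.3, 0.6], [[0.04, 0.02], [0.02, 0.09]], 30)

mu = va.mean(axis=0)             # mean valence, mean arousal
cov = np.cov(va, rowvar=False)   # 2x2 covariance matrix
rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

print("mu:", mu)
print("var:", cov[0, 0], cov[1, 1], "rho:", rho)
# A tracking model then predicts these five parameters for every second
```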
Application of Emotion Tracking
• Specify a trajectory to indicate the desired emotion variation within a musical piece

Hanjalic, A. (2006). Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine.
Summary of the Temporal Approach
• The emotion of music changes as time unfolds
  - Track the second-by-second emotion variation as a trajectory
  - Model emotion as a stochastic Gaussian distribution
• Issues
  - Difficulty of gathering reliable annotations (ranking is not useful here)
  - Lag structure in emotion perception
  - The effects of expectation and surprise (see Sweet Anticipation by David Huron)
  - Subjective differences
Agenda: Beyond music
Emotion in Images
• The International Affective Picture System (IAPS)
Emotions in Videos
• Need to depict the emotion variation over time (Zhang et al., 2009)
Emotion Recognition in Images/Videos
• Image emotion recognition
  - Categorical approach
  - Dimensional approach
• Video emotion recognition
  - Temporal approach
  - Makes use of dynamic attributes such as motion intensity and shot-change rate
Visual Features
(Figure shown on slide)
Categorical Approach
• Horror event detection in videos (Moncrieff et al., 2001)
• Affect detection in movies (Kang, 2002)
  - The colors yellow, orange, and red correspond to fear and anger
  - Blue, violet, and green appear when the spectator feels high valence and low arousal

  Class      startle  apprehension  surprise  climax
  Precision  63%      89%           93%       80%

  Class      fear  sadness  joy
  Precision  81%   77%      78%
Dimensional/Temporal Approach
• Valence
  - Lighting key
  - Color (saturation, color energy)
  - Rhythm regularity
  - Pitch
• Arousal
  - Shot-change rate
  - Motion intensity
  - Sound energy
  - Rhythm-based features
  - Tempo and beat strength
Application: Affective Video Recommendation
• iMTV (Zhang et al., 2010)

Application: Video Content Representation
• (Hanjalic & Xu, 2005)
Explicit (Self-Report) vs. Implicit Tagging
• Implicit tagging focuses on felt emotion
Bio-Sensors and Features
• Facial expression
• Blood volume pulse
• Respiration pattern
• Skin temperature
• Skin conductance
• EEG (electroencephalogram)
• EMG (electromyogram)
The DEAP Dataset 1/2
• A music video recommendation system: the user's bodily responses (EEG / peripheral) reveal emotion, which, together with the user's taste, drives the recommended music video
• Both explicit and implicit tagging

Koelstra et al. (2012). DEAP: A Database for Emotion Analysis using Physiological Signals. IEEE Trans. Affective Computing.
The DEAP Dataset 2/2
• 40 one-minute music videos; 32 participants
• Recorded signals: EEG, peripheral physiological signals, face video
• Ratings: valence, arousal, dominance, liking, familiarity
• Features: multimedia content features, EEG features, physiological features
• Analyses: correlation and classification
Correlation of EEG and Ratings
(Chart: correlations across the theta, alpha, beta, and gamma EEG bands)

Koelstra et al. (2012). DEAP: A Database for Emotion Analysis using Physiological Signals.
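A minimal sketch of a DEAP-style correlation analysis, correlating a per-trial EEG band-power feature with self-reported ratings; the data below are fabricated, whereas DEAP provides 40 real trials per participant:

```python
# Correlate one EEG band-power feature with per-trial ratings.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
alpha_power = rng.normal(size=40)                  # one value per trial
arousal = 0.4 * alpha_power + rng.normal(size=40)  # toy rating values

r, p = pearsonr(alpha_power, arousal)
print(f"r = {r:.2f}, p = {p:.3f}")
```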
Summary of Visual Emotion Recognition
• Commonalities
  - Similar emotion models and computational methods (categorical, dimensional, temporal)
  - Similar audio features
  - Similar results (valence is more difficult than arousal)
• Differences
  - The temporal approach is preferred
  - More focus on highlight extraction
  - More focus on "implicit" tagging of emotion, especially the study of physiological signals
Agenda: Conclusion
Related Books
• Sweet Anticipation: Music and the Psychology of Expectation
• The Oxford Handbook of Music Psychology
• Music Emotion Recognition
• Handbook of Music and Emotion: Theory, Research, Applications
Recent Surveys
• Kim, Y. E., et al. (2010). Music emotion recognition: A state of the art review. In Proc. ISMIR.
• Yang, Y.-H., & Chen, H.-H. (2012). Machine recognition of music emotion: A review. ACM Trans. Intelligent Systems & Technology, 3(4).
• Barthet, M., et al. (2012). Multidisciplinary perspectives on music emotion recognition: Implications for content and context-based models. In Proc. CMMR.
• Zeng, Z., et al. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Analysis & Machine Intelligence, 31(1).