
The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

May 24, 2015


Ju-Chiang Wang

One of the most exciting but challenging endeavors in music research is to develop a computational model that comprehends the affective content of music signals and organizes a music collection according to emotion. In this paper, we propose a novel acoustic emotion Gaussians (AEG) model that defines a proper generative process of emotion perception in music. As a generative model, AEG permits easy and straightforward interpretation of the model learning process. To bridge the acoustic feature space and the music emotion space, a set of latent feature classes, learned from data, is introduced to perform the end-to-end semantic mappings between the two spaces. Based on the space of latent feature classes, the AEG model is applicable to both automatic music emotion annotation and emotion-based music retrieval. To gain insights into the AEG model, we also provide illustrations of the model learning process. A comprehensive performance study on two emotion-annotated music corpora, MER60 and MTurk, shows that AEG outperforms the state-of-the-art methods in automatic music emotion annotation. Moreover, for the first time, a quantitative evaluation of emotion-based music retrieval is reported.
Transcript
Page 1: Title

The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

Ju-Chiang Wang, Yi-Hsuan Yang, Hsin-Min Wang, and Shyh-Kang Jeng

Academia Sinica and National Taiwan University, Taipei, Taiwan

Page 2: Outline

• Introduction
• Related Work
• The Acoustic Emotion Gaussians (AEG) Model
• Music Emotion Annotation and Retrieval
• Evaluation and Result
• Conclusion and Future Work

Page 3: Introduction

• One of the most exciting but challenging endeavors in music information retrieval (MIR): develop a computational model that comprehends the affective content of music signals
• Why is emotion so important to an MIR system?
– Music is the finest language of emotion
– We use music to convey or modulate emotion
– Emotion has a smaller semantic gap than genre
– Every situation in daily life carries emotion, which enables context-dependent music recommendation

Page 4: Dimensional Emotion: The Valence-Arousal (Activation) Model

• Emotions are represented as numerical values (instead of discrete labels) over a number of emotion dimensions
• Good visualization, intuitive, a unified model
• Easy to capture the temporal change of emotion

[Figure: two example VA-based systems, the Mufin Player and Mr. Emo developed by Yang and Chen]

Page 5: The Valence-Arousal Annotation

• Emotion is subjective: different emotions may be elicited by the same song, scattering its annotations in the VA space
• Assumption: the VA annotations of a song can be drawn from a Gaussian distribution, as observed in the collected annotations
• Subjectivity issue: addressed by having each song observed by multiple subjects
• Temporal change: summarized by the scope of the changes

Page 6: Related Work: Regression for Gaussian Parameters

• The Gaussian-parameter approach directly learns five regression models to predict, respectively, the two means and the three variance/covariance terms of valence and arousal
• The Gaussian parameters are estimated without any joint modeling

[Figure: a feature vector x feeds five independent regressors that output m_Val, m_Aro, s_Val-Val, s_Aro-Aro, and s_Val-Aro]
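To make this baseline concrete, here is a minimal sketch of the five-regressor scheme, assuming scikit-learn's SVR as the regressor (the exact regressor and features in the cited work may differ); X and Y are hypothetical placeholder arrays:

```python
# Sketch of the five-regressor baseline: one independent regressor per
# Gaussian parameter, with no joint estimation. SVR is an assumption here.
import numpy as np
from sklearn.svm import SVR

X = np.random.rand(100, 70)   # hypothetical song-level features (n_songs x n_dims)
Y = np.random.rand(100, 5)    # columns: m_Val, m_Aro, s_Val-Val, s_Aro-Aro, s_Val-Aro

regressors = [SVR(kernel="rbf").fit(X, Y[:, p]) for p in range(5)]

# Predict the five VA Gaussian parameters of a new song independently.
x_new = np.random.rand(1, 70)
params = np.array([r.predict(x_new)[0] for r in regressors])
```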

Page 7: The Acoustic Emotion Gaussians Model: Modeling between the VA Space and Acoustic Features

• A principled probabilistic/statistical approach
• Represent the acoustic features of a song by a probabilistic histogram vector
• Develop a model that comprehends the relationship between acoustic features and the VA space (annotations)

[Figure: example acoustic GMM posterior distributions]

Page 8: AEG: Construct the Feature Reference Model

• Randomly select frame vectors from each track of a universal music database to form a global frame set
• Train a global acoustic GMM on this set with the EM algorithm; its components A_1, ..., A_K serve as the reference model for acoustic feature encoding

[Figure: music tracks and audio signals → frame-based features → global frame set → EM training → global acoustic GMM]

Page 9: Represent a Song in the Probabilistic Space

• Compute the posterior probabilities of a song's frame vectors over the acoustic GMM components A_1, ..., A_K, then aggregate them into a histogram: the acoustic GMM posterior
• Each dimension corresponds to a specific acoustic pattern, called a latent feature class (or audio word)

[Figure: frame feature vectors → posterior probabilities over the acoustic GMM → K-dim posterior histogram]
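A minimal sketch of this encoding step, assuming scikit-learn's GaussianMixture as the global acoustic GMM; the frame arrays are hypothetical placeholders:

```python
# Sketch: encode a clip as its acoustic GMM posterior histogram.
import numpy as np
from sklearn.mixture import GaussianMixture

K = 32
global_frames = np.random.rand(10000, 39)  # hypothetical corpus-wide frame features
gmm = GaussianMixture(n_components=K, covariance_type="full").fit(global_frames)

clip_frames = np.random.rand(1200, 39)     # hypothetical frames of one clip
# Per-frame posteriors over the K latent feature classes, averaged over
# frames, give the K-dim posterior histogram theta (sums to 1).
theta = gmm.predict_proba(clip_frames).mean(axis=0)
```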

Page 10: Generative Process of the VA GMM

• Key idea: each component VA Gaussian corresponds to a latent feature class (a specific acoustic pattern)

[Figure: the audio signal of each clip, encoded over the acoustic GMM components A_1, ..., A_K, maps to a mixture of Gaussians in the VA space]

Page 11: Total Likelihood Function of the VA GMM

• To cover subjectivity, each training clip $s_i$ is annotated by multiple subjects $\{u_j\}$; the corresponding annotations are $\mathbf{e}_{ij}$
• Assume each annotation $\mathbf{e}_{ij}$ of clip $s_i$ can be generated by a VA GMM weighted by the acoustic GMM posterior $\{\theta_{ik}\}$; this gives the annotation-level likelihood, in which $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ are the parameters of each latent VA Gaussian to learn:

$p(\mathbf{e}_{ij} \mid s_i) = \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

• Form the corpus-level likelihood, where each annotation contributes equally to the clip-level likelihood, and maximize it using the EM algorithm:

$p(\mathbf{E} \mid \theta) = \prod_{i=1}^{N} p(\mathbf{E}_i \mid s_i) = \prod_{i=1}^{N} \prod_{j=1}^{U_i} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
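For concreteness, a small numpy/scipy sketch of the annotation-level likelihood, with hypothetical values for theta and the latent VA Gaussians:

```python
# Sketch: p(e_ij | s_i) as a theta-weighted mixture of latent VA Gaussians.
import numpy as np
from scipy.stats import multivariate_normal

K = 4
theta = np.full(K, 1.0 / K)                            # acoustic GMM posterior of clip s_i
mu = np.random.randn(K, 2)                             # latent VA means (valence, arousal)
Sigma = np.stack([np.eye(2) * 0.1 for _ in range(K)])  # latent VA covariances

e_ij = np.array([0.3, -0.2])                           # one subject's VA annotation
lik = sum(theta[k] * multivariate_normal.pdf(e_ij, mu[k], Sigma[k]) for k in range(K))
```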

Page 12: User Prior Model

• Some annotations could be outliers
• Model the annotations of clip $s$ by a single clip-level Gaussian; the prior weight of each annotation is its normalized likelihood under this Gaussian:

$p(\mathbf{e} \mid u_j, s) = \mathcal{N}(\mathbf{e} \mid \mathbf{a}_s, \mathbf{B}_s), \qquad \gamma_{s,j} \leftarrow p(u_j \mid \mathbf{e}, s) = \frac{p(\mathbf{e}_j \mid u_j, s)}{\sum_{j'} p(\mathbf{e}_{j'} \mid u_{j'}, s)}$

– A larger $\mathbf{B}_s$ indicates lower label consistency (higher uncertainty)
– A smaller likelihood implies the annotation could be an outlier
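A possible sketch of these prior weights, assuming the clip-level Gaussian is fit by the sample mean and covariance of the clip's annotations (the annotations themselves are hypothetical):

```python
# Sketch: annotation prior weights gamma for one clip.
import numpy as np
from scipy.stats import multivariate_normal

E_s = np.random.randn(12, 2) * 0.3       # hypothetical VA annotations of one clip
a_s = E_s.mean(axis=0)                   # clip-level annotation mean
B_s = np.cov(E_s, rowvar=False)          # clip-level covariance; larger B_s = less consistent labels

lik = multivariate_normal.pdf(E_s, a_s, B_s)  # likelihood of each annotation
gamma = lik / lik.sum()                       # normalized prior; outliers get small gamma
```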

Page 13: Integrating the Annotation (User) Prior

• Integrate the acoustic GMM posterior $\theta_{ik}$ and the annotation prior $\gamma_{ij}$ into the generative process; the clip-level likelihood becomes a prior-weighted sum over the annotation-level likelihoods:

$p(\mathbf{E} \mid \theta) = \prod_{i=1}^{N} p(\mathbf{E}_i \mid s_i) = \prod_{i=1}^{N} \sum_{j=1}^{U_i} p(u_j \mid s_i)\, p(\mathbf{e}_{ij} \mid s_i) = \prod_{i=1}^{N} \sum_{j=1}^{U_i} \gamma_{ij} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Page 14: The Objective Function

• Take the log of $p(\mathbf{E} \mid \theta)$; according to Jensen's inequality, we derive the lower bound

$\log p(\mathbf{E} \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{U_i} \gamma_{ij} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$   (two-layer log-sum)

$\geq L_{\mathrm{bound}} = \sum_{i=1}^{N} \sum_{j=1}^{U_i} \gamma_{ij} \log \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$   (one-layer log-sum)

where $\sum_{j=1}^{U_i} \gamma_{ij} = 1$ for each clip, and $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$ are the parameters to learn.

• Then, we maximize $L_{\mathrm{bound}}$ with the EM algorithm
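As an illustration (not the authors' reference implementation), one EM iteration maximizing $L_{\mathrm{bound}}$ could look as follows; E, gamma, and theta are hypothetical data structures, and only the latent VA Gaussians {mu_k, Sigma_k} are updated:

```python
# Sketch: one EM iteration for the latent VA Gaussians. theta (acoustic
# posteriors) and gamma (annotation priors) stay fixed throughout.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(E, gamma, theta, mu, Sigma):
    """E[i][j]: annotation e_ij; gamma[i][j]: its prior weight;
    theta[i]: K-dim acoustic posterior of clip i."""
    K = mu.shape[0]
    den = np.zeros(K)
    sum_e = np.zeros((K, 2))
    sum_ee = np.zeros((K, 2, 2))
    for i, (annos, g_i) in enumerate(zip(E, gamma)):
        for e_ij, g_ij in zip(annos, g_i):
            # E-step: responsibility of each latent class for annotation e_ij
            z = theta[i] * np.array(
                [multivariate_normal.pdf(e_ij, mu[k], Sigma[k]) for k in range(K)])
            z /= z.sum()
            w = g_ij * z                              # prior-weighted responsibility
            den += w
            sum_e += w[:, None] * e_ij
            sum_ee += w[:, None, None] * np.outer(e_ij, e_ij)
    # M-step: weighted mean and covariance of each latent VA Gaussian
    mu_new = sum_e / den[:, None]
    Sigma_new = sum_ee / den[:, None, None] - np.einsum('ka,kb->kab', mu_new, mu_new)
    return mu_new, Sigma_new
```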

Page 15: The Learning of the VA GMM on MER60

[Figure: snapshots of the learned VA GMM on MER60 at iterations 2, 4, 8, 16, and 32]

Page 16: Music Emotion Annotation

• Given the acoustic GMM posterior $\{\hat{\theta}_k\}$ of a test song, predict its emotion as a single VA Gaussian $\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$:

$p(\hat{\mathbf{e}} \mid s) = \sum_{k=1}^{K} \hat{\theta}_k\, \mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \;\Rightarrow\; \{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$

[Figure: acoustic GMM posterior → learned VA GMM → predicted single Gaussian]

Page 17: Find the Representative Gaussian

• Minimize the cumulative weighted relative entropy, so that the representative Gaussian has the minimal cumulative distance from all the component VA Gaussians:

$p(\hat{\mathbf{e}} \mid \boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*) = \arg\min_{\{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}} \sum_{k=1}^{K} \hat{\theta}_k\, D_{\mathrm{KL}}\big( \mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \,\|\, p(\hat{\mathbf{e}} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \big)$

• The optimal parameters of the Gaussian are

$\boldsymbol{\mu}^* = \sum_{k=1}^{K} \hat{\theta}_k\, \hat{\boldsymbol{\mu}}_k, \qquad \boldsymbol{\Sigma}^* = \sum_{k=1}^{K} \hat{\theta}_k \left( \hat{\boldsymbol{\Sigma}}_k + (\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)(\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)^T \right)$
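A short numpy transcription of this closed-form (moment-matching) solution:

```python
# Sketch: collapse the theta-weighted VA GMM into the representative Gaussian.
import numpy as np

def representative_gaussian(theta, mu, Sigma):
    """theta: (K,), mu: (K, 2), Sigma: (K, 2, 2) -> (mu_star, Sigma_star)."""
    mu_star = theta @ mu                                  # weighted mean of component means
    d = mu - mu_star                                      # component offsets from mu_star
    Sigma_star = (np.einsum('k,kab->ab', theta, Sigma)
                  + np.einsum('k,ka,kb->ab', theta, d, d))
    return mu_star, Sigma_star
```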

Page 18: Emotion-Based Music Retrieval

Two retrieval approaches:

Approach             Indexing                 Matching
Fold-In              Acoustic GMM Posterior   Cosine Similarity (K-dim)
Emotion Prediction   Predicted VA Gaussian    Gaussian Likelihood

Page 19: The Fold-In Approach

• Fold a VA point query $\mathbf{e}_q$ into the learned VA GMM to obtain a pseudo-song distribution $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)$, solved using the EM algorithm:

$\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \log \sum_{k=1}^{K} \lambda_k\, \mathcal{N}(\mathbf{e}_q \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$

• Example: a query point may be dominated by the VA Gaussian of A_2
• Match the pseudo-song $\boldsymbol{\lambda}$ against the acoustic GMM posteriors of the songs in the music database

[Figure: the learned VA GMM with weights λ_1, ..., λ_K, a VA point query folded in, and matching against the database's acoustic GMM posteriors]
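A sketch of the fold-in and matching steps, assuming the learned VA GMM parameters from the previous pages (all arrays hypothetical):

```python
# Sketch: fold a VA point query into the learned VA GMM, then rank songs
# by cosine similarity between the pseudo-song and their posteriors.
import numpy as np
from scipy.stats import multivariate_normal

def fold_in(e_q, mu, Sigma, n_iter=10):
    K = mu.shape[0]
    dens = np.array([multivariate_normal.pdf(e_q, mu[k], Sigma[k]) for k in range(K)])
    lam = np.full(K, 1.0 / K)          # start from a uniform pseudo-song
    for _ in range(n_iter):
        z = lam * dens                 # E-step responsibilities...
        lam = z / z.sum()              # ...directly become the new lambda
    return lam

def retrieve(lam, thetas):
    """thetas: (n_songs, K) acoustic GMM posteriors of the database."""
    sims = thetas @ lam / (np.linalg.norm(thetas, axis=1) * np.linalg.norm(lam))
    return np.argsort(-sims)           # indices ranked by descending similarity
```

Note that for a single query point each EM iteration sharpens lambda toward the dominant component, so a small fixed iteration count keeps the pseudo-song distribution soft.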

Page 20: Evaluation – Dataset

• Two corpora are used: MER60 and MTurk
• MER60
– 60 music clips, each 30 seconds long
– 99 subjects in total, with each clip annotated by 40 subjects
– The VA values are entered by clicking on the emotion space on a computer display
• MTurk
– 240 clips, each 15 seconds long
– Collected via Amazon's Mechanical Turk
– Each subject rated the per-second VA values for 11 randomly-selected clips using a graphical interface
– An automatic verification step is employed, leaving each clip with 7 to 23 subjects

Page 21: Evaluation – Acoustic Features

• Adopt the bag-of-frames representation
• All the frames of a clip are aggregated into the acoustic GMM posterior, so the analysis of emotion is performed at the clip level instead of the frame level
• MER60: extracted by MIRToolbox
– Dynamic, spectral, timbre (including 13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
– 70-dim full concatenation or 39-dim MFCCs
• MTurk: provided by Schmidt et al.
– MFCCs, chroma, spectrum descriptors, and spectral contrast
– 50-dim full concatenation, 20-dim MFCCs, or 14-dim spectral contrast
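The MER60 features come from the MATLAB MIRToolbox; as a rough Python analogue of the 39-dim MFCC frame features (librosa here is a stand-in, not the original tool, and "clip.wav" is a hypothetical file):

```python
# Rough analogue of the 39-dim frame-based MFCC features; librosa is a
# substitute for MIRToolbox.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
delta = librosa.feature.delta(mfcc)                  # delta MFCCs
delta2 = librosa.feature.delta(mfcc, order=2)        # delta-delta MFCCs
frames = np.vstack([mfcc, delta, delta2]).T          # (n_frames, 39) frame vectors
```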

Page 22: Evaluation Metric for Emotion Annotation

• Average KL divergence (AKL): the KL divergence from the predicted VA Gaussian P of a test clip to its ground-truth VA Gaussian G

$D_{\mathrm{KL}}(P \,\|\, G) = \tfrac{1}{2}\Big( \mathrm{tr}(\boldsymbol{\Sigma}_G^{-1}\boldsymbol{\Sigma}_P) - \log\big|\boldsymbol{\Sigma}_G^{-1}\boldsymbol{\Sigma}_P\big| + (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T \boldsymbol{\Sigma}_G^{-1} (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G) - 2 \Big)$

• Average Mean Distance (AMD): the Euclidean distance between the mean vectors of the predicted and ground-truth VA Gaussians

$\sqrt{(\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)}$
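The two metrics as a straightforward numpy transcription of the formulas above (assuming 2-D VA Gaussians):

```python
# Sketch: AKL and AMD components for one predicted/ground-truth pair.
import numpy as np

def kl_gaussian(mu_p, S_p, mu_g, S_g):
    """KL divergence from the predicted Gaussian P to the ground truth G."""
    S_g_inv = np.linalg.inv(S_g)
    d = mu_p - mu_g
    return 0.5 * (np.trace(S_g_inv @ S_p)
                  - np.log(np.linalg.det(S_g_inv @ S_p))
                  + d @ S_g_inv @ d
                  - 2)                 # 2 = dimensionality of the VA space

def mean_distance(mu_p, mu_g):
    """Euclidean distance between the predicted and ground-truth means."""
    return float(np.linalg.norm(mu_p - mu_g))
```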

Page 23: Result for Emotion Annotation

• MER60: leave-one-out train and test
• MTurk: 70%-30% random train/test split

[Figure: annotation results in AKL and AMD; smaller is better]

Page 24: Summary for Emotion Annotation

• The performance saturates when K is sufficiently large
• A larger-scale corpus prefers a larger K (feature resolution)
• The annotation prior is effective for the AKL performance
• For MER60, the 70-dim concatenated feature performs best
• For MTurk, using MFCCs alone is more effective
• MTurk is easier and exhibits a smaller performance scale

Page 25: Result for Music Retrieval

• MTurk: 2,520 clips for training, 1,080 clips for the retrieval database
• Evaluate the ranking using the Normalized Discounted Cumulative Gain (NDCG) with 5, 10, and 20 retrieved clips:

$\mathrm{NDCG@}P = \frac{1}{Z_P}\left( R(1) + \sum_{i=2}^{P} \frac{R(i)}{\log_2 i} \right)$

where R(i) is the relevance of the clip ranked at position i and Z_P normalizes the score of the ideal ranking to 1.

[Figure: retrieval results in NDCG@5, 10, and 20; larger is better]
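A compact sketch of NDCG@P matching this formula, where rel is a hypothetical list of relevance scores in ranked order:

```python
# Sketch: NDCG@P; Z_P is obtained from the ideal (sorted) ranking.
import numpy as np

def dcg_at(rel, P):
    rel = np.asarray(rel, dtype=float)[:P]
    discounts = np.concatenate(([1.0], 1.0 / np.log2(np.arange(2, len(rel) + 1))))
    return float(rel @ discounts)      # R(1) + sum over i >= 2 of R(i)/log2(i)

def ndcg_at(rel, P):
    return dcg_at(rel, P) / dcg_at(sorted(rel, reverse=True), P)
```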

Page 26: Conclusion and Future Work

• The AEG model provides a principled probabilistic framework that is technically sound and unifies emotion-based music annotation and retrieval
• AEG better accounts for the subjective nature of emotion perception
• The model learning and semantic mapping processes are transparent and interpretable
• Future work:
– The potential for incorporating multi-modal content
– Dynamic personalization via model adaptation
– Alignment among multi-modal emotion semantics

Page 27: Appendix: PWKL for Emotion Corpus

• PWKL measures the diversity of the ground truth among all songs in a corpus; the larger, the more diverse
• We compute the pair-wise KL divergence between the ground-truth annotation Gaussians of each pair of clips in a corpus

Corpus   PWKL
MER60    5.095
MTurk    1.985

• MTurk is easier, since a safe prediction at the origin can already attain good performance
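PWKL in code, reusing kl_gaussian from the metric sketch on Page 22 and assuming the per-pair divergences are averaged into a single corpus-level number:

```python
# Sketch: PWKL as the average pair-wise KL divergence between the
# ground-truth annotation Gaussians of all ordered clip pairs.
import numpy as np

def pwkl(mus, Sigmas):                     # mus: (n, 2), Sigmas: (n, 2, 2)
    n = len(mus)
    vals = [kl_gaussian(mus[i], Sigmas[i], mus[j], Sigmas[j])
            for i in range(n) for j in range(n) if i != j]
    return float(np.mean(vals))
```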