MediaEval 2015 - Multi-Scale Approaches to the MediaEval 2015 "Emotion in Music" Task

Multi-scale Approaches to the MediaEval 2015 “Emotion in Music” Task

Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanghang Meng, Wenxiao Chen

Human-Computer Speech Interaction Lab. (HCSIL) Department of Computer Science and Technology

Tsinghua University, Beijing, China

Motivation / Main Idea

1. High correlation among the music feature sequence

2. Multi-scale methods at three different levels

• Acoustic feature (run 3)

• Regression model (run 1, 2)

• Emotion annotation (run 4)

Feature Learning with Hierarchical Deep Neural Networks (DBNs + AE) Run 3

Acoustic features were organized into 4 groups according to their physical fundamentals and the time scales on which they were extracted.

NOTE: We submitted a paper to AAAI containing details about this framework.

[Figure labels: frame lengths 60 ms, 25 ms, 25 ms, 25 ms; win: 1 s; shift: 0.5 s; final features @ 2 Hz]

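Multi-scale BLSTM-RNNs based Fusion (1)

[Figure: the input features (baseline features @ 2 Hz for Run 1, 2; New Features for Run 3) are cut into sequences of length 60, 30, 20 and 10 and fed to BLSTM_60, BLSTM_30, BLSTM_20 and BLSTM_10, respectively. Fusion of the four outputs gives the Dynamic Music Emotion (Arousal, Valence).]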

NOTE:

• BLSTM-RNNs: 5 hidden layers (2 layers pre-trained), 250 units

• Sequence length (time scale): 60, 30, 20, 10

• Sliding window with 50% overlap used during full-song testing
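The 50% overlap windowing for full-song testing could look roughly like the sketch below. The slide does not say how overlapping predictions are combined or how trailing frames are handled; averaging the overlaps and adding one final window flush with the end of the song are assumptions, and `sliding_windows`, `predict_full_song` and `predict_fn` are hypothetical names.

```python
import numpy as np

def sliding_windows(n_frames, seq_len):
    """Return (start, end) index pairs covering n_frames with 50% overlap."""
    hop = max(seq_len // 2, 1)
    if n_frames <= seq_len:
        return [(0, n_frames)]
    starts = list(range(0, n_frames - seq_len + 1, hop))
    # Assumption: add a last window flush with the end so no frame is skipped.
    if starts[-1] + seq_len < n_frames:
        starts.append(n_frames - seq_len)
    return [(s, s + seq_len) for s in starts]

def predict_full_song(features, seq_len, predict_fn):
    """Run predict_fn on each window and average overlapping predictions."""
    n_frames = features.shape[0]
    total = np.zeros(n_frames)
    count = np.zeros(n_frames)
    for start, end in sliding_windows(n_frames, seq_len):
        # predict_fn stands in for a trained model (e.g. BLSTM_20) and must
        # return one value per input frame.
        total[start:end] += predict_fn(features[start:end])
        count[start:end] += 1
    return total / count
```

With features at 2 Hz, `seq_len=60` corresponds to a 30 s window and `seq_len=10` to a 5 s window.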

[Figure: 5 partitions of the data, each with 411 clips for training and 20 clips for validation. For every partition, the validation RMSE of each trial is recorded, e.g. partition 1: RMSE 11 (trial 1), RMSE 12 (trial 2), RMSE 13 (trial 3).]


5 different partitions: 20 clips are selected randomly as the validation set

3 trials of the same model: randomized initial weights

Two criteria for model selection:

1. RMSE-first: select the model with the best RMSE for each time scale

2. RMSE+PARTITION: consider both RMSE and partition
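A minimal sketch of the partitioning and of the RMSE-first criterion follows. The total of 431 clips is taken from 411 + 20 on the slide; the function names, the seed and the shape of `rmse_table` are hypothetical. The RMSE+PARTITION criterion is not specified in enough detail on the slide, so it is not sketched here.

```python
import numpy as np

def make_partitions(n_clips=431, n_val=20, n_partitions=5, seed=0):
    """Draw random train/validation splits (411 + 20 clips by default)."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_partitions):
        order = rng.permutation(n_clips)
        val_idx = np.sort(order[:n_val])
        train_idx = np.sort(order[n_val:])
        partitions.append((train_idx, val_idx))
    return partitions

def rmse(pred, target):
    """Root mean squared error between two sequences of equal length."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def select_rmse_first(rmse_table):
    """rmse_table[p][t] is the validation RMSE of trial t on partition p.

    Returns the (partition, trial) pair with the lowest RMSE, to be applied
    once per time scale.
    """
    table = np.asarray(rmse_table, dtype=float)
    p, t = np.unravel_index(np.argmin(table), table.shape)
    return int(p), int(t)
```

With 5 partitions and 3 trials, `rmse_table` has shape (5, 3) for each of the four time scales.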


[Figure: the BLSTM_60, BLSTM_30, BLSTM_20 and BLSTM_10 models are chosen in two ways, GROUP 1 (RMSE-first) and GROUP 2 (RMSE + PARTITION). Fusion blocks in the diagram: AVERAGE, ELM 1 and ELM 2, with + Delta and + Smoothing. Output: Dynamic Music Emotion (Arousal, Valence) for Run 1, Run 2 and Run 3.]
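The fusion diagram names two kinds of fusion, AVERAGE and ELM. As a rough illustration only, the sketch below shows plain averaging and a generic extreme learning machine (random, fixed hidden layer; output weights solved by least squares). The hidden size, the tanh activation and the class name `SimpleELM` are assumptions, not the authors' configuration.

```python
import numpy as np

def average_fusion(predictions):
    """Average per-frame predictions from several models (axis 0 = model)."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

class SimpleELM:
    """Generic single-hidden-layer ELM regressor (illustrative only)."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # Input weights are drawn once at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Only the output weights are fitted, by linear least squares.
        self.beta, *_ = np.linalg.lstsq(H, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta
```

For fusion, each row of `X` would hold the outputs of the individual BLSTM models for one frame, and `y` the annotated arousal or valence value for that frame.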


triangle filter, length: 50
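The slide gives only the filter shape and length. One way to apply a length-50 triangle filter to a predicted emotion curve is sketched below; the edge padding, the normalisation to unit sum and the name `triangle_smooth` are assumptions.

```python
import numpy as np

def triangle_smooth(signal, length=50):
    """Smooth a 1-D sequence with a normalised triangular window."""
    signal = np.asarray(signal, dtype=float)
    # np.bartlett has zeros at both ends; drop them to keep `length` taps.
    kernel = np.bartlett(length + 2)[1:-1]
    kernel = kernel / kernel.sum()
    # Assumption: repeat the edge values so the output keeps the input length.
    pad_left = (length - 1) // 2
    pad_right = length - 1 - pad_left
    padded = np.pad(signal, (pad_left, pad_right), mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

At 2 Hz, 50 taps span 25 s of the prediction sequence.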

SVR based Hierarchical Regression

[Figure: for each clip (30 s), the Dynamic Music Emotion (Arousal, Valence) annotation is split into a global trend and a local fluctuation; a Global SVR is trained on the global feature and a Local SVR on the local feature. For a song, the Global SVR and Local SVR outputs are summed (SUM) to give the Dynamic Music Emotion (Arousal, Valence).]

Global features: OpenSMILE, IS13_ComParE, 6373 features

Local features: OpenSMILE, IS13_ComParE_lld, 130 features, MEAN + STD, win: 1 s, shift: 0.5 s
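The local features (MEAN + STD of the 130 low-level descriptors over a 1 s window with a 0.5 s shift) can be sketched as below. The frame rate of 100 frames per second is an assumption about the descriptor extraction, and `window_stats` is a hypothetical helper, not an openSMILE function.

```python
import numpy as np

def window_stats(lld, frame_rate=100, win_s=1.0, shift_s=0.5):
    """Mean and standard deviation of each descriptor per sliding window.

    lld: array of shape (n_frames, n_descriptors), e.g. (n_frames, 130).
    Returns an array of shape (n_windows, 2 * n_descriptors).
    """
    lld = np.asarray(lld, dtype=float)
    win = int(round(win_s * frame_rate))      # frames per window
    shift = int(round(shift_s * frame_rate))  # frames between window starts
    rows = []
    for start in range(0, lld.shape[0] - win + 1, shift):
        chunk = lld[start:start + win]
        rows.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(rows)
```

A 0.5 s shift yields one 260-dimensional vector every half second, i.e. local features at 2 Hz.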

Run 4


Conclusions:

1. Several multi-scale approaches at three levels were proposed.

2. Results illustrated the effectiveness of our new methods.

3. Multi-scale BLSTMs based Fusion with ELMs (Run 2) was almost the best.

4. SVR based Hierarchical Regression is a promising method.

Future Work:

• Select the time scale automatically and systematically

• Improve multi-scale feature learning

[Figure labels: BLSTM-AVG, BLSTM-ELM, R1 + NEW-F, SVR-HR]

Thank you for your attention!

Questions?
