MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music


“Emotion in Music” organizers' endeavor at the Crowdsourcing task: A Multimodal Approach to Drop Detection in Electronic Dance Music

Anna Aljanaki2*, Mohammad Soleymani1*, Frans Wiering2, Remco C. Veltkamp2

1 University of Geneva, Switzerland
2 Utrecht University, Netherlands

* Equal technical contributions


Problem definition

• Given an electronic music excerpt, its timed comments, and labels from MTurk, automatically identify whether the excerpt fully or partially contains a drop.


Material

• 15 second excerpts with timed comments including the term “drop”

• MPEG Layer 3 files

• Metadata including the comments

• Labels from the crowd

• 164 excerpts with full agreement (105: full drop; 4: partial drop; 54: no drop)

• 70 excerpts with no agreement


Solutions

• Labels from crowdsourcing
  1. Majority vote (MV)
  2. Dawid-Skene (DS)

• Labels from crowdsourcing + comments
  3. Naïve Bayesian classifier

• Labels from crowdsourcing + content
  4. Logistic regression


Using labels (wisdom of crowd)

Aashish Sheshadri and Matthew Lease. SQUARE: A Benchmark for Research on Computing Crowd Consensus. In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), 2013


Solution 1: Majority vote

• 3 labels each

• Calculate the majority

• If there is no agreement then the estimated label is 2 (partial drop)
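
The rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submitted run; the function name `majority_vote` and the constant `PARTIAL_DROP` are hypothetical, and labels are assumed to be the integers 1 (full drop), 2 (partial drop), and 3 (no drop).

```python
from collections import Counter

PARTIAL_DROP = 2  # fallback label from the slide when the workers disagree


def majority_vote(labels):
    """Return the label given by more than half of the workers.

    If no label reaches a strict majority (e.g. three different
    labels from three workers), fall back to PARTIAL_DROP.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else PARTIAL_DROP


print(majority_vote([1, 1, 3]))  # two of three workers say "full drop"
print(majority_vote([1, 2, 3]))  # no agreement, falls back to partial drop
```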


Solution 2: Dawid-Skene

• Dawid and Skene (1979) proposed a method to combine a number of uncertain decisions (clinician-patient)

• The method estimates a confusion matrix for every labeler using Expectation-Maximization, initialized by majority vote; the matrix entries are probability estimates.

• We then look at the probability of the true response given the label from each of the three workers and pick the highest one.

• Get-Another-Label toolbox: https://github.com/ipeirotis/Get-Another-Label
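
To make the EM procedure concrete, here is a small pure-Python sketch of a Dawid-Skene style estimator. It is not the Get-Another-Label toolbox and not the code behind the submitted run. The function name `dawid_skene`, the input layout `{item: {worker: label}}`, the additive smoothing constant, and the fixed iteration count are all assumptions made for illustration. Posteriors are initialized from the per-item vote fractions (a soft version of majority vote), and the final label is taken as the class with the highest posterior.

```python
from collections import defaultdict


def dawid_skene(votes, classes=(1, 2, 3), n_iter=20, smooth=0.01):
    """votes: {item: {worker: label}} -> {item: {class: posterior}}."""
    # Initialization: per-item vote fractions (soft majority vote).
    post = {
        item: {c: sum(1 for l in wl.values() if l == c) / len(wl)
               for c in classes}
        for item, wl in votes.items()
    }
    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker,
        # conf[worker][true_class][given_label], with light smoothing.
        prior = {c: sum(p[c] for p in post.values()) / len(post)
                 for c in classes}
        conf = defaultdict(
            lambda: {c: {l: smooth for l in classes} for c in classes})
        for item, wl in votes.items():
            for worker, label in wl.items():
                for c in classes:
                    conf[worker][c][label] += post[item][c]
        for worker in conf:
            for c in classes:
                total = sum(conf[worker][c].values())
                for l in classes:
                    conf[worker][c][l] /= total
        # E-step: posterior over the true class of every item.
        for item, wl in votes.items():
            score = {}
            for c in classes:
                s = prior[c]
                for worker, label in wl.items():
                    s *= conf[worker][c][label]
                score[c] = s
            z = sum(score.values())
            post[item] = {c: score[c] / z for c in classes}
    return post


votes = {
    "e1": {"a": 1, "b": 1, "c": 3},
    "e2": {"a": 3, "b": 3, "c": 3},
    "e3": {"a": 1, "b": 1, "c": 1},
    "e4": {"a": 3, "b": 3, "c": 1},
}
posteriors = dawid_skene(votes)
labels = {item: max(p, key=p.get) for item, p in posteriors.items()}
print(labels)
```

In this toy input, worker "c" disagrees with the other two on half of the items, so the estimated confusion matrix gives that worker's labels less weight than those of "a" and "b".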


Solution 3: Majority Vote + comments

• For the excerpts with full or partial agreement we do not touch the MV labels

• For the remaining 70 excerpts:
  – Features:
    • Labels from workers
    • Number of times comments contain the term “drop” (we did not normalize by the number of comments; it was a mistake!)
  – Naïve Bayesian classifier trained on the samples with partial or full agreement
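
The comment feature, and the normalization that the slide says was missing, can be illustrated as follows. This is only a sketch of the feature, not of the Naïve Bayesian classifier itself. The function name `drop_mentions` and the choice to match any word starting with "drop" (so "drops" and "dropped" also count) are assumptions.

```python
import re

# Matches words that start with "drop", case-insensitively (assumption).
DROP_RE = re.compile(r"\bdrop", re.IGNORECASE)


def drop_mentions(comments, normalize=False):
    """Count the comments that mention "drop".

    With normalize=True the count is divided by the number of
    comments, which is the normalization the slide says was skipped.
    """
    hits = sum(1 for text in comments if DROP_RE.search(text))
    if normalize:
        return hits / len(comments) if comments else 0.0
    return hits


comments = ["Sick DROP!", "love this", "the drop at 1:02", "raindrops"]
print(drop_mentions(comments))                  # raw count
print(drop_mentions(comments, normalize=True))  # fraction of comments
```

Without normalization, an excerpt with many comments can get a high count simply because it is popular, which is presumably why the authors call the omission a mistake.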


Solution 4: MV + acoustic (1)

• Again only the samples with no agreement were changed.

• Trained on the samples with full agreement (164 samples)

• Assumption: there is a moment of silence or a quieter segment right after the drop

• Energy extracted from 100 ms segments and smoothed
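
The energy extraction step might look like the sketch below. The slide does not say how energy was defined or which smoothing was applied, so the mean squared amplitude per non-overlapping 100 ms frame and the centered moving average are assumptions, as are the function names `frame_energy` and `smooth`. The input is assumed to be a list of mono samples that have already been decoded from the MP3 file.

```python
def frame_energy(samples, sample_rate, frame_ms=100):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    size = int(sample_rate * frame_ms / 1000)
    return [
        sum(x * x for x in samples[i:i + size]) / size
        for i in range(0, len(samples) - size + 1, size)
    ]


def smooth(values, width=3):
    """Centered moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


# Toy signal: 100 ms of full-scale samples followed by 100 ms of silence.
signal = [1.0] * 1000 + [0.0] * 1000
energy = frame_energy(signal, sample_rate=10000)
envelope = smooth(energy)
print(energy, envelope)
```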



• Features:
  – The value of the biggest local minimum in an excerpt
  – The ratio of the biggest minimum to an average minimum
  – The number of potential drop events, as detected by a decrease in loudness bigger than a threshold
  – The dynamic range of the excerpt

• Logistic regression for binary classification (we did not consider class 2 due to not having enough samples)
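
One possible reading of the four features is sketched below, computed from a smoothed energy envelope given as a list of floats. Several details are assumptions: "biggest local minimum" is interpreted as the lowest-valued local minimum, a drop event is counted whenever the energy falls by more than `drop_threshold` between two adjacent frames, and the names `envelope_features` and `drop_threshold` are hypothetical. The logistic regression step is not shown.

```python
def envelope_features(env, drop_threshold=0.5):
    """Four loudness features from a smoothed energy envelope."""
    # Local minima: lower than the previous frame, not higher than the next.
    minima = [env[i] for i in range(1, len(env) - 1)
              if env[i] < env[i - 1] and env[i] <= env[i + 1]]
    # Assumption: the "biggest" minimum is the deepest (lowest) one.
    deepest = min(minima) if minima else min(env)
    mean_min = sum(minima) / len(minima) if minima else deepest
    # Potential drop events: frame-to-frame decreases above the threshold.
    n_events = sum(1 for a, b in zip(env, env[1:]) if a - b > drop_threshold)
    return {
        "deepest_minimum": deepest,
        "deepest_to_mean_minimum": deepest / mean_min if mean_min else 0.0,
        "n_drop_events": n_events,
        "dynamic_range": max(env) - min(env),
    }


features = envelope_features([1.0, 0.2, 0.9, 1.0, 0.4, 1.0])
print(features)
```

The resulting dictionary could then be turned into a feature vector for any binary classifier; the threshold value here is arbitrary and would need tuning on real energy values.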


Results

Run  Method         F1-score  Full drop (1)  Part. drop (2)  No drop (3)
1    Majority vote  0.69      0.72           0.31            0.75
2    Dawid-Skene    0.69      0.72           0.31            0.75
3    MV + comments  0.70      0.73           0.28            0.76
4    MV + acoustic  0.71      0.72           0.27            0.79

No significant improvement compared to majority vote


Lessons learned

• In the presence of non-malicious workers and enough labels, majority vote is very hard to beat

• The scarcity of samples from the second class (partial drop) reduces our performance

• In the future, a separate development set and evaluation set would be beneficial


Summary

• We primarily used the labels from MTurk since we believed they would be superior

• We proposed possible approaches taking advantage of the metadata and content when MV is indecisive

• As expected, we did not beat the majority vote

