Generating Ground Truth for Music Mood Classification Using Mechanical Turk Jin Ha Lee & Xiao Hu JCDL 2012
Dec 01, 2014
Mood: a relatively long lasting and stable emotional state (Meyer, 1956)
Emotion? Affect?
Music mood
• Recently received a lot of attention in the MIR (Music Information Retrieval) domain
• “Audio Music Mood Classification” task in MIREX (Music Information Retrieval Evaluation eXchange), starting in 2007
• Critical for developing MDL
• Evaluation is based on ground truth
Passionate
Bittersweet
Bittersweet
Bittersweet
More is better!
However, generating ground truth based on human input is
expensive and time-consuming
How is it done in MIREX?
• A web-based survey system called E6K
• Invitations posted to MIREX and music-ir mailing lists in order to recruit volunteers
Can we use the
CROWD instead of MUSIC EXPERTS?
Is there a
better way?
1. How do music mood classification
results obtained from Mechanical Turk compare to those collected
from music experts in MIREX?
2. How different or similar are the
evaluation outcomes for
the MIREX AMC task when based on ground truth collected from Mechanical Turk vs. E6K?
Workers (Turkers)
Task Requester
Amazon Mechanical Turk (MTurk)
Cluster1 passionate, rousing, confident, boisterous, rowdy
Cluster2 cheerful, fun, rollicking, sweet, amiable/good natured
Cluster3 bittersweet, poignant, wistful, literate, autumnal, brooding
Cluster4 humorous, silly, campy, quirky, whimsical, witty, wry
Cluster5 aggressive, intense, fiery, tense/anxious, volatile, visceral
TASK: Listen to 30-second music clips → Select one of the five mood clusters
Qualification test
Consistency check
Review process
Basic Stats
1250 songs x 2 judgments = 2500 unique mood judgments
1 HIT = 25 songs
186 HITs collected - 86 HITs rejected = 100 HITs accepted
Stats on Collecting Data (E6K = Evalutron 6000)
Average Time Spent on Each Music Clip: 21.54 seconds (E6K), 17.46 seconds (MTurk)
Total Time for Collecting All Judgments: 38 days + additional in-house assessment (E6K), 19 days (MTurk)
Cost for Collecting All Judgments: $0 (E6K), $60.50 (MTurk)
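The MTurk figures on these slides can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the slides; the variable names are illustrative, and it assumes (the slides do not say) that the $60.50 covers exactly the accepted HITs.

```python
# Figures quoted on the slides.
SONGS = 1250
JUDGMENTS_PER_SONG = 2
SONGS_PER_HIT = 25
HITS_COLLECTED = 186
HITS_REJECTED = 86
MTURK_COST_USD = 60.50
MTURK_DAYS = 19

total_judgments = SONGS * JUDGMENTS_PER_SONG       # 2500
hits_needed = total_judgments // SONGS_PER_HIT     # 100
hits_accepted = HITS_COLLECTED - HITS_REJECTED     # 100
acceptance_rate = hits_accepted / HITS_COLLECTED

# Assumption: the total cost is spread over the accepted HITs only.
cost_per_hit = MTURK_COST_USD / hits_accepted
cost_per_judgment = MTURK_COST_USD / total_judgments
judgments_per_day = total_judgments / MTURK_DAYS

print(f"HITs needed / accepted: {hits_needed} / {hits_accepted}")
print(f"Acceptance rate: {acceptance_rate:.1%}")
print(f"Cost per accepted HIT: ${cost_per_hit:.3f}")
print(f"Cost per judgment: ${cost_per_judgment:.4f}")
print(f"Judgments per day: {judgments_per_day:.1f}")
```

The accepted HITs exactly cover the 2500 judgments required, which suggests rejected HITs were re-posted until every song had two judgments; that reading is an inference, not something the slides state.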
Comparison of E6K and MTurk data
Cluster E6K MTurk Diff. in % (E6K-MTurk)
Cluster1 405 (16.4%) 450 (18.0%) -1.6%
Cluster2 472 (19.1%) 536 (21.4%) -2.3%
Cluster3 542 (22.0%) 622 (24.9%) -2.9%
Cluster4 412 (16.7%) 367 (14.7%) 2.0%
Cluster5 400 (16.2%) 403 (16.1%) 0.1%
Other 237 (9.6%) 122 (4.9%) 4.7%
Total 2468 2500 -
Number of Judgments and Distribution across Clusters
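As a quick check on the table above, the percentages and differences can be recomputed from the raw counts. This is a minimal sketch; the dictionary layout and names are illustrative, and the differences are taken between the rounded percentages, which is how the table's last column appears to have been derived.

```python
# Judgment counts per cluster, copied from the table: (E6K, MTurk).
counts = {
    "Cluster1": (405, 450),
    "Cluster2": (472, 536),
    "Cluster3": (542, 622),
    "Cluster4": (412, 367),
    "Cluster5": (400, 403),
    "Other": (237, 122),
}

e6k_total = sum(e6k for e6k, _ in counts.values())
mturk_total = sum(mturk for _, mturk in counts.values())

rows = {}
for label, (e6k, mturk) in counts.items():
    e6k_pct = round(100 * e6k / e6k_total, 1)
    mturk_pct = round(100 * mturk / mturk_total, 1)
    # Difference in percentage points, E6K minus MTurk.
    rows[label] = (e6k_pct, mturk_pct, round(e6k_pct - mturk_pct, 1))

print(f"Totals: E6K={e6k_total}, MTurk={mturk_total}")
for label, (e6k_pct, mturk_pct, diff) in rows.items():
    print(f"{label:9s} {e6k_pct:5.1f}% {mturk_pct:5.1f}% {diff:+5.1f}")
```

The largest gap is in the "Other" category, which the E6K assessors chose about twice as often as the Turkers.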
Distribution of Agreement
Cluster E6K MTurk Both
Cluster1 121 89 29
Cluster2 130 131 44
Cluster3 163 216 91
Cluster4 121 85 42
Cluster5 126 121 64
Total 661 642 270
Confusion among the Clusters
Clusters Disagreed in E6K Disagreed in MTurk
Cluster 1 & Cluster 2 20 95
Cluster 2 & Cluster 4 31 86
Cluster 1 & Cluster 5 13 74
⁞ ⁞ ⁞
Cluster 3 & Cluster 4 6 27
Cluster 2 & Cluster 5 1 22
Cluster 3 & Cluster 5 1 20
Total 253 595
[Figure: Clusters 1 to 5 placed on Russell’s model]
System Performance
System (E6K) Average accuracy (E6K) System (MTurk) Average accuracy (MTurk)
CL 0.65 GT 0.66
GT 0.64 CL 0.63
TL 0.64 TL 0.63
ME1 0.61 ME1 0.57
ME2 0.61 ME2 0.57
IM2 0.57 IM2 0.57
KL1 0.56 KL1 0.55
IM1 0.53 IM1 0.54
KL2 0.29 KL2 0.29
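One simple way to quantify how closely the two system rankings agree is Spearman's rank correlation. Note that this is a swapped-in illustration, not the analysis used in the talk (which reports a TK-HSD comparison). The sketch below uses only the accuracies listed above, gives tied scores their average rank, and computes the Pearson correlation of the ranks; the function names are hypothetical.

```python
from statistics import mean

# Average accuracies per system, copied from the slide.
e6k = {"CL": 0.65, "GT": 0.64, "TL": 0.64, "ME1": 0.61, "ME2": 0.61,
       "IM2": 0.57, "KL1": 0.56, "IM1": 0.53, "KL2": 0.29}
mturk = {"GT": 0.66, "CL": 0.63, "TL": 0.63, "ME1": 0.57, "ME2": 0.57,
         "IM2": 0.57, "KL1": 0.55, "IM1": 0.54, "KL2": 0.29}


def average_ranks(scores):
    """Rank systems from best (1) to worst; tied scores share their mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the run of systems tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        tied_rank = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


systems = sorted(e6k)
r_e6k = average_ranks(e6k)
r_mturk = average_ranks(mturk)
rho = pearson([r_e6k[s] for s in systems], [r_mturk[s] for s in systems])
print(f"Spearman rho between E6K and MTurk rankings: {rho:.2f}")
```

With only nine systems and several ties, the coefficient is a rough indicator rather than a significance test.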
TK-HSD Rank Comparison
[Figure: TK-HSD rank comparison, E6K vs. MTurk]
Conclusion
• Overall the human judgments from E6K and MTurk showed similar patterns:
– Judgment distribution across five mood clusters
– Agreement distribution across clusters
– Confusion among clusters
• System performance rankings from E6K and MTurk were also comparable
Conclusion (Cont’d.)
• However, the combined ground truth from E6K and MTurk is only about 60% of the size of the original E6K ground truth
• Mood is a highly subjective feature for describing and organizing music
• Other means for judging the moods should be explored (e.g., ranking)
Future work
• In-depth interviews with users to investigate factors affecting people’s judgments on music mood
• A more controlled study with different user groups
Questions?