Generating Ground Truth for Music Mood Classification Using Mechanical Turk

Jin Ha Lee & Xiao Hu

JCDL 2012

Mood: a relatively long lasting and stable emotional state (Meyer, 1956)

Emotion? Affect?

Music mood

• Recently received a lot of attention in the MIR (Music Information Retrieval) domain

• “Audio Music Mood Classification” task in MIREX (Music Information Retrieval Evaluation eXchange), starting in 2007

• Critical for developing MDL

• Evaluation is based on ground truth

[Figure: example mood labels: Passionate, Bittersweet, Bittersweet, Bittersweet]

More is better!

However, generating ground truth based on human input is expensive and time-consuming

How is it done in MIREX?

• A web-based survey system called Evalutron 6000 (E6K)

• Invitations posted to MIREX and music-ir mailing lists in order to recruit volunteers

Can we use the CROWD instead of MUSIC EXPERTS?

Is there a better way?

1. How do music mood classification results obtained from Mechanical Turk compare to those collected from music experts in MIREX?

2. How different or similar are the evaluation outcomes for the MIREX AMC task when based on ground truth collected from Mechanical Turk vs. E6K?

Workers (Turkers)

Task Requester

Amazon Mechanical Turk (MTurk)

Cluster1 passionate, rousing, confident, boisterous, rowdy

Cluster2 cheerful, fun, rollicking, sweet, amiable/good natured

Cluster3 bittersweet, poignant, wistful, literate, autumnal, brooding

Cluster4 humorous, silly, campy, quirky, whimsical, witty, wry

Cluster5 aggressive, intense, fiery, tense/anxious, volatile, visceral

TASK: Listen to 30-second music clips → Select one of the five mood clusters ↓

Qualification test

Consistency check

Review process

Basic Stats

1250 songs × 2 judgments = 2500 unique mood judgments

186 HITs collected − 86 HITs rejected = 100 HITs accepted

1 HIT = 25 songs
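The figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below only restates the numbers from the slide; the variable names are illustrative, and the rejection rate is a derived value that does not appear on the slide.

```python
# Figures copied from the slide; names are illustrative.
SONGS = 1250
JUDGMENTS_PER_SONG = 2
SONGS_PER_HIT = 25
HITS_COLLECTED = 186
HITS_REJECTED = 86

# Judgments needed to cover every song twice.
total_judgments = SONGS * JUDGMENTS_PER_SONG

# HITs that survived the review process.
hits_accepted = HITS_COLLECTED - HITS_REJECTED

# Judgments contributed by the accepted HITs.
judgments_from_accepted_hits = hits_accepted * SONGS_PER_HIT

# Derived value (not on the slide): share of collected HITs that were rejected.
rejection_rate = HITS_REJECTED / HITS_COLLECTED

print(total_judgments, hits_accepted, judgments_from_accepted_hits)
print(f"rejection rate: {rejection_rate:.1%}")
```

The accepted HITs supply exactly the 2500 judgments required, and a little under half of the collected HITs were rejected.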

Stats on Collecting Data

Measure                                    Evalutron 6000 (E6K)                          MTurk
Average Time Spent on Each Music Clip      21.54 seconds                                 17.46 seconds
Total Time for Collecting All Judgments    38 days (+ additional in-house assessment)    19 days
Cost for Collecting All Judgments          $0                                            $60.50
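To put these figures on a per-judgment footing, the sketch below derives the MTurk cost per judgment and a rough estimate of total assessor time. It assumes the first value in each row belongs to E6K and the second to MTurk (consistent with a $0 cost for volunteer assessors), takes the judgment counts (2468 for E6K, 2500 for MTurk) from the comparison table, and treats "average time per clip × number of judgments" as an approximation that ignores rejected HITs. None of the derived values appear on the slides.

```python
# Values from the slides; the column assignment (E6K first, MTurk second)
# is an assumption, as is using judgment counts from the comparison table.
MTURK_JUDGMENTS = 2500
MTURK_COST_USD = 60.50
MTURK_SECONDS_PER_CLIP = 17.46

E6K_JUDGMENTS = 2468
E6K_SECONDS_PER_CLIP = 21.54

# Derived: what one accepted MTurk judgment cost.
cost_per_judgment = MTURK_COST_USD / MTURK_JUDGMENTS

# Derived: rough total assessor time, in hours, for accepted judgments only.
mturk_hours = MTURK_JUDGMENTS * MTURK_SECONDS_PER_CLIP / 3600
e6k_hours = E6K_JUDGMENTS * E6K_SECONDS_PER_CLIP / 3600

print(f"MTurk cost per judgment: ${cost_per_judgment:.4f}")
print(f"MTurk assessor time: {mturk_hours:.1f} h, E6K: {e6k_hours:.1f} h")
```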

Comparison of E6K and MTurk data

Cluster E6K MTurk Diff. in % (E6K-MTurk)

Cluster1 405 (16.4%) 450 (18.0%) -1.6%

Cluster2 472 (19.1%) 536 (21.4%) -2.3%

Cluster3 542 (22.0%) 622 (24.9%) -2.9%

Cluster4 412 (16.7%) 367 (14.7%) 2.0%

Cluster5 400 (16.2%) 403 (16.1%) 0.1%

Other 237 (9.6%) 122 (4.9%) 4.7%

Total 2468 2500 -

Number of Judgments and Distribution across Clusters
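The percentages and differences in this table can be reproduced directly from the raw counts. The following sketch does that; the helper name `shares` is illustrative, and the difference column is computed from the rounded percentages, as the table appears to do.

```python
# Raw judgment counts per cluster, copied from the table.
e6k = {"Cluster1": 405, "Cluster2": 472, "Cluster3": 542,
       "Cluster4": 412, "Cluster5": 400, "Other": 237}
mturk = {"Cluster1": 450, "Cluster2": 536, "Cluster3": 622,
         "Cluster4": 367, "Cluster5": 403, "Other": 122}


def shares(counts):
    """Return each label's share of all judgments, in percent (1 decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


e6k_pct = shares(e6k)
mturk_pct = shares(mturk)

# Difference in percentage points (E6K minus MTurk), from rounded shares.
diff = {label: round(e6k_pct[label] - mturk_pct[label], 1) for label in e6k}

for label in e6k:
    print(f"{label:9s} {e6k_pct[label]:5.1f}% {mturk_pct[label]:5.1f}% {diff[label]:+.1f}")
```

The largest gap is in the "Other" category, which E6K assessors chose roughly twice as often as Turkers.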

Distribution of Agreement

Cluster E6K MTurk Both

Cluster1 121 89 29

Cluster2 130 131 44

Cluster3 163 216 91

Cluster4 121 85 42

Cluster5 126 121 64

Total 661 642 270
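One way to read this table is to ask how the agreed songs are spread across clusters in each source. The sketch below checks the column totals and computes each cluster's share of the agreed songs; the shares are derived values that do not appear on the slide, and the names are illustrative.

```python
# Agreed-song counts per cluster, copied from the table.
agreed = {
    "E6K":   {"Cluster1": 121, "Cluster2": 130, "Cluster3": 163, "Cluster4": 121, "Cluster5": 126},
    "MTurk": {"Cluster1": 89,  "Cluster2": 131, "Cluster3": 216, "Cluster4": 85,  "Cluster5": 121},
    "Both":  {"Cluster1": 29,  "Cluster2": 44,  "Cluster3": 91,  "Cluster4": 42,  "Cluster5": 64},
}

# Column totals, which should match the "Total" row.
totals = {source: sum(counts.values()) for source, counts in agreed.items()}

# Derived: each cluster's share of the agreed songs within a source, in percent.
share = {
    source: {c: round(100 * n / totals[source], 1) for c, n in counts.items()}
    for source, counts in agreed.items()
}

print(totals)
for source in agreed:
    print(source, share[source])
```

Under this reading, Cluster3 accounts for about a third of the agreed songs in MTurk, compared with about a quarter in E6K.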

Confusion among the Clusters

Clusters                 Disagreed in E6K    Disagreed in MTurk
Cluster 1 & Cluster 2    20                  95
⁞                        ⁞                   ⁞
Total                    253                 595
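Agreement and confusion counts of this kind can be tallied from pairs of judgments per song. The sketch below is a minimal illustration on hypothetical toy data, assuming each song has exactly two judgments and that a song counts as "agreed" when the two labels match; the slides do not spell out the exact procedure used for either data set.

```python
from collections import Counter

# Hypothetical toy data: song id -> the two cluster labels it received.
judgments = {
    "song_a": ("Cluster3", "Cluster3"),
    "song_b": ("Cluster1", "Cluster2"),
    "song_c": ("Cluster5", "Cluster5"),
    "song_d": ("Cluster3", "Cluster3"),
    "song_e": ("Cluster2", "Cluster1"),
    "song_f": ("Cluster4", "Other"),
}

agreed = Counter()    # cluster -> number of songs where both labels match
confused = Counter()  # unordered cluster pair -> number of songs that split

for first, second in judgments.values():
    if first == second:
        agreed[first] += 1
    else:
        # Sort the pair so (1, 2) and (2, 1) count as the same confusion.
        confused[tuple(sorted((first, second)))] += 1

print(dict(agreed))
print(dict(confused))
```

Sorting each disagreeing pair before counting makes the confusion table symmetric, so "Cluster 1 & Cluster 2" covers both orders of the two labels.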

[Figure: Cluster 1, Cluster 2, Cluster 5, Cluster 4, and Cluster 3 plotted on Russell’s model]

System Performance

E6K     Average accuracy    MTurk   Average accuracy
CL      0.65                GT      0.66
GT      0.64                CL      0.63
TL      0.64                TL      0.63
ME1     0.61                ME1     0.57
ME2     0.61                ME2     0.57
IM2     0.57                IM2     0.57
KL1     0.56                KL1     0.55
IM1     0.53                IM1     0.54
KL2     0.29                KL2     0.29
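A quick way to compare the two rankings is to count the system pairs whose order flips between the two ground truths. The sketch below reads the figures above as system/accuracy pairs for E6K and MTurk (an interpretation of the slide layout) and ignores pairs that are tied in either list. It is a simple pairwise check, not the TK-HSD test used in the presentation.

```python
from itertools import combinations

# Average accuracy per system under each ground truth, as read from the slide.
e6k = {"CL": 0.65, "GT": 0.64, "TL": 0.64, "ME1": 0.61, "ME2": 0.61,
       "IM2": 0.57, "KL1": 0.56, "IM1": 0.53, "KL2": 0.29}
mturk = {"GT": 0.66, "CL": 0.63, "TL": 0.63, "ME1": 0.57, "ME2": 0.57,
         "IM2": 0.57, "KL1": 0.55, "IM1": 0.54, "KL2": 0.29}


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Pairs ranked in opposite order by the two ground truths (ties skipped).
discordant = [
    (a, b)
    for a, b in combinations(e6k, 2)
    if sign(e6k[a] - e6k[b]) * sign(mturk[a] - mturk[b]) < 0
]

# Largest change in any single system's accuracy.
largest_gap = max(abs(e6k[s] - mturk[s]) for s in e6k)

print("discordant pairs:", discordant)
print(f"largest accuracy gap: {largest_gap:.2f}")
```

Only the CL/GT pair changes order, and no system's accuracy moves by more than 0.04.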

[Figure: TK-HSD Rank Comparison, E6K vs. MTurk]

Conclusion

• Overall, the human judgments from E6K and MTurk showed similar patterns:

– Judgment distribution across five mood clusters

– Agreement distribution across clusters

– Confusion among clusters

• System performance rankings from E6K and MTurk were also comparable

Conclusion (Cont’d.)

• However, combined ground truth from E6K and MTurk is only about 60% the size of the original E6K ground truth

• Mood is a highly subjective feature for describing and organizing music

• Other means for judging the moods should be explored (e.g., ranking)

Future work

• In-depth interviews with users to investigate factors affecting people’s judgments on music mood

• More controlled study with different user groups

Questions?
