Page 1
MIR for Jazz - Dan Ellis - 2012-11-15
1. Music Information Retrieval
2. Automatic Tagging
3. Musical Content
4. Future Work
Music Information Retrieval for Jazz
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,thierry}@ee.columbia.edu http://labrosa.ee.columbia.edu/
Page 2
Machine Listening
• Extracting useful information from sound... like (we) animals do
[Figure: task-by-domain grid - rows: Detect, Classify, Describe; columns: Environmental Sound, Speech, Music. Cells include VAD, Speech/Music discrimination, Environment Awareness, ASR, Emotion, Automatic Narration, Music Transcription, Music Recommendation, and "Sound Intelligence"]
Page 3
1. The Problem
• We have a lot of music. Can computers help?
• Applications: archive organization ◦ music recommendation ◦ musicological insight?
Page 4
Music Information Retrieval (MIR)
• Small field that has grown since ~2000
musicologists, engineers, librarians ◦ significant commercial interest
• MIR as the musical analog of text IR: find material in large archives
• Popular tasks: genre classification ◦ chord, melody, full transcription ◦ music recommendation
• Annual evaluations ◦ "standard" test corpora - pop
Page 5
2. Automatic Tagging
• Statistical Pattern Recognition:
finding matches to training examples
• Need: feature design ◦ labeled training examples
• Applications: genre ◦ instrumentation ◦ artist ◦ studio ...
[Figure: pattern-recognition pipeline - sensor → pre-processing/segmentation → feature extraction → classification → post-processing, mapping signal → segment → feature vector → class]
Page 6
Features: MFCC
• Mel-Frequency Cepstral Coefficients
the standard features from speech recognition
[Figure: MFCC computation - Sound → FFT X[k] → mel-scale frequency warp → log|X[k]| → IFFT → truncate → MFCCs (spectra → audspec → cepstra), illustrated with waveform, linear spectrum, mel spectrum, and cepstrum plots]
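The chain above can be sketched in a few lines of numpy. This is a minimal illustration, not the exact implementation behind the slides: the filter count, FFT size, and O'Shaughnessy mel formula are assumed defaults.

```python
import numpy as np

def hz_to_mel(f):
    # common mel-scale formula (assumed; several variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=20, n_mfcc=13):
    """MFCC of one frame: |FFT| -> mel filterbank -> log -> DCT -> truncate."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))   # magnitude spectrum
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # triangular filters spaced evenly on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, ctr, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        fbank[i] = np.maximum(0.0, np.minimum((freqs - lo) / (ctr - lo),
                                              (hi - freqs) / (hi - ctr)))
    logmel = np.log(fbank @ spec + 1e-10)                   # "audspec"
    # DCT-II gives the cepstrum; keep only the first n_mfcc coefficients
    n_idx = np.arange(n_mels)[:, None]                      # cepstral index
    k_idx = np.arange(n_mels)[None, :]                      # mel-band index
    dct = np.cos(np.pi / n_mels * (k_idx + 0.5) * n_idx)
    return (dct @ logmel)[:n_mfcc]
```

Truncating the cepstrum is what discards fine spectral detail (pitch, harmonics) while keeping the broad envelope.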
Page 7
Representing Audio
• MFCCs are short-time features (25 ms)
• Sound is a "trajectory" in MFCC space
• Describe a whole track by its statistics
[Figure: audio → spectrogram → MFCC features → per-track MFCC covariance matrix]
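Summarizing a track by its feature statistics is straightforward; a minimal sketch that collapses a variable-length MFCC trajectory into one fixed-length vector (mean plus the upper triangle of the covariance, so the vector can feed a standard classifier):

```python
import numpy as np

def track_summary(mfccs):
    """mfccs: (n_frames x n_dims) trajectory. Returns per-dimension means
    concatenated with the upper triangle of the covariance matrix."""
    mu = mfccs.mean(axis=0)
    cov = np.cov(mfccs, rowvar=False)
    iu = np.triu_indices(cov.shape[0])      # covariance is symmetric
    return np.concatenate([mu, cov[iu]])
```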
Page 8
MFCCs for Music
• Can resynthesize MFCCs by shaping noise
gives an idea of the information retained
[Figure: "Freddie Freeloader" - original spectrogram vs. MFCC noise resynthesis]
Page 9
Ground Truth
• MajorMiner: free-text tags for 10 s clips
400 users, 7500 unique tags, 70,000 taggings
• Example: drum, bass, piano, jazz, slow, instrumental, saxophone, soft, quiet, club, ballad, smooth, soulful, easy_listening, swing, improvisation, 60s, cool, light
Mandel & Ellis ’08
Page 10
Classification
• MFCC features + human ground truth + standard machine learning tools
[Figure: tagging pipeline - sound → chop into 10 s blocks → MFCC (20 dims) + Δ + Δ² → mean μ (60 dims) and covariance Σ (399 samples) → standardize across train set → one-vs-all SVM (C, γ) → average precision vs. ground truth]
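The one-vs-all stage trains one binary classifier per tag. A minimal numpy sketch of that structure, substituting a ridge-regularized linear scorer for the slides' per-tag SVMs (the SVMs themselves, and the C/γ grid search, are not reproduced here):

```python
import numpy as np

class OneVsAllRidge:
    """One scorer per tag, fit jointly by regularized least squares on
    +/-1 targets. A stand-in for the per-tag SVMs in the pipeline."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def fit(self, X, Y):
        # X: (n_clips, n_feats) standardized features; Y: (n_clips, n_tags) in {0,1}
        Xb = np.hstack([X, np.ones((len(X), 1))])        # bias column
        T = 2.0 * Y - 1.0                                # targets in {-1, +1}
        A = Xb.T @ Xb + self.lam * np.eye(Xb.shape[1])
        self.W = np.linalg.solve(A, Xb.T @ T)            # one weight vector per tag
        return self

    def scores(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return Xb @ self.W                               # higher = more likely tagged
```

Ranking clips by each tag's score is what the average-precision evaluation measures.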
Page 11
Classification Results
• Classifiers trained from the top 50 tags
[Figure: results for the top 50 tags (drum, guitar, male, synth, rock, electronic, pop, vocal, bass, female, dance, techno, piano, jazz, hip_hop, rap, ...), with tag scores over time for "Soul Eyes"]
Page 12
3. Musical Content
• MFCCs (and speech recognizers) don't respect pitch
pitch is important ◦ visible in the spectrogram
• Pitch-related tasks: note transcription ◦ chord transcription ◦ matching by musical content ("cover songs")
[Figure: log-frequency spectrogram excerpt showing pitched notes]
Page 13
Note Transcription
Poliner & Ellis ‘05,’06,’07
Classification:
• N binary SVMs (one for each note)
• Independent frame-level classification on a 10 ms grid
• Distance to the class boundary as posterior
Temporal smoothing:
• Two-state (on/off) independent HMM for each note; parameters learned from training data
• Find the Viterbi sequence for each note
Training data and features:
• MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio)
• Normalized magnitude STFT
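The temporal-smoothing stage above is a two-state Viterbi decode per note. A minimal sketch in numpy; the self-transition probability `p_stay` is an assumed value, whereas the real system learns transitions from training data:

```python
import numpy as np

def smooth_posteriors(post, p_stay=0.9):
    """Two-state (off/on) Viterbi smoothing of frame-level note posteriors.
    post: array of P(note on) per frame. Returns 0/1 state per frame."""
    n = len(post)
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))
    logB = np.log(np.stack([1 - post, post], axis=1) + 1e-12)  # emission log-probs
    delta = logB[0].copy()
    back = np.zeros((n, 2), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + logA          # cand[i, j]: best path ending i then j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[t]
    states = np.zeros(n, dtype=int)
    states[-1] = delta.argmax()
    for t in range(n - 2, -1, -1):            # backtrace
        states[t] = back[t + 1, states[t + 1]]
    return states                              # 1 = note on
```

The sticky self-transitions suppress single-frame flips that independent frame classification produces.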
Page 14
Polyphonic Transcription
• Real music excerpts + ground truth
MIREX 2007
Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
[Figure: MIREX 2007 scores - frame-level precision, recall, accuracy, and error breakdown (Etot, Esubs, Emiss, Efa); note-level precision, recall, average F-measure, and average overlap]
Page 15
Chroma Features
• Idea: project onto 12 semitones regardless of octave
maintains the main "musical" distinction ◦ invariant to octave equivalence ◦ no need to worry about harmonics?

C(b) = Σ_{k=0}^{N} B(12·log2(k/k0) − b) · W(k) · |X[k]|

where W(k) is a weighting and B(·) selects the FFT bins whose pitch matches chroma b (mod 12)
[Figure: linear-frequency spectrogram vs. 12-bin chromagram over time]
Fujishima 1999
Warren et al. 2003
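The projection C(b) above can be sketched directly: map each FFT bin to its semitone distance from a reference, fold mod 12, and accumulate magnitude. A minimal version with a hard bin selector B and flat weighting W over an assumed [fmin, fmax] range:

```python
import numpy as np

def chroma(frame, sr, fmin=55.0, fmax=2000.0):
    """12-bin chroma of one frame: bin 0 is the pitch class of fmin
    (55 Hz = A). fmin/fmax and the flat weighting are assumed choices."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    use = (freqs >= fmin) & (freqs <= fmax)
    # 12*log2(f/f0) is the distance in semitones from the reference
    semis = 12.0 * np.log2(freqs[use] / fmin)
    bins = np.round(semis).astype(int) % 12       # fold to one octave
    c = np.zeros(12)
    np.add.at(c, bins, mag[use])                  # accumulate magnitude per class
    return c / (c.max() + 1e-10)
```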
Page 16
Chroma Resynthesis
• Chroma describes the notes in an octave... but not the octave
• Can resynthesize by presenting all octaves... with a smooth envelope
"Shepard tones" - octave is ambiguous ◦ endless sequence illusion
[Figure: 12 Shepard-tone spectra (level/dB vs. frequency) and spectrogram of a Shepard-tone resynthesis]
y_b(t) = Σ_{o=1}^{M} W(o + b/12) · cos(2^{o + b/12} · ω0 · t)
Ellis & Poliner 2007
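The formula above sums the same pitch class across octaves under a smooth spectral envelope W. A minimal sketch with assumed values for the base frequency, octave count, and a Gaussian envelope width:

```python
import numpy as np

def shepard_tone(b, dur=1.0, sr=8000, n_oct=6, f0=27.5):
    """Synthesize chroma bin b (0-11) as a Shepard tone: one partial per
    octave at f0 * 2^(o + b/12), weighted by a smooth envelope W over
    log-frequency position. All parameter values here are assumptions."""
    t = np.arange(int(dur * sr)) / sr
    y = np.zeros_like(t)
    for o in range(1, n_oct + 1):
        pos = o + b / 12.0                    # log-frequency (octave) position
        f = f0 * 2.0 ** pos
        if f >= sr / 2:                        # skip partials above Nyquist
            continue
        w = np.exp(-0.5 * ((pos - (n_oct + 1) / 2.0) / 1.5) ** 2)  # envelope W
        y += w * np.cos(2 * np.pi * f * t)
    return y / (np.abs(y).max() + 1e-10)
```

Because the envelope tapers the lowest and highest octaves equally, shifting b by 12 reproduces the same sound, which is what makes the octave ambiguous.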
Page 17
Chroma Example
• Simple Shepard-tone resynthesis
can also reimpose the broad spectrum from MFCCs
[Figure: "Freddie Freeloader" - spectrogram, chroma, and Shepard-tone resynthesis]
Page 18
Onset Detection
• Simplest approach: the energy envelope

e(n0) = Σ_{n=−W/2}^{W/2} w[n] · |x(n + n0)|²
[Figure: spectrograms and energy envelopes for "Harnoncourt" and "Maracatu" excerpts]
Bello et al. 2005
emphasis on high frequencies? e.g. weight each frame by its spectral centroid:

Σ_f f · |X(f,t)| / Σ_f |X(f,t)|
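The windowed energy e(n0) above is easy to compute on a hop grid; its half-wave-rectified log difference is then a crude onset-strength signal. A minimal sketch (window and hop sizes are assumed values):

```python
import numpy as np

def energy_envelope(x, win=256, hop=128):
    """Windowed energy e(n0) = sum_n w[n]|x(n+n0)|^2 every `hop` samples,
    plus rectified log-energy rises as a simple onset-strength signal."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    e = np.array([np.sum(w * x[i * hop:i * hop + win] ** 2)
                  for i in range(n_frames)])
    onset = np.maximum(0.0, np.diff(np.log(e + 1e-10)))   # rises in log-energy
    return e, onset
```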
Page 19
Tempo Estimation
• Beat tracking (may) need a global tempo period τ
otherwise the problem lacks "optimal substructure"
[Figure: onset strength envelope (part), raw autocorrelation, and windowed autocorrelation with primary and secondary tempo periods marked]
• Pick the peak in the onset-envelope autocorrelation after applying a "human preference" window ◦ check for sub-beat
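That procedure can be sketched directly: autocorrelate the onset envelope, weight lags by a preference window, and take the peak. The log-Gaussian window shape, its center (120 BPM), and width are assumed values in the spirit of the slide:

```python
import numpy as np

def estimate_tempo(onset_env, fps=100.0, pref_bpm=120.0, width=1.0):
    """Global tempo (BPM) from an onset-strength envelope sampled at
    fps frames/sec, via preference-windowed autocorrelation."""
    x = onset_env - onset_env.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]     # lags 0..N-1
    lags = np.arange(1, len(ac))
    bpm = 60.0 * fps / lags
    # "human preference" window: log-Gaussian around pref_bpm
    w = np.exp(-0.5 * (np.log2(bpm / pref_bpm) / width) ** 2)
    best = lags[np.argmax(w * ac[1:])]
    return 60.0 * fps / best
```

Without the window, sub-beat and half-tempo peaks in the raw autocorrelation can win; the window encodes the listener's bias toward moderate tempos.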
Page 20
Beat Tracking by Dynamic Programming
• To optimize
define C*(t) as the best score up to time t, then build up recursively (with traceback P(t))
the final beat sequence {t_i} is the best C* + back-trace
C({t_i}) = Σ_{i=1}^{N} O(t_i) + α · Σ_{i=2}^{N} F(t_i − t_{i−1}, τ_p)

C*(t) = O(t) + max_τ { α·F(t − τ, τ_p) + C*(τ) }
P(t) = argmax_τ { α·F(t − τ, τ_p) + C*(τ) }
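The recursion above translates almost line-for-line into code. A minimal sketch with F as a squared-log penalty on deviation from the ideal period τ_p; the α value, the τ search range of half to double the period, and the positive-score gate are assumed choices:

```python
import numpy as np

def beat_track(onset, tau_p, alpha=100.0):
    """DP beat tracking: C*(t) = O(t) + max_tau [alpha*F(t-tau, tau_p) + C*(tau)],
    with F(d, tau_p) = -(log(d/tau_p))^2. onset: onset strength per frame;
    tau_p: ideal beat period in frames. Returns beat frame indices."""
    n = len(onset)
    C = onset.astype(float).copy()           # best score ending with a beat at t
    P = -np.ones(n, dtype=int)               # traceback
    lo, hi = int(round(tau_p / 2)), int(round(2 * tau_p))
    for t in range(lo, n):
        taus = np.arange(max(0, t - hi), t - lo + 1)
        F = -(np.log((t - taus) / tau_p) ** 2)        # period-deviation penalty
        scores = alpha * F + C[taus]
        best = np.argmax(scores)
        if scores[best] > 0:                 # only chain onto worthwhile paths
            C[t] = onset[t] + scores[best]
            P[t] = taus[best]
    beats = [int(np.argmax(C))]              # backtrace from the best final score
    while P[beats[-1]] >= 0:
        beats.append(P[beats[-1]])
    return beats[::-1]
```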
Page 21
Beat Tracking Results
• Prefers drums & steady tempo
[Figure: beat tracking on "Soul Eyes" - onset envelope with beat times, and autocorrelation (period in 4 ms samples)]
Page 22
Beat-Synchronous Chroma
• Record one chroma vector per beat
compact representation of harmonies
[Figure: "Freddie Freeloader" - beat-synchronous chroma, Shepard-tone resynthesis, and resynthesis with the MFCC envelope reimposed]
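Making features beat-synchronous just means averaging frame-level vectors between consecutive beat times, a minimal sketch:

```python
import numpy as np

def beat_sync(feats, beat_frames):
    """Average frame-level features (n_frames x d) between consecutive
    beats: one vector per beat, a compact tempo-normalized representation."""
    bounds = list(beat_frames) + [len(feats)]
    return np.array([feats[bounds[i]:bounds[i + 1]].mean(axis=0)
                     for i in range(len(beat_frames))])
```

Because each column now spans exactly one beat, two performances of the same tune line up beat-for-beat regardless of tempo.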
Page 23
Chord Recognition
• Beat-synchronous chroma look like chords
can we transcribe them?
• Two approaches: manual templates (prior knowledge) ◦ learned models (from training data)
[Figure: beat-synchronous chromagram with chord tones annotated - C-E-G, B-D-G, A-C-E, A-C-D-F, ...]
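The first approach, manual templates, can be sketched with binary triad patterns and cosine matching. This covers only the 24 major/minor triads; jazz harmony would need a richer vocabulary, which is exactly the question the later slides raise:

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Binary chroma templates for the 12 major and 12 minor triads
    (the "manual templates / prior knowledge" approach)."""
    temps, names = [], []
    for root in range(12):
        for name, ints in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in ints]] = 1.0
            temps.append(t / np.linalg.norm(t))
            names.append(NOTES[root] + ':' + name)
    return np.array(temps), names

def label_chroma(c, temps, names):
    """Label one chroma vector with the best cosine-matching triad."""
    c = c / (np.linalg.norm(c) + 1e-10)
    return names[int(np.argmax(temps @ c))]
```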
Page 24
Chord Recognition System
• Analogous to speech recognition
Gaussian models of features for each chord ◦ Hidden Markov Models for chord transitions
[Figure: system diagram - audio → beat track → beat-synchronous chroma (100-1600 Hz and 25-400 Hz bandpass variants) → root normalize → train: count transitions (24x24 transition matrix) and fit 24 Gaussian models from labels; test: HMM Viterbi → chord labels. Example 24x24 transition matrix over the major chords C...B and minor chords c...b]
Sheh & Ellis 2003
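The "count transitions" training box has a one-function sketch: tally labeled chord-to-chord transitions into the 24x24 matrix and normalize rows (the add-one smoothing here is an assumption, not from the slides). The Viterbi decode itself is structurally the same as the two-state note smoother shown earlier, just over 24 states:

```python
import numpy as np

def transition_matrix(label_seqs, n_states=24):
    """Estimate HMM chord-transition probabilities by counting labeled
    transitions, with Laplace (add-one) smoothing. label_seqs: iterable
    of integer chord-label sequences. Returns row-stochastic matrix."""
    counts = np.ones((n_states, n_states))        # smoothing prior
    for seq in label_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```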
Page 25
Chord Recognition
• Often works:
• But only about 60% of the time
[Figure: "Let It Be" - audio, beat-synchronous chroma, ground-truth chords (C G A:min A:min/b7 F:maj7 F:maj6 C G F C ...), and recognized chords (C G a F C G F C G a)]
Page 26
What did the models learn?
• Chord model centers (means) indicate chord 'templates':
[Figure: "PCP_ROT family model means (train18)" - learned chroma means for root-normalized chord families DIM, DOM7, MAJ, MIN, MIN7, plotted over C D E F G A B (for C-root chords)]
Page 27
Chords for Jazz
• How many types?
[Figure: "Freddie Freeloader" - log-frequency spectrogram, beat-synchronous chroma, chord likelihoods with Viterbi path, and chord-based chroma reconstruction]
Page 28
Future Work
• Matching items: cover songs / standards ◦ similar instruments, styles
• Analyzing musical content: solo transcription & modeling ◦ musical structure
• And so much more...
[Figure: cover-song matching - beat-chroma matrices for "Between the Bars" by Elliott Smith and by Glenn Phillips (offset −17 beats, transposed 2 semitones), and their pointwise product]
Page 29
Summary
• Finding musical similarity at large scale
[Figure: overview - music audio → tempo and beat, low-level features, melody and notes, key and chords → classification and similarity, music structure discovery → browsing/discovery/production and modeling/generation/curiosity]