Page 1
MIR for Jazz - Dan Ellis - 2012-11-15
1. Music Information Retrieval
2. Automatic Tagging
3. Musical Content
4. Future Work
Music Information Retrieval for Jazz
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,thierry}@ee.columbia.edu http://labrosa.ee.columbia.edu/
Page 2
Machine Listening
• Extracting useful information from sound... like (we) animals do
[Figure: task-by-domain grid - rows: Detect, Classify, Describe; columns: Environmental Sound, Speech, Music. Cells include VAD, Speech/Music discrimination, Environment Awareness, ASR, Emotion, Automatic Narration, Music Transcription, Music Recommendation, and "Sound Intelligence"]
Page 3
1. The Problem
• We have a lot of music. Can computers help?
• Applications: archive organization ◦ music recommendation ◦ musicological insight?
Page 4
Music Information Retrieval (MIR)
• Small field that has grown since ~2000
musicologists, engineers, librarians ◦ significant commercial interest
• MIR as the musical analog of text IR: find material in large archives
• Popular tasks: genre classification ◦ chord, melody, full transcription ◦ music recommendation
• Annual evaluations ◦ "standard" test corpora - pop
Page 5
2. Automatic Tagging
• Statistical Pattern Recognition:
finding matches to training examples
• Need: feature design ◦ labeled training examples
• Applications: genre ◦ instrumentation ◦ artist ◦ studio ...
[Figure: pattern-recognition pipeline - sensor → pre-processing/segmentation → feature extraction → classification → post-processing, mapping signal → segment → feature vector → class]
Page 6
Features: MFCC
• Mel-Frequency Cepstral Coefficients
the standard features from speech recognition
[Figure: MFCC computation - Sound → FFT X[k] → mel-scale frequency warp → log|X[k]| → IFFT → truncate → MFCCs (spectra → audspec → cepstra), illustrated with waveform, linear spectrum, mel spectrum, and cepstrum plots]
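The chain above can be sketched in a few lines of numpy. This is a minimal illustration, not the exact implementation behind the slides: the filter count, FFT size, and O'Shaughnessy mel formula are assumed defaults.

```python
import numpy as np

def hz_to_mel(f):
    # common mel-scale formula (assumed; several variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=20, n_mfcc=13):
    """MFCC of one frame: |FFT| -> mel filterbank -> log -> DCT -> truncate."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))   # magnitude spectrum
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # triangular filters spaced evenly on the mel scale
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, ctr, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        fbank[i] = np.maximum(0.0, np.minimum((freqs - lo) / (ctr - lo),
                                              (hi - freqs) / (hi - ctr)))
    logmel = np.log(fbank @ spec + 1e-10)                   # "audspec"
    # DCT-II gives the cepstrum; keep only the first n_mfcc coefficients
    n_idx = np.arange(n_mels)[:, None]                      # cepstral index
    k_idx = np.arange(n_mels)[None, :]                      # mel-band index
    dct = np.cos(np.pi / n_mels * (k_idx + 0.5) * n_idx)
    return (dct @ logmel)[:n_mfcc]
```

Truncating the cepstrum is what discards fine spectral detail (pitch, harmonics) while keeping the broad envelope.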
Page 7
Representing Audio
• MFCCs are short-time features (25 ms)
• Sound is a "trajectory" in MFCC space
• Describe a whole track by its statistics
[Figure: audio → spectrogram → MFCC features → per-track MFCC covariance matrix]
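Summarizing a track by its feature statistics is straightforward; a minimal sketch that collapses a variable-length MFCC trajectory into one fixed-length vector (mean plus the upper triangle of the covariance, so the vector can feed a standard classifier):

```python
import numpy as np

def track_summary(mfccs):
    """mfccs: (n_frames x n_dims) trajectory. Returns per-dimension means
    concatenated with the upper triangle of the covariance matrix."""
    mu = mfccs.mean(axis=0)
    cov = np.cov(mfccs, rowvar=False)
    iu = np.triu_indices(cov.shape[0])      # covariance is symmetric
    return np.concatenate([mu, cov[iu]])
```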
Page 8
MFCCs for Music
• Can resynthesize MFCCs by shaping noise
gives an idea of the information retained
[Figure: "Freddie Freeloader" - original spectrogram vs. MFCC noise resynthesis]
Page 9
Ground Truth
• MajorMiner: free-text tags for 10 s clips
400 users, 7500 unique tags, 70,000 taggings
• Example: drum, bass, piano, jazz, slow, instrumental, saxophone, soft, quiet, club, ballad, smooth, soulful, easy_listening, swing, improvisation, 60s, cool, light
Mandel & Ellis ’08
Page 10
Classification
• MFCC features + human ground truth + standard machine learning tools
[Figure: tagging pipeline - sound → chop into 10 s blocks → MFCC (20 dims) + Δ + Δ² → mean μ (60 dims) and covariance Σ (399 samples) → standardize across train set → one-vs-all SVM (C, γ) → average precision vs. ground truth]
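The one-vs-all stage trains one binary classifier per tag. A minimal numpy sketch of that structure, substituting a ridge-regularized linear scorer for the slides' per-tag SVMs (the SVMs themselves, and the C/γ grid search, are not reproduced here):

```python
import numpy as np

class OneVsAllRidge:
    """One scorer per tag, fit jointly by regularized least squares on
    +/-1 targets. A stand-in for the per-tag SVMs in the pipeline."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def fit(self, X, Y):
        # X: (n_clips, n_feats) standardized features; Y: (n_clips, n_tags) in {0,1}
        Xb = np.hstack([X, np.ones((len(X), 1))])        # bias column
        T = 2.0 * Y - 1.0                                # targets in {-1, +1}
        A = Xb.T @ Xb + self.lam * np.eye(Xb.shape[1])
        self.W = np.linalg.solve(A, Xb.T @ T)            # one weight vector per tag
        return self

    def scores(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return Xb @ self.W                               # higher = more likely tagged
```

Ranking clips by each tag's score is what the average-precision evaluation measures.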
Page 11
Classification Results
• Classifiers trained from the top 50 tags
[Figure: results for the top 50 tags (drum, guitar, male, synth, rock, electronic, pop, vocal, bass, female, dance, techno, piano, jazz, hip_hop, rap, ...), with tag scores over time for "Soul Eyes"]
Page 12
3. Musical Content
• MFCCs (and speech recognizers) don't respect pitch
pitch is important ◦ visible in the spectrogram
• Pitch-related tasks: note transcription ◦ chord transcription ◦ matching by musical content ("cover songs")
[Figure: log-frequency spectrogram excerpt showing pitched notes]
Page 13
Note Transcription
Poliner & Ellis ‘05,’06,’07
Classification:
• N binary SVMs (one for each note)
• Independent frame-level classification on a 10 ms grid
• Distance to the class boundary as posterior
Temporal smoothing:
• Two-state (on/off) independent HMM for each note; parameters learned from training data
• Find the Viterbi sequence for each note
Training data and features:
• MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 mins of training audio)
• Normalized magnitude STFT
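The temporal-smoothing stage above is a two-state Viterbi decode per note. A minimal sketch in numpy; the self-transition probability `p_stay` is an assumed value, whereas the real system learns transitions from training data:

```python
import numpy as np

def smooth_posteriors(post, p_stay=0.9):
    """Two-state (off/on) Viterbi smoothing of frame-level note posteriors.
    post: array of P(note on) per frame. Returns 0/1 state per frame."""
    n = len(post)
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))
    logB = np.log(np.stack([1 - post, post], axis=1) + 1e-12)  # emission log-probs
    delta = logB[0].copy()
    back = np.zeros((n, 2), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + logA          # cand[i, j]: best path ending i then j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[t]
    states = np.zeros(n, dtype=int)
    states[-1] = delta.argmax()
    for t in range(n - 2, -1, -1):            # backtrace
        states[t] = back[t + 1, states[t + 1]]
    return states                              # 1 = note on
```

The sticky self-transitions suppress single-frame flips that independent frame classification produces.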
Page 14
Polyphonic Transcription
• Real music excerpts + ground truth
MIREX 2007
Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
[Figure: MIREX 2007 scores - frame-level precision, recall, accuracy, and error breakdown (Etot, Esubs, Emiss, Efa); note-level precision, recall, average F-measure, and average overlap]
Page 15
Chroma Features
• Idea: project onto 12 semitones regardless of octave
maintains the main "musical" distinction ◦ invariant to octave equivalence ◦ no need to worry about harmonics?

C(b) = Σ_{k=0}^{N} B(12·log2(k/k0) − b) · W(k) · |X[k]|

where W(k) is a weighting and B(·) selects the FFT bins whose pitch matches chroma b (mod 12)
[Figure: linear-frequency spectrogram vs. 12-bin chromagram over time]
Fujishima 1999
Warren et al. 2003
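The projection C(b) above can be sketched directly: map each FFT bin to its semitone distance from a reference, fold mod 12, and accumulate magnitude. A minimal version with a hard bin selector B and flat weighting W over an assumed [fmin, fmax] range:

```python
import numpy as np

def chroma(frame, sr, fmin=55.0, fmax=2000.0):
    """12-bin chroma of one frame: bin 0 is the pitch class of fmin
    (55 Hz = A). fmin/fmax and the flat weighting are assumed choices."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    use = (freqs >= fmin) & (freqs <= fmax)
    # 12*log2(f/f0) is the distance in semitones from the reference
    semis = 12.0 * np.log2(freqs[use] / fmin)
    bins = np.round(semis).astype(int) % 12       # fold to one octave
    c = np.zeros(12)
    np.add.at(c, bins, mag[use])                  # accumulate magnitude per class
    return c / (c.max() + 1e-10)
```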
Page 16
Chroma Resynthesis
• Chroma describes the notes in an octave... but not the octave
• Can resynthesize by presenting all octaves... with a smooth envelope
"Shepard tones" - octave is ambiguous ◦ endless sequence illusion
[Figure: 12 Shepard-tone spectra (level/dB vs. frequency) and spectrogram of a Shepard-tone resynthesis]
y_b(t) = Σ_{o=1}^{M} W(o + b/12) · cos(2^{o + b/12} · ω0 · t)
Ellis & Poliner 2007
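The formula above sums the same pitch class across octaves under a smooth spectral envelope W. A minimal sketch with assumed values for the base frequency, octave count, and a Gaussian envelope width:

```python
import numpy as np

def shepard_tone(b, dur=1.0, sr=8000, n_oct=6, f0=27.5):
    """Synthesize chroma bin b (0-11) as a Shepard tone: one partial per
    octave at f0 * 2^(o + b/12), weighted by a smooth envelope W over
    log-frequency position. All parameter values here are assumptions."""
    t = np.arange(int(dur * sr)) / sr
    y = np.zeros_like(t)
    for o in range(1, n_oct + 1):
        pos = o + b / 12.0                    # log-frequency (octave) position
        f = f0 * 2.0 ** pos
        if f >= sr / 2:                        # skip partials above Nyquist
            continue
        w = np.exp(-0.5 * ((pos - (n_oct + 1) / 2.0) / 1.5) ** 2)  # envelope W
        y += w * np.cos(2 * np.pi * f * t)
    return y / (np.abs(y).max() + 1e-10)
```

Because the envelope tapers the lowest and highest octaves equally, shifting b by 12 reproduces the same sound, which is what makes the octave ambiguous.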
Page 17
Chroma Example
• Simple Shepard-tone resynthesis
can also reimpose the broad spectrum from MFCCs
[Figure: "Freddie Freeloader" - spectrogram, chroma, and Shepard-tone resynthesis]
Page 18
Onset Detection
• Simplest approach: the energy envelope

e(n0) = Σ_{n=−W/2}^{W/2} w[n] · |x(n + n0)|²
[Figure: spectrograms and energy envelopes for "Harnoncourt" and "Maracatu" excerpts]
Bello et al. 2005
emphasis on high frequencies? e.g. weight each frame by its spectral centroid:

Σ_f f · |X(f,t)| / Σ_f |X(f,t)|
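The windowed energy e(n0) above is easy to compute on a hop grid; its half-wave-rectified log difference is then a crude onset-strength signal. A minimal sketch (window and hop sizes are assumed values):

```python
import numpy as np

def energy_envelope(x, win=256, hop=128):
    """Windowed energy e(n0) = sum_n w[n]|x(n+n0)|^2 every `hop` samples,
    plus rectified log-energy rises as a simple onset-strength signal."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    e = np.array([np.sum(w * x[i * hop:i * hop + win] ** 2)
                  for i in range(n_frames)])
    onset = np.maximum(0.0, np.diff(np.log(e + 1e-10)))   # rises in log-energy
    return e, onset
```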
Page 19
Tempo Estimation
• Beat tracking (may) need a global tempo period τ
otherwise the problem lacks "optimal substructure"
[Figure: onset strength envelope (part), raw autocorrelation, and windowed autocorrelation with primary and secondary tempo periods marked]
• Pick the peak in the onset-envelope autocorrelation after applying a "human preference" window ◦ check for sub-beat
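That procedure can be sketched directly: autocorrelate the onset envelope, weight lags by a preference window, and take the peak. The log-Gaussian window shape, its center (120 BPM), and width are assumed values in the spirit of the slide:

```python
import numpy as np

def estimate_tempo(onset_env, fps=100.0, pref_bpm=120.0, width=1.0):
    """Global tempo (BPM) from an onset-strength envelope sampled at
    fps frames/sec, via preference-windowed autocorrelation."""
    x = onset_env - onset_env.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]     # lags 0..N-1
    lags = np.arange(1, len(ac))
    bpm = 60.0 * fps / lags
    # "human preference" window: log-Gaussian around pref_bpm
    w = np.exp(-0.5 * (np.log2(bpm / pref_bpm) / width) ** 2)
    best = lags[np.argmax(w * ac[1:])]
    return 60.0 * fps / best
```

Without the window, sub-beat and half-tempo peaks in the raw autocorrelation can win; the window encodes the listener's bias toward moderate tempos.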
Page 20
Beat Tracking by Dynamic Programming
• To optimize
define C*(t) as the best score up to time t, then build up recursively (with traceback P(t))
the final beat sequence {t_i} is the best C* + back-trace
C({t_i}) = Σ_{i=1}^{N} O(t_i) + α · Σ_{i=2}^{N} F(t_i − t_{i−1}, τ_p)

C*(t) = O(t) + max_τ { α·F(t − τ, τ_p) + C*(τ) }
P(t) = argmax_τ { α·F(t − τ, τ_p) + C*(τ) }
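The recursion above translates almost line-for-line into code. A minimal sketch with F as a squared-log penalty on deviation from the ideal period τ_p; the α value, the τ search range of half to double the period, and the positive-score gate are assumed choices:

```python
import numpy as np

def beat_track(onset, tau_p, alpha=100.0):
    """DP beat tracking: C*(t) = O(t) + max_tau [alpha*F(t-tau, tau_p) + C*(tau)],
    with F(d, tau_p) = -(log(d/tau_p))^2. onset: onset strength per frame;
    tau_p: ideal beat period in frames. Returns beat frame indices."""
    n = len(onset)
    C = onset.astype(float).copy()           # best score ending with a beat at t
    P = -np.ones(n, dtype=int)               # traceback
    lo, hi = int(round(tau_p / 2)), int(round(2 * tau_p))
    for t in range(lo, n):
        taus = np.arange(max(0, t - hi), t - lo + 1)
        F = -(np.log((t - taus) / tau_p) ** 2)        # period-deviation penalty
        scores = alpha * F + C[taus]
        best = np.argmax(scores)
        if scores[best] > 0:                 # only chain onto worthwhile paths
            C[t] = onset[t] + scores[best]
            P[t] = taus[best]
    beats = [int(np.argmax(C))]              # backtrace from the best final score
    while P[beats[-1]] >= 0:
        beats.append(P[beats[-1]])
    return beats[::-1]
```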
Page 21
Beat Tracking Results
• Prefers drums & steady tempo
[Figure: beat tracking on "Soul Eyes" - onset envelope with beat times, and autocorrelation (period in 4 ms samples)]
Page 22
Beat-Synchronous Chroma
• Record one chroma vector per beat
compact representation of harmonies
[Figure: "Freddie Freeloader" - beat-synchronous chroma, Shepard-tone resynthesis, and resynthesis with the MFCC envelope reimposed]
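Making features beat-synchronous just means averaging frame-level vectors between consecutive beat times, a minimal sketch:

```python
import numpy as np

def beat_sync(feats, beat_frames):
    """Average frame-level features (n_frames x d) between consecutive
    beats: one vector per beat, a compact tempo-normalized representation."""
    bounds = list(beat_frames) + [len(feats)]
    return np.array([feats[bounds[i]:bounds[i + 1]].mean(axis=0)
                     for i in range(len(beat_frames))])
```

Because each column now spans exactly one beat, two performances of the same tune line up beat-for-beat regardless of tempo.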
Page 23
Chord Recognition
• Beat-synchronous chroma look like chords
can we transcribe them?
• Two approaches: manual templates (prior knowledge) ◦ learned models (from training data)
[Figure: beat-synchronous chromagram with chord tones annotated - C-E-G, B-D-G, A-C-E, A-C-D-F, ...]
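The first approach, manual templates, can be sketched with binary triad patterns and cosine matching. This covers only the 24 major/minor triads; jazz harmony would need a richer vocabulary, which is exactly the question the later slides raise:

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Binary chroma templates for the 12 major and 12 minor triads
    (the "manual templates / prior knowledge" approach)."""
    temps, names = [], []
    for root in range(12):
        for name, ints in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in ints]] = 1.0
            temps.append(t / np.linalg.norm(t))
            names.append(NOTES[root] + ':' + name)
    return np.array(temps), names

def label_chroma(c, temps, names):
    """Label one chroma vector with the best cosine-matching triad."""
    c = c / (np.linalg.norm(c) + 1e-10)
    return names[int(np.argmax(temps @ c))]
```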
Page 24
Chord Recognition System
• Analogous to speech recognition
Gaussian models of features for each chord ◦ Hidden Markov Models for chord transitions
[Figure: system diagram - audio → beat track → beat-synchronous chroma (100-1600 Hz and 25-400 Hz bandpass variants) → root normalize → train: count transitions (24x24 transition matrix) and fit 24 Gaussian models from labels; test: HMM Viterbi → chord labels. Example 24x24 transition matrix over the major chords C...B and minor chords c...b]
Sheh & Ellis 2003
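The "count transitions" training box has a one-function sketch: tally labeled chord-to-chord transitions into the 24x24 matrix and normalize rows (the add-one smoothing here is an assumption, not from the slides). The Viterbi decode itself is structurally the same as the two-state note smoother shown earlier, just over 24 states:

```python
import numpy as np

def transition_matrix(label_seqs, n_states=24):
    """Estimate HMM chord-transition probabilities by counting labeled
    transitions, with Laplace (add-one) smoothing. label_seqs: iterable
    of integer chord-label sequences. Returns row-stochastic matrix."""
    counts = np.ones((n_states, n_states))        # smoothing prior
    for seq in label_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```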
Page 25
Chord Recognition
• Often works:
• But only about 60% of the time
[Figure: "Let It Be" - audio, beat-synchronous chroma, ground-truth chords (C G A:min A:min/b7 F:maj7 F:maj6 C G F C ...), and recognized chords (C G a F C G F C G a)]
Page 26
What did the models learn?
• Chord model centers (means) indicate chord 'templates':
[Figure: "PCP_ROT family model means (train18)" - learned chroma means for root-normalized chord families DIM, DOM7, MAJ, MIN, MIN7, plotted over C D E F G A B (for C-root chords)]
Page 27
Chords for Jazz
• How many types?
[Figure: "Freddie Freeloader" - log-frequency spectrogram, beat-synchronous chroma, chord likelihoods with Viterbi path, and chord-based chroma reconstruction]
Page 28
Future Work
• Matching items: cover songs / standards ◦ similar instruments, styles
• Analyzing musical content: solo transcription & modeling ◦ musical structure
• And so much more...
[Figure: cover-song matching - beat-chroma matrices for "Between the Bars" by Elliott Smith and by Glenn Phillips (offset −17 beats, transposed 2 semitones), and their pointwise product]
Page 29
Summary
• Finding musical similarity at large scale
[Figure: overview - music audio → tempo and beat, low-level features, melody and notes, key and chords → classification and similarity, music structure discovery → browsing/discovery/production and modeling/generation/curiosity]