-
Polyphonic music information retrieval based on multi-label cascade classification system
presented by Zbigniew W. Ras
University of North Carolina, Charlotte, NC
College of Computing and Informatics
www.kdd.uncc.edu
http://www.mir.uncc.edu
-
Student: Wenxin Jiang
Advisor: Dr. Zbigniew W. Ras
Polyphonic music information retrieval based on multi-label cascade classification system
-
43 MIR systems: most are pitch-estimation-based melody and rhythm matching.
This presentation will focus on timbre estimation.
Survey of MIR - http://mirsystems.info/
-
Goal: Design and implement a system for automatic indexing of music by instruments (objective task) and emotions (subjective task).
MIRAI - Musical Database (mostly MUMS) [music pieces played by 59 different music instruments].
Outcome: Musical Database [music pieces indexed by instruments and emotions]. The resulting database will be represented as an FS-tree, guaranteeing efficient storage and retrieval.
-
Automatic Indexing of Music
What is needed? A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal ones), in addition to the standard MPEG-7 features. These signals are labeled by instruments and emotions, forming additional features called decision features.
Why is it needed? To build classifiers for automatic indexing of musical sound by instruments and emotions.
-
MIRAI - Cooperative Music Information Retrieval System based on Automatic Indexing
[System diagram: the user submits an instruments query; the Query Adapter consults the Indexed Audio Database (music objects, durations) and relaxes the query when the answer is empty.]
-
Raw data - signal representation
PCM (Pulse Code Modulation) - the most straightforward mechanism for storing audio. Analog audio is sampled and the individual samples are stored sequentially in binary format.
Binary file, PCM: sampling rate 44.1 kHz, 16 bits, 2,646,000 values/min.
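For concreteness, a minimal sketch (not from the thesis) of reading such 16-bit PCM data with Python's standard library; the file name is hypothetical and a mono WAV file is assumed:

import wave
import struct

with wave.open("sound.wav", "rb") as wav:    # hypothetical file, mono assumed
    n_frames = wav.getnframes()
    rate = wav.getframerate()                # e.g. 44100 Hz
    raw = wav.readframes(n_frames)

# Each 16-bit sample is a signed little-endian integer in [-32768, 32767].
samples = struct.unpack("<%dh" % n_frames, raw)
print(rate, len(samples))                    # 44100 samples/s -> 2,646,000 values/min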
-
The nature and types of raw data - challenges to applying KDD in MIR:

Data source      | Organization | Volume     | Type                  | Quality
Traditional data | Structured   | Modest     | Discrete, categorical | Clean
Audio data       | Unstructured | Very large | Continuous, numeric   | Noisy
-
Traditional pattern recognition: lower-level raw data (amplitude values at each sample point) -> feature extraction -> higher-level, manageable representations (Feature Database) -> classification, clustering, regression.
-
MPEG-7 features (extraction pipeline):
Signal -> Hamming window -> STFT (NFFT FFT points) -> power spectrum; signal envelope; fundamental frequency; harmonic peaks detection.
Derived descriptors: Spectral Centroid, Log Attack Time, Temporal Centroid, Instantaneous Harmonic Spectral Centroid, Instantaneous Harmonic Spectral Deviation, Instantaneous Harmonic Spectral Spread, Instantaneous Harmonic Spectral Variation.
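As a rough illustration of this pipeline, the sketch below computes the spectral centroid of one Hamming-windowed frame with NumPy; the frame array, sampling rate, and FFT size are assumed values, not the thesis configuration:

import numpy as np

def spectral_centroid(frame, sr=44100, nfft=4096):
    """Centroid of the power spectrum of one Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n=nfft)) ** 2  # power spectrum
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)     # weighted mean frequency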
-
Derived Database: extended MPEG-7 features, other features & new features

Feature                     | Durations | Sub-total | Total
Tristimulus parameters      | 4         | 10        | 40
Spectrum Centroid/Spread II | 4         | 2         | 8
Flux                        | 4         | 1         | 4
Roll-off                    | 4         | 1         | 4
Zero crossing               | 4         | 1         | 4
MFCC                        | 4         | 4x13      | 208
Spectrum Centroid/Spread I  | 3         | 2         | 6
Harmonic parameters         | 3         | 4         | 12
Flatness                    | 3         | 4x24      | 288
Durations                   | 3         | 1         | 3
Total                       |           |           | 577

Feature                    | Count
Harmonic Upper Limit       | 1
Harmonic Ratio             | 1
Basis Functions            | 190
Log Attack Time            | 1
Temporal Centroid          | 1
Spectral Centroid          | 1
Spectrum Centroid/Spread I | 2
Harmonic Parameters        | 4
Flatness                   | 24x4
Total                      | 297
-
Hierarchical Classification - Schema I
-
Schema II - Hornbostel-Sachs
[Tree diagram] Families: Aerophone, Chordophone, Membranophone, Idiophone. Subclasses: Free, Single Reed, Side, Lip Vibration, Whip. Example instruments: Alto Flute, Flute, C Trumpet, French Horn, Tuba, Oboe, Bassoon.
-
Schema III - Play Methods
[Tree diagram] Muted, Pizzicato, Bowed, Picked, Shaken, Blown. Example instruments: Piccolo, Flute, Bassoon, Alto Flute.
-
Database Table (slide: Xin Cynthia Zhang)

Obj | Classification attributes (CA1 ... CAn) | Hornbostel-Sachs                 | Play method
1   | 0.22 ... 0.28                           | [Aerophone, Side, Alto Flute]    | [Blown, Alto Flute]
2   | 0.31 ... 0.77                           | [Idiophone, Concussion, Bell]    | [Concussive, Bell]
3   | 0.05 ... 0.21                           | [Chordophone, Composite, Cello]  | [Bowed, Cello]
4   | 0.12 ... 0.11                           | [Chordophone, Composite, Violin] | [Martele, Violin]
-
Example
[Tree diagrams: classification attribute c with values C[1], C[2], where C[2] refines into C[2,1], C[2,2] (Level I, Level II); decision attribute d with values d[1], d[2], d[3], where d[3] refines into d[3,1], d[3,2].]

X  | a    | b    | c      | d
x1 | a[1] | b[2] | c[1]   | d[3]
x2 | a[1] | b[1] | c[1]   | d[3,1]
x3 | a[1] | b[2] | c[2,2] | d[1]
x4 | a[2] | b[2] | c[2]   | d[1]
-
Classification: 90% training, 10% testing; 10 folds. Hierarchical (Schema I) vs. non-hierarchical. Compared with different classifiers: J48 tree, Naive Bayesian.
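A rough scikit-learn analogue of this WEKA setup (DecisionTreeClassifier standing in for J48, GaussianNB for Naive Bayesian); the generated data is a stand-in for the feature database and instrument labels:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in data; in the thesis setup X would be the feature database
# and y the instrument (or family) labels.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

for name, clf in [("J48-like decision tree", DecisionTreeClassifier()),
                  ("Naive Bayesian", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=10)  # 10 folds: 90% train / 10% test
    print(name, round(scores.mean(), 4))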
-
Results of the non-hierarchical classification

Features | J48-Tree | Naive Bayesian
All      | 70.4923% | 68.5647%
MPEG     | 65.7256% | 56.9824%
-
Results of the hierarchical classification (Schema I) with MPEG-7 features

Node       | J48-Tree | Naive Bayesian
Family     | 86.434%  | 64.7041%
No-pitch   | 73.7299% | 66.2949%
Percussion | 85.2484% | 84.9379%
String     | 72.4272% | 61.8447%
Wind       | 67.8133% | 67.8133%
-
Results of the hierarchical classification (Schema I) with all features

Node       | J48-Tree | Naive Bayesian
Family     | 91.726%  | 72.6868%
No-pitch   | 77.943%  | 75.2169%
Percussion | 86.0465% | 88.3721%
String     | 76.669%  | 66.6021%
Woodwind   | 75.761%  | 78.0158%
-
Classification Results (J48-Tree; accuracy and recall with and without the new features)

Instrument    | Accuracy (new) | Recall (new) | Accuracy (without) | Recall (without)
Con-clarinet  | 100.0          | 60.0         | 83.3               | 100.0
Electric bass | 100.0          | 73.3         | 93.3               | 93.3
Flute         | 100.0          | 50.0         | 60.0               | 75.0
Steel drums   | 100.0          | 66.7         | 50.0               | 66.7
Tuba          | 100.0          | 100.0        | 100.0              | 85.7
Vibraphone    | 87.5           | 93.3         | 78.6               | 73.3
Cello         | 87.0           | 95.2         | 86.7               | 61.9
Violin        | 84.0           | 77.8         | 66.7               | 59.3
Piccolo       | 83.3           | 50.0         | 60.0               | 60.0
Marimba       | 82.4           | 87.5         | 83.3               | 93.8
C trumpet     | 81.3           | 76.5         | 87.5               | 82.4
Alto flute    | 80.0           | 80.0         | 80.0               | 80.0
English horn  | 80.0           | 57.1         | 42.9               | 42.9
-
How to handle polyphonic sounds?
- Single-label classification based on sound separation
- Multi-labeled classifiers
Sound Separation Flowchart: polyphonic sound -> segmentation -> get frame -> feature extraction -> classifier -> get instrument.
Problem with separation: information loss during the signal subtraction.
-
This presentation will focus on timbre estimation in polyphonic sounds and on designing multi-labeled classifiers.
Timbre-relevant descriptors: Spectrum Centroid/Spread, Spectrum Flatness Band Coefficients, Harmonic Peaks, Mel frequency cepstral coefficients (MFCC), Tristimulus.
-
Sub-pattern of a single instrument in the mixture - feature extraction.
-
Timbre estimation based on a multi-label classifier: segmentation into 40 ms frames -> acoustic descriptors -> feature extraction -> features -> classifier. For each frame, the classifier outputs a ranked candidate list:

Instrument  | Confidence
Candidate 1 | 70%
Candidate 2 | 50%
...         | ...
Candidate N | 10%
-
Flowchart of the multi-label classification system: polyphonic sound -> get frame -> feature extraction -> perform multiple classifying (multiple labels) -> finish estimation for all frames -> voting process based on context -> get final winners.
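A sketch of what this per-frame estimation plus voting could look like; the `frames` iterable, the `score_fn` interface returning per-instrument confidences, and the defaults (the 0.4 threshold appears on the next slide) are assumptions, not the system's actual code:

from collections import Counter

def classify_frames(frames, score_fn, threshold=0.4, n_labels=2):
    """score_fn(frame) -> {instrument: confidence}; returns final winners."""
    votes = Counter()
    for frame in frames:
        scores = score_fn(frame)
        # keep up to n_labels candidates whose confidence passes the threshold
        candidates = sorted(scores.items(), key=lambda kv: -kv[1])[:n_labels]
        for instrument, conf in candidates:
            if conf >= threshold:
                votes[instrument] += 1        # context-based voting over frames
    return [inst for inst, _ in votes.most_common(n_labels)]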
-
Timbre estimation results based on different methods.
[Instruments - 45; Training Data (TD) - 2917 single-instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size 1 sec, frame size 120 ms, hop size 40 ms; MFCC extracted from each frame (following MPEG-7).] A threshold of 0.4 controls the total number of estimations for each index window.

Experiment # | Pitch based | Sound separation | N(labels) max | Recall | Precision | F-score
1            | Yes         | Yes              | 1             | 54.55% | 39.2%     | 45.60%
2            | Yes         | Yes              | 2             | 61.20% | 38.1%     | 46.96%
3            | Yes         | No               | 2             | 64.28% | 44.8%     | 52.81%
4            | Yes         | No               | 4             | 67.69% | 37.9%     | 48.60%
5            | Yes         | No               | 8             | 68.3%  | 36.9%     | 47.91%
[Chart: Single Label vs. Multiple Label - recall: 54.55% (1 label) vs. 61.20% (2 labels).]
[Chart: Separation vs. Non-Separation - recall: 61.20%, 64.28%, 67.69%. The embedded chart data also lists a variant with pitch-based = No, no separation, N(labels) = 4, reaching 70.13% recall, and the following recognition rates:

Experiment # | Description                                               | Recognition rate
1            | Feature-based and separation + Decision Tree (n=1)        | 36.49%
2            | Feature-based and separation + Decision Tree (n=2)        | 48.65%
3            | Spectrum Match + KNN (k=1; n=2)                           | 79.41%
4            | Spectrum Match + KNN (k=5; n=2)                           | 82.43%
5            | Spectrum Match + KNN (k=5; n=2) without percussion instr. | 87.10%]
-
Polyphonic sound (window) -> get frame -> feature extraction -> classifiers -> multiple labels.
Compressed representations of the signal: Harmonic Peaks, Mel Frequency Cepstral Coefficients (MFCC), Spectral Flatness, etc.
Irrelevant information (inharmonic frequencies or partials) is removed.
Violin and viola have similar MFCC patterns; the same holds for double bass and guitar. It is difficult to distinguish them in polyphonic sounds.
More information from the raw signal is needed.
-
Short-term power spectrum - a low-level representation of the signal (calculated by STFT). The power spectrum patterns of flute and trombone can be seen in the mixture. Spectrum slice: 0.12 seconds long.
-
Experiment:
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set: power spectra from 3323 frames, extracted by STFT from 26 single-instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Testing set: fifty-two audio files mixed (using Sound Forge) from pairs of these 26 single-instrument sounds.
Classifiers: (1) KNN with Euclidean distance (spectrum-match-based classification); (2) Decision Tree (multi-label classification based on previously extracted features).
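A minimal sketch of classifier (1), spectrum match via KNN with Euclidean distance; `train_spectra`, `train_labels`, and the test frame's power spectrum are assumed NumPy arrays:

import numpy as np
from collections import Counter

def knn_spectrum_match(test_spectrum, train_spectra, train_labels, k=5, n=2):
    """Return the n most common labels among the k nearest training spectra."""
    dists = np.linalg.norm(train_spectra - test_spectrum, axis=1)  # Euclidean
    nearest = np.argsort(dists)[:k]
    counts = Counter(train_labels[i] for i in nearest)
    return [label for label, _ in counts.most_common(n)]           # n labels per frame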
-
Timbre pattern match based on power spectrum (n - number of labels assigned to each frame; k - parameter for KNN)

Experiment # | Description                                               | Recall | Precision | F-score
1            | Feature-based + Decision Tree (n=2)                       | 64.28% | 44.8%     | 52.81%
2            | Spectrum Match + KNN (k=1; n=2)                           | 79.41% | 50.8%     | 61.96%
3            | Spectrum Match + KNN (k=5; n=2)                           | 82.43% | 45.8%     | 58.88%
4            | Spectrum Match + KNN (k=5; n=2) without percussion instr. | 87.1%  |           |
-
Hierarchical structure [tree diagram with example leaves: Flute, English Horn, Violin, Viola].
-
Instrument granularity - classifiers are trained at each level of the hierarchical (Hornbostel/Sachs) tree.
-
Modules of the cascade classifier for single-instrument estimation (Hornbostel/Sachs). [Diagram: pitch 3B path; module accuracies 91.80%, 96.02%, 98.94%, 95.00%.]
-
New Experiment:
Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).
Training set: 2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.
Classifiers (WEKA): (1) KNN with Euclidean distance (spectrum-match-based classification); (2) Decision Tree (classification based on previously extracted features).
Confidence: the ratio of correctly classified instances over the total number of instances.
-
Classification on different feature groups (confidence per classifier)

Group | Feature description                                           | KNN    | Decision Tree
A     | 33 spectrum flatness band coefficients                        | 99.23% | 94.69%
B     | 13 MFCC coefficients                                          | 98.19% | 93.57%
C     | 28 harmonic peaks                                             | 86.60% | 91.29%
D     | 38 spectrum projection coefficients                           | 47.45% | 31.81%
E     | Log spectral centroid, spread, flux, roll-off, zero crossing | 99.34% | 99.77%
-
Feature and classifier selection at each level of the cascade system (root: KNN + band coefficients)

Node        | Feature           | Classifier
chordophone | Band coefficients | KNN
aerophone   | MFCC coefficients | KNN
idiophone   | Band coefficients | KNN

Node              | Feature           | Classifier
chrd_composite    | Band coefficients | KNN
aero_double-reed  | MFCC coefficients | KNN
aero_lip-vibrated | MFCC coefficients | KNN
aero_side         | MFCC coefficients | KNN
aero_single-reed  | Band coefficients | Decision Tree
idio_struck       | Band coefficients | KNN
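The node-wise selection above amounts to a dispatch table plus a descent loop; a sketch under assumed interfaces (the `classify` callback and the `tree` children map are illustrative, not the thesis implementation):

# NODE_CONFIG mirrors the selection tables: each node of the hierarchy
# gets its own (feature set, classifier) pair.
NODE_CONFIG = {
    "root":             ("band_coefficients", "knn"),
    "chordophone":      ("band_coefficients", "knn"),
    "aerophone":        ("mfcc_coefficients", "knn"),
    "idiophone":        ("band_coefficients", "knn"),
    "aero_single-reed": ("band_coefficients", "decision_tree"),
}

def cascade_classify(sound, tree, classify, node="root"):
    """Descend the hierarchy; each node re-classifies with its own pair."""
    feature, model = NODE_CONFIG.get(node, ("band_coefficients", "knn"))
    child = classify(sound, feature, model, node)  # predicted subclass at this node
    if child in tree:                              # internal node: keep descending
        return cascade_classify(sound, tree, classify, node=child)
    return child                                   # leaf: instrument estimate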
-
Classification on the combination of different feature groups. [Charts: classification based on KNN; classification based on Decision Tree.]
-
From those two experiments, we see that:
The KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC. The decision tree works better with harmonic peaks and statistical features.
Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (e.g., adding harmonic peaks to other feature groups).
-
Feature and classifier selection at each level of Cascade System
- Hornbostel/Sachs hierarchical tree
Feature and classifier selection at top level
-
Feature and classifier selection at second level
-
Feature and classifier selection at third level
-
Feature and Classifier Selection

Selection table for Level 1:
Node        | Feature               | Classifier
chordophone | Flatness coefficients | KNN
aerophone   | MFCC coefficients     | KNN
idiophone   | Flatness coefficients | KNN

Selection table for Level 2:
Node              | Feature               | Classifier
chrd_composite    | Flatness coefficients | KNN
aero_double-reed  | MFCC coefficients     | KNN
aero_lip-vibrated | MFCC coefficients     | KNN
aero_side         | MFCC coefficients     | KNN
aero_single-reed  | Flatness coefficients | Decision Tree
idio_struck       | Flatness coefficients | KNN
-
HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS
Common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), and Ward's method.
Most common distance functions: Euclidean, Manhattan, Canberra (examines the sum of a series of fractional differences between coordinates of a pair of objects), Pearson correlation coefficient (PCC, which measures the degree of association between objects), and Spearman's rank correlation coefficient.
Clustering algorithm: HCLUST (agglomerative hierarchical clustering), R package.
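A SciPy sketch of the same idea as R's hclust: agglomerative clustering with Ward linkage over a Pearson-correlation distance, the best-scoring combination in the results below; the data matrix is a stand-in and the cluster count is illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(2884, 33)            # stand-in for the frame feature matrix
dist = pdist(X, metric="correlation")   # 1 - Pearson correlation per pair
# Ward formally assumes Euclidean distances; like R's hclust, SciPy will
# run it on any dissimilarity, which is what the experiments below do.
Z = linkage(dist, method="ward")
cluster_ids = fcluster(Z, t=37, criterion="maxclust")  # e.g. w = 37 clusters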
-
Testing datasets (MFCC, flatness coefficients, harmonic peaks):
The middle C pitch group contains 46 different musical sound objects. Each sound object is segmented into multiple 0.12 s frames and each frame is stored as an instance in the testing dataset. There are 2884 frames in total.
We extract three different features (MFCC, flatness coefficients, and harmonic peaks) from those sound objects. Each feature produces one dataset of 2884 frames for clustering.
Clustering: when the algorithm finishes the clustering process, a cluster ID is assigned to each single frame.
-
Contingency table derived from the clustering result

             | Cluster 1 | ... | Cluster j | ... | Cluster n
Instrument 1 | X11       | ... | X1j       | ... | X1n
Instrument i | Xi1       | ... | Xij       | ... | Xin
Instrument n | Xn1       | ... | Xnj       | ... | Xnn
-
Evaluation result of the Hclust algorithm (the 14 results with the highest score among 126 experiments; w - number of clusters, acc - average clustering accuracy of all the instruments, score = acc * w)

Feature               | Method   | Metric    | acc   | w  | Score
Flatness coefficients | ward     | pearson   | 87.3% | 37 | 32.30
Flatness coefficients | ward     | euclidean | 85.8% | 37 | 31.74
Flatness coefficients | ward     | manhattan | 85.6% | 36 | 30.83
MFCC                  | ward     | kendall   | 81.0% | 36 | 29.18
MFCC                  | ward     | pearson   | 83.0% | 35 | 29.05
Flatness coefficients | ward     | kendall   | 82.9% | 35 | 29.03
MFCC                  | ward     | euclidean | 80.5% | 35 | 28.17
MFCC                  | ward     | manhattan | 80.1% | 35 | 28.04
MFCC                  | ward     | spearman  | 81.3% | 34 | 27.63
Flatness coefficients | ward     | spearman  | 83.7% | 33 | 27.62
Flatness coefficients | ward     | maximum   | 86.1% | 32 | 27.56
MFCC                  | ward     | maximum   | 79.8% | 34 | 27.12
Flatness coefficients | mcquitty | euclidean | 88.9% | 30 | 26.67
MFCC                  | average  | manhattan | 87.3% | 30 | 26.20
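One plausible reading of this evaluation (an assumption, since the slide's symbols were lost in extraction): per-instrument accuracy as the dominant-cluster share of that instrument's frames in the contingency table, averaged over instruments and multiplied by w:

import numpy as np

def clustering_score(table):
    """table: contingency matrix X[i][j] (instruments x clusters)."""
    table = np.asarray(table, dtype=float)
    per_instrument = table.max(axis=1) / table.sum(axis=1)  # best cluster per instrument
    acc = per_instrument.mean()
    w = table.shape[1]                                      # number of clusters
    return acc, acc * w                                     # score = acc * w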
-
Clustering result from the Hclust algorithm with the Ward linkage method and the Pearson distance measure; flatness coefficients are used as the selected feature. C trumpet and Bach trumpet are clustered in the same group, while ctrumpet_harmonStemOut is clustered in a group of its own instead of merging with C trumpet. Bassoon appears as the sibling of the regular French horn, while the muted French horn is clustered in a different group together with English horn and oboe.
-
Comparison between non-cascade classification and cascade classification with different hierarchical schemas

Experiment | Classification method | Description      | Recall | Precision | F-score
1          | Non-cascade           | Feature-based    | 64.3%  | 44.8%     | 52.81%
2          | Non-cascade           | Spectrum-Match   | 79.4%  | 50.8%     | 61.96%
3          | Cascade               | Hornbostel/Sachs | 75.0%  | 43.5%     | 55.06%
4          | Cascade               | Play method      | 77.8%  | 53.6%     | 63.47%
5          | Cascade               | Machine learned  | 87.5%  | 62.3%     | 72.78%
-
We evaluate the classification system on mixture sounds that contain two single-instrument sounds.
We also create 49 polyphonic sounds by randomly selecting three different single-instrument sounds and mixing them together.
We then test those three-instrument mixtures with the five classification methods (experiments 2 to 6) described in the previous two-instrument mixture experiments. Single-label classification based on the sound-separation method is also tested on the mixtures (experiment 1). KNN (k=3) is used as the classifier for each experiment.
-
Classification results of 3-instrument mixtures with different algorithms

Exp # | Classifier                | Method                                    | Recall | Precision | F-score
1     | Non-cascade               | Single-label based on sound separation    | 31.48% | 43.06%    | 36.37%
2     | Non-cascade               | Feature-based multi-label classification  | 69.44% | 58.64%    | 63.59%
3     | Non-cascade               | Spectrum-Match multi-label classification | 85.51% | 55.04%    | 66.97%
4     | Cascade (Hornbostel)      | Multi-label classification                | 64.49% | 63.10%    | 63.79%
5     | Cascade (play method)     | Multi-label classification                | 66.67% | 55.25%    | 60.43%
6     | Cascade (machine learned) | Multi-label classification                | 63.77% | 69.67%    | 66.59%
-
User enters a query. When the user is not satisfied and enters a new query, the Action Rules System is invoked.
-
Action Rule
An action rule is defined as a term [(ω) ∧ (α → β)] ⇒ (φ → ψ), where ω is a conjunction of fixed condition features shared by both groups, (α → β) represents proposed changes in values of flexible features, and (φ → ψ) is the desired effect of the action.

Information system:
A  | B  | D
a1 | b2 | d1
a2 | b2 |
a2 | b2 | d2
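Reading the toy information system above (with B = b2 fixed in both groups): changing A from a1 to a2 goes together with D moving from d1 to d2, so one plausible action rule here is [(B, b2) ∧ (A, a1 → a2)] ⇒ (D, d1 → d2).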
-
"Action Rules Discovery without pre-existing classification
rules", Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008
Conference, in Akron, Ohio, LNAI 5306, Springer, 2008, 181-190
http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf
-
Auto indexing system for musical instruments.
Intelligent query answering system for music instruments.
WWW.MIR.UNCC.EDU

Speaker notes:
Spectrum Centroid describes the gravity center of the spectrum. Spectrum Spread describes the deviation of the power spectrum with respect to the gravity center in a frame. Like Spectrum Centroid, it is an economical way to describe the shape of the power spectrum.
Spectrum Flatness Band Coefficients describe the flatness property of the power spectrum within a frequency bin.
Projection coefficients project the spectrum from the high-dimensional spectrum space to a low-dimensional space with compact, salient statistical information.
Harmonic Peaks is a sequence of local peaks of harmonics in each frame.
Mel frequency cepstral coefficients describe the spectrum according to the human perception system in the mel scale. They are computed by grouping the STFT points of each frame into a set of coefficients.
Tristimulus: the concept of tristimulus originates in the world of colour, describing the way three primary colours can be mixed together to create a given colour. By analogy, the musical tristimulus measures the mixture of harmonics in a given sound, grouped into three sections. The parameters describe the ratio of the energy of three groups of harmonic partials to the total energy of harmonic partials. The following groups are used: the fundamental; the medium partials (2, 3, and 4); and the higher partials. The first tristimulus measures the relative weight of the first harmonic; the second measures the relative weight of the 2nd, 3rd, and 4th harmonics taken together; and the third measures the relative weight of all the remaining harmonics.
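As a sketch, the energy-based reading of those three ratios (with `h` an assumed array of harmonic-partial amplitudes, h[0] the fundamental):

import numpy as np

def tristimulus(h):
    """Return (T1, T2, T3) from harmonic-partial amplitudes."""
    p = np.asarray(h, dtype=float) ** 2   # partial energies
    total = p.sum()
    t1 = p[0] / total                     # fundamental
    t2 = p[1:4].sum() / total             # partials 2-4
    t3 = p[4:].sum() / total              # remaining partials
    return t1, t2, t3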
*** As the figure shows, the power spectrum patterns of a single flute and a single trombone can still be identified in the mixture spectrum without blurring into each other (as marked in the figure). Therefore, we get a clear picture of the distinct pattern of each single instrument when we observe each spectrum slice of the polyphonic sound wave. This explains why the human hearing system can still accurately recognize the two different instruments in the mixture instead of misclassifying them as some other instruments. However, those distinct timbre-relevant characteristics of each instrument, though preserved in the signal, cannot be observed in the previous feature space.
From the results shown in the table, we draw the following conclusions: 1. Using a multiple-label classifier for each frame yields better results than using a single-label classifier. 2. Spectrum-based KNN classification improves the recognition rate of polyphonic sounds significantly. 3. Some percussion instruments (such as vibraphone and marimba) are not suitable for spectrum-based classification, but most instruments generating harmonic sounds work well with this new method.
Energy describes the total energy of harmonic partials.
According to the previous discussion and conclusions, in order to get the highest accuracy for the ultimate estimation at the bottom level of the hierarchical tree, the cascade system must be able to pick the feature-classifier pair from the available feature pool and classifier pool in such a way that the system achieves the best estimation at each level of cascade classification. To get such information, we need to deduce this knowledge from the current training database by combining each feature from the feature pool (A, B, C, D) with each classifier from the classifier pool (Naive Bayes, KNN, Decision Tree), and running the classification experiments in WEKA on the subset which corresponds to each node in the hierarchical tree used by the cascade classification system.