

SPICE: Self-supervised Pitch Estimation

Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, Mihajlo Velimirović

Abstract—We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate (relative) pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, yet it does not require access to large labeled datasets.

Index Terms—audio pitch estimation, unsupervised learning, convolutional neural networks.

I. INTRODUCTION

Pitch represents the perceptual property of sound that allows ordering based on frequency, i.e., distinguishing between high and low sounds. For example, our auditory system is able to recognize a melody by tracking the relative pitch differences along time. Pitch is often confused with the fundamental frequency (f0), i.e., the frequency of the lowest harmonic. However, the former is a perceptual property, while the latter is a physical property of the underlying audio signal. Despite this important difference, outside the field of psychoacoustics pitch and fundamental frequency are often used interchangeably, and we will not make an explicit distinction within the scope of this paper. A comprehensive treatment of the psychoacoustic aspects of pitch perception is given in [1].

Pitch estimation in monophonic audio received a great deal of attention over the past decades, due to its central importance in several domains, ranging from music information retrieval to speech analysis. Traditionally, simple signal processing pipelines were proposed, working either in the time domain [2], [3], [4], [5], in the frequency domain [6] or both [7], [8], often followed by post-processing algorithms to smooth the pitch trajectories [9], [10].

Until recently, machine learning methods had not been able to outperform hand-crafted signal processing pipelines targeting pitch estimation. This was due to the lack of annotated data, which is particularly tedious and difficult to obtain at the temporal and frequency resolution required to train fully supervised models.

All authors are with Google Research. A shortened version of this manuscript is under review at ICASSP 2020.

To overcome these limitations, a synthetically generated dataset was proposed in [11], obtained by re-synthesizing monophonic music tracks while setting the fundamental frequency to the target ground truth. Using this training data, the CREPE algorithm [12] was able to achieve state-of-the-art results when evaluated on the same dataset, outperforming signal processing baselines, especially under noisy conditions.

In this paper we address the problem of lack of annotated data from a different angle. Specifically, we rely on self-supervision, i.e., we define an auxiliary task (also known as a pretext task) which can be learned in a completely unsupervised way. To devise this task, we started from the observation that for humans, including professional musicians, it is typically much easier to estimate relative pitch, related to the frequency interval between two notes, than absolute pitch, related to the actual fundamental frequency [13]. Therefore, we design SPICE (Self-supervised PItCh Estimation) to solve a similar task. More precisely, our network architecture consists of a convolutional encoder which produces a single scalar embedding. We aim at learning a model that linearly maps this scalar value to pitch, when the latter is expressed in a logarithmic scale, i.e., in units of semitones of an equally tempered chromatic scale. To do this, we feed two versions of the same signal to the encoder, one being a pitch shifted version of the other by a random but known amount. Then, we devise a loss function that forces the difference between the scalar embeddings to be proportional to the known difference in pitch. For convenience, we perform pitch shifting in the domain defined by the constant-Q transform, because this corresponds to a simple translation along the log-spaced frequency axis. Upon convergence, the model is able to estimate relative pitch. To translate this output to an absolute pitch scale we apply a simple calibration step against ground truth data. Since we only need to estimate a single scalar offset, a very small annotated dataset can be used for this purpose.

Another important aspect of pitch estimation is determining whether the underlying signal is voiced or unvoiced. Instead of relying on handcrafted thresholding mechanisms, we augment the model in such a way that it can learn the level of confidence of the pitch estimation. Namely, we add a simple fully connected layer that receives as input the penultimate layer of the encoder and produces a second scalar value which is trained to match the pitch estimation error.

As an illustration, Figure 1 shows the CQT frames of one of the evaluation datasets (MIR-1k [14]) that are considered voiced, sorted by the pitch estimated by SPICE.

In summary, this paper makes the following key contributions:

• We propose a self-supervised (relative) pitch estimation model, which can be trained without having access to any labelled dataset.



Fig. 1: CQT frames extracted from the MIR-1k dataset, re-ordered based on the pitch estimated by the SPICE algorithm (in red); the frequency axis spans 55 Hz (A1) to 7040 Hz (A8).

• We incorporate a self-supervised mechanism to estimate the confidence of the pitch estimation, which can be directly used for voicing detection.

• We evaluate our model against two publicly available monophonic datasets and show that in both cases we outperform handcrafted baselines, while matching the level of accuracy attained by CREPE, despite having no access to ground truth labels.

• We train and evaluate our model also in noisy conditions, where background music is present in addition to monophonic singing, and show that also in this case we match the level of accuracy obtained by CREPE.

The rest of this paper is organized as follows. Section II contrasts the proposed method against the existing literature. Section III illustrates the proposed method, which is evaluated in Section IV. Conclusions and future remarks are discussed in Section V.

II. RELATED WORK

Pitch estimation: Traditional pitch estimation algorithms are based on hand-crafted signal processing pipelines, working in the time and/or frequency domain. The most common time-domain methods are based on the analysis of local maxima of the auto-correlation function (ACF) [2]. These approaches are known to be prone to octave errors, because the peaks of the ACF repeat at different lags. Therefore, several methods were introduced to be more robust to such errors, including, e.g., the PRAAT [3] and RAPT [4] algorithms. An alternative approach is pursued by the YIN algorithm [5], which looks for the local minima of the Normalized Mean Difference Function (NMDF), to avoid octave errors caused by signal amplitude changes. Different frequency-domain methods were also proposed, based, e.g., on spectral peak picking [15] or template matching with the spectrum of a sawtooth waveform [6]. Other approaches combine both time-domain and frequency-domain processing, like the Aurora algorithm [7] and the nearly defect-free F0 estimation algorithm [8]. Comparative analyses including most of the aforementioned approaches have been conducted on speech [16], [17], singing voices [18] and musical instruments [19]. Machine learning models for pitch estimation in speech were proposed in [20], [21]. The method in [20] first extracts hand-crafted spectral domain features, and then adopts a neural network (either a multi-layer perceptron or a recurrent neural network) to compute the estimated pitch. In [21], the consensus of other pitch trackers is used to obtain ground truth, and a multi-layer perceptron classifier is trained on the principal components of the autocorrelations of subbands from an auditory filterbank. More recently the CREPE [12] model was proposed, an end-to-end convolutional neural network which consumes audio directly in the time domain. The network is trained in a fully supervised fashion, minimizing the cross-entropy loss between the ground truth pitch annotations and the output of the model. In our experiments, we compare our results with CREPE, which is the current state-of-the-art.

Pitch confidence estimation: Most of the aforementioned methods also provide a voiced/unvoiced decision, often based on heuristic thresholds applied to hand-crafted features. However, the confidence of the estimated pitch in the voiced case is seldom provided. A few exceptions are CREPE [12], which produces a confidence score computed from the activations of the last layer of the model, and [22], which directly addresses this problem by training a neural network based on hand-crafted features to estimate the confidence of the estimated pitch. In contrast, in our work we explicitly augment the proposed model with a head aimed at estimating confidence in a fully unsupervised way.

Pitch tracking and polyphonic audio: Often, post-processing is applied to raw pitch estimates to smoothly track pitch contours over time. For example, [23] applies Kalman filtering to smooth the output of a hybrid spectro-temporal autocorrelation method, while the pYIN algorithm [9] builds on top of YIN, by applying Viterbi decoding of a sequence of soft pitch candidates. A similar smoothing algorithm is also used in the publicly released version of CREPE [12]. Pitch extraction in the case of polyphonic audio remains an open research problem [24]. In this case, pitch tracking is even more important to be able to distinguish the different melody lines [10]. A machine learning model targeting the estimation of multiple fundamental frequencies, melody, vocal and bass line was recently proposed in [25].

Self-supervised learning: The widespread success of fully supervised models was stimulated by the availability of annotated datasets. In those cases in which labels are scarce or simply not available, self-supervised learning has emerged as a promising approach for pre-training deep convolutional networks, both for vision [26], [27], [28] and for audio-related tasks [29], [30], [31]. Somewhat related to our paper are those methods that try to use self-supervision to obtain point disparities between pairs of images [32], where shifts in the spatial domain play the role of shifts in the log-frequency domain.


Fig. 2: SPICE model architecture (two branches, each with an encoder, decoder, pitch head and confidence head, coupled through the pitch-shift error).

III. METHODS

Audio frontend

The proposed pitch estimation model receives as input an audio track of arbitrary length and produces as output a time series of estimated pitch frequencies, together with an indication of the confidence of the estimates. The latter is used to discriminate between unvoiced frames, in which pitch is not well defined, and voiced frames.

To better illustrate our method, let us first introduce a continuous-time model of an ideal harmonic signal, that is:

$x_t = \sum_{k=1}^{K} a_k \sin(2\pi k f_0 t + \phi_k),$   (1)

where $f_0$ denotes the fundamental frequency and $f_k = k f_0$, $k = 2, \ldots, K$, its higher order harmonics. The modulus of the Fourier transform is given by

$|X_f| = \frac{1}{2} \sum_{k=1}^{K} a_k \left[\delta(f - k f_0) + \delta(f + k f_0)\right],$   (2)

where $\delta$ is the Dirac delta function. Therefore, the modulus consists of spectral peaks at integer multiples of the fundamental frequency $f_0$. When the signal is pitch-shifted by a factor of $\alpha$, these spectral peaks move to $\tilde{f}_k = \alpha f_k$. If we apply a logarithmic transformation to the frequency axis, $\log \tilde{f}_k = \log\alpha + \log f_k$, i.e., pitch-shifting results in a simple translation in the log-frequency domain.

This very simple and well known result is at the core of the proposed model. Namely, we preprocess the input audio track with a frontend that computes the constant-Q transform (CQT).

In the CQT domain, frequency bins are logarithmically spaced, as the center frequencies obey the following relationship:

$f_k = f_{\text{base}} \, 2^{\frac{k-1}{B}}, \quad k = 1, \ldots, F_{\max},$   (3)

where $f_{\text{base}}$ is the frequency of the lowest frequency bin, $B$ is the number of bins per octave, and $F_{\max}$ is the number of frequency bins. Given an input audio track, the CQT produces a matrix $X$ of size $T \times F_{\max}$, where $T$ depends on the selected hop length. Note that the frequency bins are logarithmically spaced. Therefore, if the input audio track is pitch-shifted by a factor $\alpha$, this results in a translation of $\Delta k = B \log_2 \alpha$ bins in the CQT domain.
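As a quick numerical check of this property (an illustrative Python snippet, not part of the SPICE implementation), a pitch shift by a factor $\alpha$ moves every spectral peak by the same number of CQT bins, $\Delta k = B \log_2 \alpha$, independently of $f_0$:

    import numpy as np

    B = 24            # bins per octave (one half-semitone per bin)
    f_base = 32.70    # Hz, lowest CQT bin (C1 in the configuration of Section IV)

    alpha = 2.0 ** (3 / 12)          # pitch shift of +3 semitones
    delta_k = B * np.log2(alpha)     # = 6.0 bins, independently of the pitch
    print(delta_k)

    # A component at frequency f0 falls (up to rounding) in bin B*log2(f0/f_base);
    # after the shift it falls exactly delta_k bins higher, whatever f0 is.
    for f0 in (110.0, 220.0, 523.25):
        bin_before = B * np.log2(f0 / f_base)
        bin_after = B * np.log2(alpha * f0 / f_base)
        assert np.isclose(bin_after - bin_before, delta_k)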

Pitch estimation

The proposed model architecture is illustrated in Figure 2. Starting from the observation above, the model computes the modulus of the CQT, $|X|$, and from each temporal frame $t = 1, \ldots, T$ (where $T$ is equal to the batch size during training) it extracts two random slices $x_{t,1}, x_{t,2} \in \mathbb{R}^F$, spanning the range of CQT bins $[k_{t,i}, k_{t,i} + F]$, $i = 1, 2$, where $F$ is the number of CQT bins in the slice and the offsets are sampled from a uniform distribution, i.e., $k_{t,i} \sim U(k_{\min}, k_{\max})$. Then, each vector is fed to the same encoder to produce a single scalar $y_{t,i} = \text{Enc}(x_{t,i}) \in \mathbb{R}$. The encoder is a neural network with $L$ convolutional layers followed by two fully-connected layers. Further details about the model architecture are provided in Section IV.
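The slicing step can be pictured with a short hypothetical sketch; the names F, k_min and k_max follow the text above, while the stand-in CQT frame and the sampling code are assumptions made for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    F, F_max = 128, 190          # slice width and total number of CQT bins
    k_min, k_max = 8, 16         # admissible slice offsets (values from Section IV)

    frame = rng.random(F_max)    # stand-in for one CQT magnitude frame of |X|

    # Draw two random offsets and extract the two slices x_{t,1}, x_{t,2}.
    k1, k2 = rng.integers(k_min, k_max + 1, size=2)
    x1, x2 = frame[k1:k1 + F], frame[k2:k2 + F]

    # The known offset difference k1 - k2 (in bins) is the only "label" used by
    # the self-supervised pitch loss defined next.
    print(k1 - k2, x1.shape, x2.shape)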

We design our main loss in such a way that $y_{t,i}$ is encouraged to encode pitch. First, we define the relative pitch error as

$e_t = |(y_{t,1} - y_{t,2}) - \sigma(k_{t,1} - k_{t,2})|.$   (4)


Then, the loss is defined as the Huber norm of the pitch error, that is:

$L_{\text{pitch}} = \frac{1}{T} \sum_t h(e_t),$   (5)

where:

$h(x) = \begin{cases} \frac{x^2}{2}, & |x| \le \tau \\ \frac{\tau^2}{2} + \tau(|x| - \tau), & \text{otherwise.} \end{cases}$   (6)

The pitch difference scaling factor $\sigma$ is adjusted in such a way that $y_t \in [0, 1]$ when pitch is in the range $[f_{\min}, f_{\max}]$, namely:

$\sigma = \frac{1}{B\left[\log_2(f_{\max}/f_{\min})\right]}.$   (7)

The values of $f_{\min}$ and $f_{\max}$ are determined based on the range of pitch frequencies spanned by the training set. In our experiments we found that the Huber loss makes the model less sensitive to the presence of unvoiced frames in the training dataset, for which the relative pitch error can be large, as pitch is not well defined in this case.
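The pitch loss of Eqs. (4)-(7) fits in a few lines of NumPy; the following is a minimal sketch with made-up encoder outputs, not the training code used in the experiments:

    import numpy as np

    def huber(x, tau):
        # Eq. (6): quadratic inside [-tau, tau], linear outside.
        return np.where(np.abs(x) <= tau,
                        0.5 * x ** 2,
                        0.5 * tau ** 2 + tau * (np.abs(x) - tau))

    def pitch_loss(y1, y2, k1, k2, sigma, tau):
        e = np.abs((y1 - y2) - sigma * (k1 - k2))   # Eq. (4)
        return np.mean(huber(e, tau))               # Eq. (5)

    B, f_min, f_max = 24, 60.0, 1000.0              # illustrative pitch range
    sigma = 1.0 / (B * np.log2(f_max / f_min))      # Eq. (7)
    tau = 0.25 * sigma                              # threshold used in Section IV

    y1, y2 = np.array([0.41, 0.52]), np.array([0.38, 0.47])   # encoder outputs
    k1, k2 = np.array([10, 9]), np.array([12, 14])            # slice offsets
    print(pitch_loss(y1, y2, k1, k2, sigma, tau))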

In addition to $L_{\text{pitch}}$, we also use the following reconstruction loss

$L_{\text{recon}} = \frac{1}{T} \sum_t \|x_{t,1} - \hat{x}_{t,1}\|_2^2 + \|x_{t,2} - \hat{x}_{t,2}\|_2^2,$   (8)

where $\hat{x}_{t,i}$, $i = 1, 2$, is a reconstruction of the input frame obtained by feeding $y_{t,i}$ into a decoder, $\hat{x}_{t,i} = \text{Dec}(y_{t,i})$. Therefore, the overall loss is defined as:

$L = w_{\text{pitch}} L_{\text{pitch}} + w_{\text{recon}} L_{\text{recon}},$   (9)

where $w_{\text{pitch}}$ and $w_{\text{recon}}$ are scalar weights that determine the relative importance assigned to the two loss components.
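A matching sketch of the reconstruction loss (8) and of the total loss (9); the decoder itself is omitted and the default weights are those reported later in Section IV:

    import numpy as np

    def recon_loss(x1, xhat1, x2, xhat2):
        # Eq. (8): squared L2 error of each slice against its decoded version,
        # averaged over the batch of frames.
        return np.mean(np.sum((x1 - xhat1) ** 2, axis=-1)
                       + np.sum((x2 - xhat2) ** 2, axis=-1))

    def total_loss(l_pitch, l_recon, w_pitch=1e4, w_recon=1.0):
        # Eq. (9): weighted combination of the two terms.
        return w_pitch * l_pitch + w_recon * l_recon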

Given the way it is designed, the proposed model can only estimate relative pitch differences. The absolute pitch of an input frame is obtained by applying an affine mapping:

$p_{0,t} = b + s \cdot y_t = b + s \cdot \text{Enc}(x_t) \quad \text{[semitones]},$   (10)

which depends on two parameters. We consider two cases: estimating only the intercept $b$, and setting $s = 1/\sigma$; estimating both the intercept $b$ and the slope $s$. This is the only place where our method requires access to ground truth labels. However, we can observe that: i) only very few labelled samples are needed, as only one or two parameters need to be estimated; ii) synthetically generated labelled samples could be used for this purpose; iii) some applications (e.g., matching melodies played at different keys) might require only relative pitch. Section IV provides further details on the robustness to the calibration process.

Note that pitch in (10) is expressed in semitones and it can be converted to frequency (in Hz) by:

$f_{0,t} = f_{\text{base}} \, 2^{p_{0,t}/12} \quad \text{[Hz]}.$   (11)
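Applying Eqs. (10)-(11) is a two-step conversion; a minimal sketch with made-up calibration constants b and s:

    import numpy as np

    def to_semitones(y, b, s):
        return b + s * y                       # Eq. (10), in semitones

    def to_hz(p, f_base=32.70):
        return f_base * 2.0 ** (p / 12.0)      # Eq. (11), semitones above f_base

    y = 0.53                                   # example pitch-head output
    b, s = 25.6, 72.0                          # illustrative values only
    print(to_hz(to_semitones(y, b, s)))        # estimated pitch in Hz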

Confidence estimation

In addition to the estimated pitch $p_{0,t}$, we design our model such that it also produces a confidence level $c_t \in [0, 1]$. Indeed, when the input audio is voiced we expect to produce high confidence estimates, while when it is unvoiced pitch is not well defined and the output confidence should be low.

To achieve this, we design the encoder architecture to have two heads on top of the convolutional layers, as illustrated in Figure 2. The first head consists of two fully-connected layers and produces the pitch estimate $y_t$. The second head consists of a single fully-connected layer and produces the confidence level $c_t$. To train the latter, we add the following loss:

$L_{\text{conf}} = \frac{1}{T} \sum_t |(1 - c_{t,1}) - e_t/\sigma|^2 + |(1 - c_{t,2}) - e_t/\sigma|^2.$   (12)

This way the model will produce high confidence $c_t \sim 1$ when the model is able to correctly estimate the pitch difference between the two input slices. At the same time, given that our primary goal is to accurately estimate pitch, during the backpropagation step we stop the gradients so that $L_{\text{conf}}$ only influences the training of the confidence head and does not affect the other layers of the encoder architecture.
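A NumPy sketch of the confidence loss (12); in the real model the gradient of this loss is stopped before the shared encoder layers, while here only its value is computed on made-up numbers:

    import numpy as np

    def confidence_loss(c1, c2, e, sigma):
        # Eq. (12): (1 - c) is trained to match the normalized pitch error e/sigma.
        return np.mean(np.abs((1 - c1) - e / sigma) ** 2
                       + np.abs((1 - c2) - e / sigma) ** 2)

    e = np.array([0.002, 0.010])     # relative pitch errors from Eq. (4)
    c1 = np.array([0.95, 0.30])      # confidence outputs for the first slices
    c2 = np.array([0.90, 0.35])      # confidence outputs for the second slices
    sigma = 1.0 / 72.0               # illustrative scaling, see Eq. (7)
    print(confidence_loss(c1, c2, e, sigma))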

Handling background music

The accuracy of pitch estimation can be severely affected when dealing with noisy conditions. These emerge, for example, when the singing voice is superimposed over background music. In this case, we are faced with polyphonic audio and we want the model to focus only on the singing voice source. To deal with these conditions, we introduce a data augmentation step in our training setup. More specifically, we mix the clean singing voice signal with the corresponding instrumental backing track at different signal-to-noise ratios (SNRs). Interestingly, we found that simply augmenting the training data was not sufficient to achieve a good level of robustness. Instead, we also modified the definition of the loss functions as follows. Let $x^c_{t,i}$ and $x^n_{t,i}$ denote, respectively, the CQT of the clean and noisy input samples. Similarly, $y^c_{t,i}$ and $y^n_{t,i}$ denote the corresponding outputs of the encoder. The pitch error loss is modified by averaging four different variants of the error, that is:

$e^{pq}_t = |(y^p_{t,1} - y^q_{t,2}) - \sigma(k_{t,1} - k_{t,2})|, \quad p, q \in \{c, n\},$   (13)

$L_{\text{pitch}} = \frac{1}{4T} \sum_t \sum_{p,q \in \{c,n\}} h(e^{pq}_t).$   (14)

The reconstruction loss is also modified, so that the decoder is asked to reconstruct the clean samples only. That is:

$L_{\text{recon}} = \frac{1}{T} \sum_t \|x^c_{t,1} - \hat{x}_{t,1}\|_2^2 + \|x^c_{t,2} - \hat{x}_{t,2}\|_2^2.$   (15)

The rationale behind this approach is that the encoder is induced to represent in its output only the information relative to the clean input audio samples, thus learning to denoise the input by separating the singing voice from noise.
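One standard way to implement the mixing step is to scale the backing track to a target SNR before computing the CQT; the paper does not detail the exact procedure, so the sketch below is an assumption:

    import numpy as np

    def mix_at_snr(voice, backing, snr_db):
        # Scale the backing track so that the voice-to-backing power ratio
        # equals the requested SNR (in dB), then sum the two signals.
        p_voice = np.mean(voice ** 2)
        p_back = np.mean(backing ** 2) + 1e-12
        gain = np.sqrt(p_voice / (p_back * 10.0 ** (snr_db / 10.0)))
        return voice + gain * backing

    sr = 16000
    t = np.arange(sr) / sr
    voice = 0.5 * np.sin(2 * np.pi * 220.0 * t)                    # toy voice
    backing = 0.3 * np.random.default_rng(0).standard_normal(sr)   # toy backing
    noisy = mix_at_snr(voice, backing, snr_db=10.0)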


TABLE I: Dataset specifications.

Dataset        | # of tracks | Length (min / max / total) | # of frames (voiced / total)
MIR-1k         | 1000        | 3 s / 12 s / 133 min       | 175k / 215k
MDB-stem-synth | 230         | 2 s / 565 s / 418 min      | 784k / 1.75M
SingingVoices  | 88          | 25 s / 298 s / 185 min     | 194k / 348k

Fig. 3: Range of pitch values covered by the different datasets: (a) MIR-1k, (b) MDB-stem-synth, (c) SingingVoices (pitch axis from 27.5 Hz, A0, to 1760 Hz, A6).

IV. EXPERIMENTS

Model parameters

First we provide the details of the default parameters used in our model. The input audio track is sampled at 16 kHz. The CQT frontend is parametrized to use $B = 24$ bins per octave, so as to achieve a resolution equal to one half-semitone per bin. We set $f_{\text{base}}$ equal to the frequency of the note C1, i.e., $f_{\text{base}} \simeq 32.70$ Hz, and we compute up to $F_{\max} = 190$ CQT bins, i.e., to cover the range of frequencies up to Nyquist. The hop length is set equal to 512 samples, i.e., one CQT frame every 32 ms. During training, we extract slices of $F = 128$ CQT bins, setting $k_{\min} = 8$ and $k_{\max} = 16$. The Huber threshold is set to $\tau = 0.25\sigma$ and the loss weights equal to, respectively, $w_{\text{pitch}} = 10^4$ and $w_{\text{recon}} = 1$. We increased the weight of the pitch-shift loss to $w_{\text{pitch}} = 3 \cdot 10^5$ when training with background music.
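For concreteness, the following sketch computes a CQT with the stated parameters, using librosa as a stand-in frontend (the paper does not say which CQT implementation was used):

    import numpy as np
    import librosa

    sr, hop = 16000, 512                      # 16 kHz audio, one frame every 32 ms
    audio = np.zeros(sr, dtype=np.float32)    # placeholder: one second of silence

    C = librosa.cqt(audio, sr=sr, hop_length=hop,
                    fmin=librosa.note_to_hz("C1"),   # ~32.70 Hz
                    n_bins=190, bins_per_octave=24)
    X = np.abs(C).T                           # T x F_max matrix of CQT magnitudes
    print(X.shape)                            # roughly (32, 190) for one second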

The encoder receives as input a 128-dimensional vector corresponding to a sliced CQT frame and produces as output two scalars representing, respectively, pitch and confidence. The model architecture consists of $L = 6$ convolutional layers. We use filters of size 3 and stride equal to 1. The number of channels is equal to $d \cdot [1, 2, 4, 8, 8, 8]$, where $d = 64$ for the encoder and $d = 32$ for the decoder. Each convolution is followed by batch normalization and a ReLU non-linearity. Max-pooling of size 3 and stride 2 is applied at the output of each layer. Hence, after flattening the output of the last convolutional layer we obtain an embedding of size 1024 elements. This is fed into two different heads. The pitch estimation head consists of two fully-connected layers with, respectively, 48 and 1 units. The confidence head consists of a single fully-connected layer with 1 output unit. The total number of parameters of the encoder is equal to 2.38M. Note that we do not apply any form of temporal smoothing to the output of the model.
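A hedged PyTorch re-implementation sketch of an encoder with these dimensions is shown below. The paper does not specify the padding scheme, the activation between the two pitch-head layers, or any output non-linearity on the confidence head; the choices made here are assumptions, picked so that the flattened embedding has the stated size of 1024.

    import torch
    import torch.nn as nn

    class SpiceLikeEncoder(nn.Module):
        def __init__(self, d=64):
            super().__init__()
            chans = [1] + [d * m for m in (1, 2, 4, 8, 8, 8)]
            blocks = []
            for cin, cout in zip(chans[:-1], chans[1:]):
                blocks += [nn.Conv1d(cin, cout, kernel_size=3, stride=1, padding=1),
                           nn.BatchNorm1d(cout),
                           nn.ReLU(),
                           nn.MaxPool1d(kernel_size=3, stride=2, padding=1)]
            self.trunk = nn.Sequential(*blocks)
            # Pitch head: two fully-connected layers (48 and 1 units); the ReLU
            # in between is an assumption.
            self.pitch_head = nn.Sequential(nn.Flatten(),
                                            nn.Linear(2 * d * 8, 48),
                                            nn.ReLU(),
                                            nn.Linear(48, 1))
            # Confidence head: a single fully-connected layer; the sigmoid that
            # maps the output to [0, 1] is an assumption.
            self.conf_head = nn.Sequential(nn.Flatten(),
                                           nn.Linear(2 * d * 8, 1),
                                           nn.Sigmoid())

        def forward(self, x):                  # x: (batch, 128) sliced CQT frame
            h = self.trunk(x.unsqueeze(1))     # (batch, 8*d, 2): 1024 when flattened
            return self.pitch_head(h), self.conf_head(h)

    enc = SpiceLikeEncoder()
    y, c = enc(torch.randn(4, 128))
    print(y.shape, c.shape)                    # torch.Size([4, 1]) for both heads

With six pooling stages of stride 2, the 128-bin input is reduced to 2 positions with 8d = 512 channels, which is consistent with the 1024-element embedding mentioned above.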

The model is trained using Adam with default hyperparameters and a learning rate equal to $10^{-4}$. The batch size is set to 64. During training, the CQT frames of the input audio tracks are shuffled, so that the frames in a batch are likely to come from different tracks.

Datasets

We use three datasets in our experiments, whose details are summarized in Table I. The MIR-1k [14] dataset contains 1000 audio tracks with people singing Chinese pop songs. The dataset is annotated with pitch at a granularity of 10 ms and it also contains voiced/unvoiced frame annotations. It comes with two stereo channels representing, respectively, the singing voice and the accompaniment music. The MDB-stem-synth dataset [11] includes re-synthesized monophonic music played with a variety of musical instruments. This dataset was used to train the CREPE model in [12]. In this case, pitch annotations are available at a granularity of 29 ms. Given the mismatch of the sampling period of the pitch annotations across datasets, we resample the pitch time-series with a period equal to the hop length of the CQT, i.e., 32 ms. In addition to these publicly available datasets, we also collected the SingingVoices dataset in-house.

Fig. 4: Raw Pitch Accuracy: cumulative density function of the pitch error (in semitones) for SPICE, CREPE tiny, CREPE full and SWIPE, on (a) MIR-1k and (b) MDB-stem-synth.


TABLE II: Evaluation results.

Model      | # params | Trained on    | MIR-1k RPA (CI 95%) | MIR-1k VRR | MDB-stem-synth RPA (CI 95%)
SWIPE      | -        | -             | 86.6%               | -          | 90.7%
CREPE tiny | 487k     | many          | 90.7%               | 88.9%      | 93.1%
CREPE full | 22.2M    | many          | 90.1%               | 84.6%      | 92.7%
SPICE      | 2.38M    | SingingVoices | 90.6% ± 0.1%        | 86.8%      | 89.1% ± 0.4%
SPICE      | 180k     | SingingVoices | 90.4% ± 0.1%        | 90.5%      | 87.9% ± 0.9%

Fig. 5: Pitch error on the MIR-1k dataset, conditional on ground truth pitch and model confidence: (a) SPICE, (b) CREPE full. Each panel shows the absolute pitch error (in semitones, with a 0.5-semitone reference line) and the confidence as a function of the ground truth pitch, grouped by confidence decile.

The SingingVoices dataset contains 88 audio tracks of people singing a variety of pop songs, for a total of 185 minutes.

Figure 3 illustrates the empirical distribution of pitch values. For SingingVoices, there are no ground-truth pitch labels, so we used the output of CREPE (configured with full model capacity and enabling Viterbi smoothing) as a surrogate. We observe that MDB-stem-synth spans a significantly larger range of frequencies (approx. 5 octaves) than MIR-1k and SingingVoices (approx. 3 octaves).

We trained SPICE using either SingingVoices or MIR-1k and used both MIR-1k (singing voice channel only) and MDB-stem-synth to evaluate models in clean conditions. To handle background music, we repeated training on MIR-1k, but this time applying data augmentation by mixing in backing tracks with an SNR uniformly sampled from [-5 dB, 25 dB]. For the evaluation, we used the MIR-1k dataset, mixing the available backing tracks at different levels of SNR, namely 20 dB, 10 dB and 0 dB. In all cases, we apply data augmentation during training, by pitch-shifting the input audio tracks by an amount in semitones uniformly sampled from the set {−12, 0, +12}.

Baselines

We compare our results against two baselines, namely SWIPE [6] and CREPE [12]. SWIPE estimates the pitch as the fundamental frequency of the sawtooth waveform whose spectrum best matches the spectrum of the input signal. CREPE is a data-driven method which was trained in a fully-supervised fashion on a mix of different datasets, including MDB-stem-synth [11], MIR-1k [14], Bach10 [33], RWC-Synth [9], MedleyDB [34] and NSynth [35]. We consider two variants of the CREPE model, by using model capacity tiny or full, and we disabled Viterbi smoothing, so as to evaluate the accuracy achieved on individual frames. These models have, respectively, 487k and 22.2M parameters. CREPE also produces a confidence score for each input frame.

Evaluation measures

We use the evaluation measures defined in [24] to evaluate and compare our model against the baselines. The raw pitch accuracy (RPA) is defined as the percentage of voiced frames for which the pitch error is less than 0.5 semitones. To assess the robustness of the model accuracy to the initialization, we also report the interval $\pm 2\sigma$, where $\sigma$ is the sample standard deviation obtained by collecting the RPA values computed using the last 10 checkpoints of 3 separate replicas. For CREPE we do not report such an interval, because we simply run the model provided by the CREPE authors on each of the evaluation datasets. The voicing recall rate (VRR) is the proportion of voiced frames in the ground truth that are recognized as voiced by the algorithm.


Fig. 6: Pitch error on the MDB-stem-synth dataset, conditional on ground truth pitch and model confidence: (a) SPICE, (b) CREPE full (same layout as Figure 5).

TABLE III: Evaluation results on noisy datasets.

Model      | # params | Trained on     | MIR-1k clean | 20 dB        | 10 dB        | 0 dB
SWIPE      | -        | -              | 86.6%        | 84.3%        | 69.5%        | 27.2%
CREPE tiny | 487k     | many           | 90.7%        | 90.6%        | 88.8%        | 76.1%
CREPE full | 22.2M    | many           | 90.1%        | 90.4%        | 89.7%        | 80.8%
SPICE      | 2.38M    | MIR-1k + augm. | 91.4% ± 0.1% | 91.2% ± 0.1% | 90.0% ± 0.1% | 81.6% ± 0.6%

Fig. 7: Voicing Detection - ROC on MIR-1k: true positive rate vs. false positive rate for SPICE, CREPE tiny and CREPE full.

We report the VRR at a target voicing false alarm rate equal to 10%. Note that this measure is provided only for MIR-1k, since MDB-stem-synth is a synthetic dataset and voicing can be determined based on simple silence thresholding.
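A bare-bones sketch of the RPA computation described above (the mir_eval library offers reference implementations of these melody metrics):

    import numpy as np

    def raw_pitch_accuracy(f_true, f_est, voiced, threshold=0.5):
        # Pitch error in semitones, evaluated on voiced frames only.
        err = 12.0 * np.abs(np.log2(f_est[voiced] / f_true[voiced]))
        return np.mean(err < threshold)

    f_true = np.array([220.0, 440.0, 330.0, 0.0])     # 0.0 marks an unvoiced frame
    f_est = np.array([221.0, 470.0, 331.0, 100.0])
    voiced = np.array([True, True, True, False])
    print(raw_pitch_accuracy(f_true, f_est, voiced))  # 2 of the 3 voiced frames hit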

Main results

The main results of the paper are summarized in Table II and Figure 4. On the MIR-1k dataset, SPICE outperforms SWIPE, while achieving the same accuracy as CREPE in terms of RPA (90.7%), despite the fact that it was trained in an unsupervised fashion and CREPE used MIR-1k as one of the training datasets. Figure 5 illustrates a finer-grained comparison between SPICE and CREPE (full model), measuring the average absolute pitch error for different values of the ground truth pitch frequency, conditioned on the level of confidence (expressed in deciles) produced by the respective algorithm. When excluding the decile with low confidence, we observe that above 110 Hz SPICE achieves an average error of around 0.2-0.3 semitones, while CREPE achieves around 0.1-0.5 semitones.

We repeated our analysis on the MDB-stem-synth dataset. In this case the dataset has remarkably different characteristics from the SingingVoices dataset used for the unsupervised training of SPICE, in terms of both frequency extension (Figure 3) and timbre (singing vs. musical instruments). This explains why in this case the gap between SPICE and CREPE is wider (88.9% vs. 93.1%). Figure 6 repeats the fine-grained analysis for the MDB-stem-synth dataset, illustrating larger errors at both ends of the frequency range. We also performed a thorough error analysis, trying to understand in which cases CREPE and SWIPE outperform SPICE. We discovered that most of these errors occur in the presence of a harmonic signal in which most of the energy is concentrated above the fifth-order harmonics, i.e., in the case of musical instruments characterized by a spectral timbre considerably different from that of the singing voice.


Fig. 8: Calibration of the pitch head output on (a) MIR-1k and (b) MDB-stem-synth: ground truth pitch (55 Hz to 880 Hz) as a function of the pitch head output in [0, 1].

Fig. 9: Robustness of the RPA on MIR-1k when varying the number of frames used for calibration.

We also evaluated the quality of the confidence estimation by comparing the voicing recall rate (VRR) of SPICE and CREPE. Results in Table II show that SPICE achieves results comparable with CREPE (86.8%, i.e., between CREPE tiny and CREPE full), while being more accurate in the more interesting low false-positive rate regime (see Figure 7).

In order to obtain a smaller, thus faster, variant of the SPICE model, we used the MorphNet [36] algorithm. Specifically, we added to the training loss (9) a regularizer which constrains the number of floating point operations (FLOPs), using $\lambda = 10^{-7}$ as regularization hyper-parameter. MorphNet produces as output a slimmed network architecture, which has 180k parameters, thus more than 10 times smaller than the original model. After training this model from scratch, we were still able to achieve a level of performance on MIR-1k comparable to the larger SPICE model, as reported in Table II.

Table III shows the results obtained when evaluating the models in the presence of background music. We observe that SPICE is able to achieve a level of accuracy very similar to CREPE across different values of SNR.

Calibration

The key tenet of SPICE is that it is an unsupervised method. However, as discussed in Section III, the raw output of the pitch head can only represent relative pitch. To obtain absolute pitch, the intercept b (and, optionally, the slope s) in (10) needs to be estimated with the use of ground truth labels. Figure 8 shows the fitted model for both MIR-1k and MDB-stem-synth as a dashed red line. We qualitatively observe that the intercept is stable across datasets. In order to quantitatively estimate how many labels are needed to robustly estimate b, we repeated 100 bootstrap iterations. At each iteration we resample at random just a few frames from a dataset, fit b (and s) using these samples, and compute the RPA. Figure 9 reports the results of this experiment on MIR-1k (error bars represent the 2.5% and 97.5% quantiles). We observe that using as few as 200 frames is generally enough to obtain stable results. For MIR-1k this represents about 0.09% of the dataset. Note that these samples can also be obtained by generating synthetic harmonic signals, thus eliminating the need for manual annotations.
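A hypothetical sketch of this calibration step, fitting b (and optionally s) by least squares on a small set of labelled frames; the synthetic data below only illustrates the mechanics:

    import numpy as np

    def calibrate(y, p, fit_slope=False, s_fixed=72.0):
        # y: pitch-head outputs; p: ground-truth pitch in semitones above f_base.
        if fit_slope:
            A = np.stack([np.ones_like(y), y], axis=1)
            b, s = np.linalg.lstsq(A, p, rcond=None)[0]
        else:
            s = s_fixed                 # e.g. s = 1/sigma, with sigma from Eq. (7)
            b = np.mean(p - s * y)
        return b, s

    rng = np.random.default_rng(0)
    y = rng.uniform(0.2, 0.8, size=200)   # ~200 frames suffice, as reported above
    p = 20.0 + 72.0 * y + rng.normal(0.0, 0.2, size=200)
    print(calibrate(y, p, fit_slope=True))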

V. CONCLUSION

In this paper we propose SPICE, a self-supervised pitch estimation algorithm for monophonic audio. The SPICE model is trained to recognize relative pitch without access to labelled data and it can also be used to estimate absolute pitch by calibrating the model using just a few labelled examples. Our experimental results show that SPICE is competitive with CREPE, a fully-supervised model that was recently proposed in the literature, despite having no access to ground truth labels.

ACKNOWLEDGMENT

We would like to thank Alexandra Gherghina, Dan Ellis, and Dick Lyon for their help with and feedback on this work.

REFERENCES

[1] R. F. Lyon, Human and Machine Hearing. Cambridge University Press, May 2017. [Online]. Available: https://www.cambridge.org/core/product/identifier/9781139051699/type/book
[2] J. Dubnowski, R. Schafer, and L. Rabiner, "Real-time digital hardware pitch detector," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 1, pp. 2–8, Feb. 1976. [Online]. Available: http://ieeexplore.ieee.org/document/1162765/
[3] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," IFA Proceedings 17, pp. 97–110, 1993. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.218.4956
[4] D. Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)," in Speech Coding and Synthesis. Elsevier, 1995, pp. 495–518. [Online]. Available: https://www.semanticscholar.org/paper/A-Robust-Algorithm-for-Pitch-Tracking-(-RAPT-)-Talkin/f3f1d8960a6fde5bc4dc15acbfdce21cfc9b7452
[5] A. De Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002. [Online]. Available: http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf
[6] A. Camacho and J. G. Harris, "A sawtooth waveform inspired pitch estimator for speech and music," The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1638–1652, Sep. 2008. [Online]. Available: http://asa.scitation.org/doi/10.1121/1.2951592
[7] T. Ramabadran, A. Sorin, M. McLaughlin, D. Chazan, D. Pearce, and R. Hoory, "The ETSI extended distributed speech recognition (DSR) standards: server-side speech reconstruction," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1. IEEE, 2004, pp. I–53–6. [Online]. Available: http://ieeexplore.ieee.org/document/1325920/
[8] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi, and T. Irino, "Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT," in Interspeech, 2005, pp. 537–540. [Online]. Available: https://www.semanticscholar.org/paper/Nearly-defect-free-F0-trajectory-extraction-for-on-Kawahara-Cheveigné/a2ea9c2c4fd250ae7051d6e76e74e950bd3bdbb2
[9] M. Mauch and S. Dixon, "pYIN: A fundamental frequency estimator using probabilistic threshold distributions," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, May 2014, pp. 659–663. [Online]. Available: http://ieeexplore.ieee.org/document/6853678/
[10] J. Salamon and E. Gómez, "Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012. [Online]. Available: http://www.music-ir.org/mirex/wiki/Audio
[11] J. Salamon, R. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello, "An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets," in 18th International Society for Music Information Retrieval Conference, 2017. [Online]. Available: http://mtg.upf.edu/node/3830
[12] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A Convolutional Representation for Pitch Estimation," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Feb. 2018. [Online]. Available: http://arxiv.org/abs/1802.06182
[13] N. Ziv and S. Radin, "Absolute and relative pitch: Global versus local processing of chords," Advances in Cognitive Psychology, vol. 10, no. 1, pp. 15–25, 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/24855499 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3996714
[14] C.-L. Hsu and J.-S. R. Jang, "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset," IEEE Transactions on Audio, Speech, and Language Processing, 2009. [Online]. Available: https://sites.google.com/site/unvoicedsoundseparation/mir-1k
[15] P. Martin, "Comparison of pitch detection by cepstrum and spectral comb analysis," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1982, pp. 180–183.
[16] D. Jouvet and Y. Laprie, "Performance Analysis of Several Pitch Detection Algorithms on Simulated and Real Noisy Speech Data," in EUSIPCO, European Signal Processing Conference, 2017. [Online]. Available: http://www.speech.kth.se/snack/
[17] S. Strömbergsson, "Today's most frequently used F0 estimation methods, and their accuracy in estimating male and female pitch in clean speech," in Interspeech, 2016. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2016-240
[18] O. Babacan, T. Drugman, N. Henrich, and T. Dutoit, "A comparative study of pitch extraction algorithms on a large variety of singing sounds," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2013, pp. 1–5. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00923967
[19] A. Von Dem Knesebeck and U. Zölzer, "Comparison of pitch trackers for real-time guitar effects," in Digital Audio Effects (DAFx), 2010. [Online]. Available: http://dafx10.iem.at/papers/VonDemKnesebeckZoelzer_DAFx10_P102.pdf
[20] K. Han and D. Wang, "Neural Network Based Pitch Tracking in Very Noisy Speech," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, 2014. [Online]. Available: http://www.ieee.org/publications_standards/publications/rights/index.html
[21] B. S. Lee and D. P. W. Ellis, "Noise robust pitch tracking by subband autocorrelation classification," in 13th Annual Conference of the International Speech Communication Association, INTERSPEECH 2012, vol. 1, pp. 706–709, 2012.
[22] B. Deng, D. Jouvet, Y. Laprie, I. Steiner, and A. Sini, "Towards Confidence Measures on Fundamental Frequency Estimations," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2017. [Online]. Available: https://hal.inria.fr/hal-01493168
[23] B. T. Bönninghoff, R. M. Nickel, S. Zeiler, and D. Kolossa, "Unsupervised Classification of Voiced Speech and Pitch Tracking Using Forward-Backward Kalman Filtering," in Speech Communication; 12. ITG Symposium, 2016, pp. 46–50.
[24] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard, "Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges," IEEE Signal Processing Magazine, 2014. [Online]. Available: https://pdfs.semanticscholar.org/688c/84aa99261f36dc78438bf436d6b518fd69d5.pdf
[25] R. M. Bittner, B. McFee, and J. P. Bello, "Multitask Learning for Fundamental Frequency Estimation in Music," Tech. Rep., 2018. [Online]. Available: https://arxiv.org/pdf/1809.00381.pdf
[26] M. Noroozi and P. Favaro, "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles," in European Conference on Computer Vision (ECCV), Mar. 2016, pp. 69–84. [Online]. Available: http://arxiv.org/abs/1603.09246
[27] D. Wei, J. Lim, A. Zisserman, and W. T. Freeman, "Learning and Using the Arrow of Time," in Computer Vision and Pattern Recognition Conference (CVPR), 2018, pp. 8052–8060. [Online]. Available: http://people.csail.mit.edu/donglai/paper/aot18.pdf
[28] A. van den Oord, Y. Li, and O. Vinyals, "Representation Learning with Contrastive Predictive Coding," Tech. Rep., 2019. [Online]. Available: https://arxiv.org/pdf/1807.03748.pdf
[29] A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, "Unsupervised Learning of Semantic Audio Representations," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Nov. 2018, pp. 126–130. [Online]. Available: http://arxiv.org/abs/1711.02209
[30] M. Tagliasacchi, B. Gfeller, F. d. C. Quitry, and D. Roblek, "Self-supervised audio representation learning for mobile devices," Tech. Rep., May 2019. [Online]. Available: http://arxiv.org/abs/1905.11796
[31] M. Meyer, J. Beutel, and L. Thiele, "Unsupervised Feature Learning for Audio Analysis," in Workshop track - ICLR, 2017. [Online]. Available: http://people.ee.ethz.ch/matthmey/
[32] P. H. Christiansen, M. F. Kragh, Y. Brodskiy, and H. Karstoft, "UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor," Tech. Rep., Jul. 2019. [Online]. Available: http://arxiv.org/abs/1907.04011
[33] Z. Duan, B. Pardo, and C. Zhang, "Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, Nov. 2010. [Online]. Available: http://ieeexplore.ieee.org/document/5404324/
[34] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research," in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2014. [Online]. Available: https://www.semanticscholar.org/paper/MedleyDB%3A-A-Multitrack-Dataset-for-MIR-Research-Bittner-Salamon/e0ea1bb742b4958f5c84ece964ac9e3247d44015
[35] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders," Apr. 2017. [Online]. Available: http://arxiv.org/abs/1704.01279
[36] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. [Online]. Available: https://arxiv.org/pdf/1711.06798.pdf