
University of Science and Technology of Hanoi

ICT Department

GROUP PROJECT REPORT

Pitch detection algorithms

and application in musical key detection

Group members

NGUYEN Dang Hoa USTHBI4-055

NGUYEN Gia Khang USTHBI4-072

NGUYEN Thi Thu Linh USTHBI4-085

NGUYEN Duc Thang USTHBI4-139

NGUYEN Minh Tuan USTHBI4-155

Supervisor

Dr. TRAN Hoang Tung

University of Science and Technology of Hanoi

February, 2016


Contents

Table of Abbreviations
Abstract
1 Introduction
2 Project management status
3 Theoretical background and state of the art
3.1 Pitch detection algorithms
3.2 Musical key detection
3.3 State of the art
4 Scientific methods and materials
4.1 Tools
4.2 Pitch detection
4.2.1 YIN Estimator
4.2.2 Cepstrum Analysis
4.2.3 SIFT
4.3 Musical key detection
4.3.1 Generating a PCP
4.3.2 PCP comparison
4.3.3 JAVA implementation/Android application
5 Results and discussion
5.1 Pitch detection
5.2 Key detection
6 Conclusion
Acknowledgement
References
Appendix A


Table of Abbreviations

DFT: Discrete Fourier transform
f0: Fundamental frequency
GUI: Graphical user interface
IDFT: Inverse discrete Fourier transform
LPF: Low-pass filter
PCP: Pitch class profile
PDA: Pitch detection algorithm
PP: Pitch period
SIFT: Simplified Inverse Filter Tracking


Abstract

A study was conducted on the performance of pitch detection algorithms and their application in musical key detection using pitch class profiles. Sample sets for testing were constructed from voiced/unvoiced and clean/noisy sound signals. The tested algorithms are the YIN estimator, cepstrum analysis and Simplified Inverse Filter Tracking. Each algorithm was run on the samples, and the resulting pitch contours are discussed to compare their strengths and weaknesses in different environments. A pitch profile generator was developed in JAVA using all three algorithms, and an Android application was then created to perform musical key detection on sound sequences recorded by mobile phone users. Key detection results are shown and discussed for each of the tested algorithms.


1 Introduction

The fundamental frequency (f0) of a periodic signal is defined as its lowest frequency, the inverse of its period. f0 usually determines the subjective pitch of a sound, and f0 estimation, also known as pitch detection, has remained a popular research topic for many years. It is useful in various contexts, from digital music processing programs to voice encoders and speech support for the hearing impaired.

Despite being essential to signal processing systems, no ideal pitch detection algorithm (PDA) exists yet. Most PDAs perform well on a clean, clearly pitched signal; when the input carries heavy noise or multiple pitches, however, the results vary significantly. Whether a PDA is truly good therefore depends on the condition of the signals it is applied to, as each algorithm is strong in some scenarios but weak in others.

At the time this study started, a wide variety of PDAs were available, but we decided to focus on a selected few deemed most suitable for comparing their strengths and weaknesses within the same environment: detecting the musical key of a recorded voice sequence. The main goal is to establish a performance evaluation of these PDAs while developing a mobile application that implements them effectively.

The main part of this report comprises five sections besides the Introduction. Section 2 (Project management status) assesses the overall progress. Section 3 (Theoretical background and state of the art) provides a literature review on PDAs and basic music theory. Section 4 (Scientific methods and materials) describes the tools and step-by-step methods used during the project. Section 5 (Results and discussion) explains the results obtained from our implementations together with our comments, and Section 6 concludes the report.

2 Project management status

From the start of this project, our group and the supervisor held weekly meetings in ICT Lab to discuss the project goals and overall progress. Initially, as we had no prior experience with mobile programming, the five members were divided into three on literature review and two on the development of an Android application. We aimed for a pitch detection application at first, but decided to expand the scope after additional research on music processing and the need for a related application. Adjustments were made along the way, and the group eventually settled on a key detection application, since it was feasible with our knowledge at the time, whereas a more complex objective would have been difficult to achieve.

Details on the tasks and achievements are given in Table 2.1.

Task | In charge | Outcome
General research on digital sound processing | Everyone | Basic comprehension of digital signals, sampling, filtering, etc.
Develop basic Android application for sound recording | Thang, Tuan | Runnable application
Research on pitch detection algorithms | Khang, Linh, Hoa | Proposed three suitable algorithms
In-depth research on proposed algorithms and key detection | Khang, Linh, Hoa | Pseudocodes and MATLAB tests
Implement proposed algorithms in JAVA | Thang, Tuan | Done
Research on musical key detection | Hoa, Linh | Proposed a method to detect keys
Develop Android application for key detection with user interface | Thang, Tuan | Runnable application with simple GUI
Putting the report together | Linh, Khang | Done

Table 2.1. Project management and progress.


3 Theoretical background and state of the art

In this section, we provide an overview of the types of PDA chosen for investigation and their characteristics, basic music theory, and some of the most prominent research on these matters.

3.1 Pitch detection algorithms

Accurate and reliable pitch measurement is often extremely difficult for many reasons: a voice sequence is rarely a perfect train of periodic pulses, one sequence can be composed of a variety of PPs that are hard to separate, and so on. It is therefore necessary to have a grasp of current studies so we can apply them in this project.

PDAs are most commonly classified into three categories: time-domain, frequency-domain and hybrid. Time-domain methods run directly on the speech waveform, frequency-domain methods exploit the impulse series that arise in the frequency spectrum, and hybrid methods incorporate properties of both domains. From each category, we picked one signature PDA as follows:

• YIN Estimator (time-domain): The autocorrelation method, a prominent representative of this category, attempts to find the PP by evaluating the primary peaks of the input's autocorrelation. It handles mid to low frequencies well but makes too many errors in various applications. The YIN estimator, developed by de Cheveigné and Kawahara in the early 2000s, builds on the basic principles of the autocorrelation method with several modifications to solve this problem. It minimizes the difference between the input and its delayed copy, thus reducing the errors. De Cheveigné and Kawahara have shown that YIN can be implemented efficiently with low latency and has no upper boundary on the pitch search range.

• Cepstrum Analysis (frequency-domain): The cepstrum, a word play on "spectrum" first defined by Bogert et al. in a 1963 paper [1], is essentially the inverse discrete Fourier transform (IDFT) of the log magnitude of the spectrum of a signal. In 1967, Schroeder and Noll proposed an application of cepstrum analysis to pitch detection, based on the fact that the Fourier transform of a signal usually has regular peaks representing its harmonic spectrum [3]. Taking the cepstrum of a signal eliminates those regular peaks, thus removing the effects of the overtones in the human voice and making the pitch much easier to define.

• Simplified Inverse Filter Tracking (hybrid): This was first proposed by Markel in 1972 [5] as a simple algorithm that could be realized in real time yet retained the positive traits of both the autocorrelation and cepstral methods. The algorithm promises fast runtime through a composition of elementary computations, while also offering to classify voiced/unvoiced regions of an input. Its core operations are based on a simplified version of digital inverse filtering, hence the name "Simplified Inverse Filter Tracking" (from here on referred to as SIFT).

3.2 Musical key detection

In music, the term note specifies frequencies within a certain pitch range that the human ear perceives as similar and can hardly distinguish. Any two notes whose frequency ratio is a power of two are grouped into a pitch class. Generally, pitches are divided into 12 classes: C, C♯(D♭), D, D♯(E♭), E, F, F♯(G♭), G, G♯(A♭), A, A♯(B♭), and B. Within a pitch class, notes are distinguished by adding a number after the class name. For instance, C3 has a lower frequency than C4, C5 and so on.

A piece of music is an ordered set of notes. To create good music, however, this set is usually limited to fewer than twelve pitch classes; in most cases the number is around seven. These specific classes in a song, which can be denoted as its scale, form an abstract concept called tonality. Tonality derives mostly from the human sense of a song rather than any exact definition, which means that two pieces of music with the same tonic are perceived as relatively similar.

Figure 3.1. Example of main pitch classes within the C scale.


Most music is composed in a major or minor scale, and each scale has a "key" note (for example, the C major scale is in the key of C major), giving a total of 24 major/minor scales. Determining the key of a song is crucial to musicians, yet extremely difficult, because there is no mathematical formula to define or even guess it after capturing the set of notes in the song.
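The octave grouping above can be sketched as a small routine that maps a frequency to its pitch class and octave. This is our own illustration, assuming the standard equal-tempered scale anchored at A4 = 440 Hz; the class and method names are hypothetical, not part of the project's code.

```java
public class PitchClassDemo {
    static final String[] CLASSES = {"C", "C#", "D", "D#", "E", "F",
                                     "F#", "G", "G#", "A", "A#", "B"};

    // Map a frequency to a note name such as "A4", using the equal-tempered
    // scale with A4 = 440 Hz (12 semitones per octave, each a factor 2^(1/12)).
    static String noteName(double freqHz) {
        // Semitones above C0; A4 sits 57 semitones above C0 in this convention.
        int semitonesFromC0 = (int) Math.round(12 * Math.log(freqHz / 440.0) / Math.log(2)) + 57;
        int octave = semitonesFromC0 / 12;
        int pitchClass = semitonesFromC0 % 12;
        return CLASSES[pitchClass] + octave;
    }

    public static void main(String[] args) {
        System.out.println(noteName(440.0));   // A4
        System.out.println(noteName(261.63));  // C4
        System.out.println(noteName(130.81));  // C3: same pitch class, one octave lower
    }
}
```

Doubling the frequency raises the octave number by one while keeping the pitch class, which is exactly the power-of-two grouping described above.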

3.3 State of the art

Throughout the history of pitch tracking, few thorough studies comparing different types of detection methods have been conducted. Most research focuses on the properties and applications of a single method, owing to the difficulties of selecting algorithms to evaluate, setting a reasonable standard of comparison and compiling a comprehensive database. For the fundamental part of our study, we decided to look at the papers that introduced the chosen PDAs:

• YIN, a fundamental frequency estimator for speech and music [2]

• Cepstrum Pitch Determination [6]

• The SIFT Algorithm for Fundamental Frequency Estimation [5]

Musical key detection using pitch class profiles (PCPs), on the other hand, has been researched extensively, with datasets of various genres generating different base key profiles [7]. The general goal of such research tends to be modeling the principle of key detection in the human brain. For practical purposes, we focused on one algorithm proposed in a 2007 diploma thesis from the Vienna University of Technology [8].

4 Scientific methods and materials

In this section we discuss our approach to PDA implementation, key detection and mobile application development. The step-by-step process we propose may not be optimal, but it is simple enough to deploy with our current skills and tools.


4.1 Tools

For this study, the following software and tools were used:

• IntelliJ 14.1.5 / Eclipse 4.5.1

• Android Studio 1.5

• Audacity

• Android phones

4.2 Pitch detection

Initial experiments were conducted in JAVA. We implemented the three PDAs according to their proposed formulas and ran them on a set of pre-recorded sound samples to see the margin of difference in their pitch estimates.

The samples used are of a female voice singing "ah" at pitches from G♯3 to B3.

The steps of each PDA are described as follows:

4.2.1 YIN Estimator

First, a difference function is applied to the input signal x_t:

d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2

d_t(\tau) is zero at zero lag and often nonzero at the period because of imperfect periodicity, so a cumulative mean normalized difference function is applied to avoid the zero-lag dip, normalize the function for the next step and reduce "too high" errors:

d'_t(\tau) = 1, if \tau = 0
d'_t(\tau) = d_t(\tau) / [ (1/\tau) \sum_{j=1}^{\tau} d_t(j) ], otherwise

An absolute threshold is applied to reduce "too low" errors, then each local minimum of d'_t is subjected to parabolic interpolation to refine the PP estimate.

Finally, for each index t, we search for a minimum of d'_\theta(T_\theta) for \theta within [t - T_max/2, t + T_max/2], where T_\theta is the estimate at time \theta and T_max is 25 ms. The best local estimate obtained is the pitch of x_t.
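The difference function, the normalization and the absolute-threshold step above can be sketched in JAVA roughly as follows. This is a minimal illustration on a toy sine signal, not the project's actual implementation; the method names, the 0.1 threshold and the local-minimum walk are our own choices.

```java
public class YinSketch {
    // Difference function d_t(tau) over a window of W samples starting at t.
    static double[] difference(double[] x, int t, int W, int maxTau) {
        double[] d = new double[maxTau + 1];
        for (int tau = 0; tau <= maxTau; tau++) {
            for (int j = 1; j <= W; j++) {
                double diff = x[t + j] - x[t + j + tau];
                d[tau] += diff * diff;
            }
        }
        return d;
    }

    // Cumulative mean normalized difference d'_t(tau): 1 at tau = 0,
    // removing the zero-lag dip of the plain difference function.
    static double[] cmnd(double[] d) {
        double[] dp = new double[d.length];
        dp[0] = 1;
        double sum = 0;
        for (int tau = 1; tau < d.length; tau++) {
            sum += d[tau];
            dp[tau] = d[tau] * tau / sum;
        }
        return dp;
    }

    // Absolute-threshold step: take the first dip of d' below the threshold
    // and walk down to its local minimum, the period estimate in samples.
    static int estimatePeriod(double[] x, int t, int W, int maxTau, double threshold) {
        double[] dp = cmnd(difference(x, t, W, maxTau));
        for (int tau = 2; tau < dp.length; tau++) {
            if (dp[tau] < threshold) {
                while (tau + 1 < dp.length && dp[tau + 1] < dp[tau]) tau++;
                return tau;
            }
        }
        return -1;  // no candidate below the threshold
    }

    public static void main(String[] args) {
        // Toy input: a 100 Hz sine sampled at 8 kHz -> true period of 80 samples.
        int fs = 8000;
        double[] x = new double[2048];
        for (int n = 0; n < x.length; n++) x[n] = Math.sin(2 * Math.PI * 100 * n / fs);
        int period = estimatePeriod(x, 0, 512, 200, 0.1);
        System.out.println("period = " + period + " samples, f0 = " + (double) fs / period + " Hz");
    }
}
```

On the toy sine the estimate lands on the true 80-sample period; the parabolic interpolation and best-local-estimate search of the full YIN method are omitted here for brevity.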

4.2.2 Cepstrum Analysis

The cepstrum of a signal is defined by the formula:

c_n = F^{-1}\{\log(|F(x_n)|)\}

For pitch detection, the cepstrum of a windowed frame of the signal is needed and is defined through the Fourier series:

c_n = \frac{1}{N} \sum_{k=0}^{N-1} \log\left( \left| \sum_{m=0}^{N-1} x_m e^{-jk\frac{2\pi}{N}m} \right| \right) e^{jk\frac{2\pi}{N}n}

The pitch can then be estimated by picking the peak of the resulting signal.

Figure 4.1. Block diagram of cepstrum analysis: x_n → DFT → X_k → log → IDFT → c_n.
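The DFT → log → IDFT chain above can be sketched as follows, using a naive O(N²) DFT for clarity rather than an FFT. The class and method names, the log floor and the harmonic test tone are our own illustrative choices, not the project's code.

```java
public class CepstrumSketch {
    // Real cepstrum of one frame: c = IDFT(log |DFT(x)|).
    static double[] realCepstrum(double[] x) {
        int N = x.length;
        double[] logMag = new double[N];
        for (int k = 0; k < N; k++) {
            double re = 0, im = 0;
            for (int n = 0; n < N; n++) {
                double a = -2 * Math.PI * k * n / N;
                re += x[n] * Math.cos(a);
                im += x[n] * Math.sin(a);
            }
            // Small floor avoids log(0) for bins with no energy.
            logMag[k] = Math.log(Math.hypot(re, im) + 1e-12);
        }
        double[] c = new double[N];
        for (int n = 0; n < N; n++) {
            double sum = 0;
            for (int k = 0; k < N; k++) {
                sum += logMag[k] * Math.cos(2 * Math.PI * k * n / N);
            }
            c[n] = sum / N;  // real part of the inverse DFT
        }
        return c;
    }

    // Pitch period estimate: quefrency of the cepstral peak in a plausible lag range.
    static int peakQuefrency(double[] c, int minLag, int maxLag) {
        int best = minLag;
        for (int n = minLag; n <= maxLag; n++) {
            if (c[n] > c[best]) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        int fs = 8000, N = 512;
        double[] x = new double[N];
        // Harmonic-rich test tone at 200 Hz (period = 40 samples).
        for (int n = 0; n < N; n++) {
            for (int h = 1; h <= 5; h++) x[n] += Math.sin(2 * Math.PI * 200 * h * n / fs) / h;
        }
        int q = peakQuefrency(realCepstrum(x), 20, 160);
        System.out.println("Cepstral peak at quefrency " + q + ", f0 ≈ " + (double) fs / q + " Hz");
    }
}
```

Because the overtones appear as a regular ripple in the log spectrum, the cepstral peak falls at the quefrency of the period (around 40 samples here), which is what makes the method robust to harmonics.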

4.2.3 SIFT

First, the input signal s_n, with sampling frequency 10 kHz, is low-pass filtered with a cutoff at f_c = 0.8 kHz. The filter output x_n is downsampled by a 5:1 ratio to reduce the number of operations in later steps while retaining accuracy.

The signal is then analyzed frame by frame, with a 64-sample frame length and a 32-sample frame shift. A 4th-order linear predictive analysis is performed to obtain a set of coefficients, and the frame is inverse filtered using this set to produce a residual signal.

The autocorrelation of the residual is then searched for the primary peak, which determines f0. Finally, the autocorrelation function is interpolated in the neighborhood of the detected peak to increase the resolution of f0.

Figure 4.2. Block diagram of the SIFT algorithm: s_n → LPF 0.8 kHz → x_n → 5:1 downsampling → w_n → inverse filter → y_n → autocorrelation → r_n → interpolation → f0.

Full details of the formulas involved can be found in Appendix A.
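The autocorrelation and peak-search stages of the pipeline can be sketched as follows. The 2 kHz rate and 64-sample frame match the description above, while the method names, the test tone and the minimum-lag guard are our own assumptions.

```java
public class AutocorrSketch {
    // Autocorrelation r_n of one 64-sample (residual) frame.
    static double[] autocorr(double[] y) {
        int N = y.length;
        double[] r = new double[N];
        for (int n = 0; n < N; n++) {
            for (int j = 0; j < N - n; j++) r[n] += y[j] * y[j + n];
        }
        return r;
    }

    // Primary peak past a minimum lag: the largest autocorrelation value in
    // the search range gives the pitch period in samples.
    static int primaryPeak(double[] r, int minLag) {
        int best = minLag;
        for (int n = minLag; n < r.length; n++) {
            if (r[n] > r[best]) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        // 2 kHz downsampled rate as in SIFT; a 125 Hz tone has a 16-sample period.
        int fs = 2000;
        double[] y = new double[64];
        for (int n = 0; n < 64; n++) y[n] = Math.sin(2 * Math.PI * 125 * n / fs);
        int lag = primaryPeak(autocorr(y), 4);
        System.out.println("Pitch period: " + lag + " samples, f0 = " + (double) fs / lag + " Hz");
        // prints: Pitch period: 16 samples, f0 = 125.0 Hz
    }
}
```

In the full algorithm this peak search runs on the inverse-filtered residual rather than the raw frame, and the interpolation of Appendix A then refines the peak position.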

4.3 Musical key detection

The basic process consists of three steps: pitch detection, pitch class profile (PCP) generation and PCP comparison.

4.3.1 Generating a PCP

A pitch class profile (PCP) is a 12-dimensional vector, each component of which represents the intensity of one pitch class. Generating a PCP is the first step of the key detection process, since the PCP is then compared with reference profiles to find the key that fits it best.
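A PCP of this kind can be sketched as a simple histogram over the 12 pitch classes. This is our own minimal illustration (one unit vote per detected frame, A4 = 440 Hz reference), not the project's intensityNote() implementation, which accumulates intensities instead.

```java
import java.util.List;

public class PcpSketch {
    // Build a 12-bin pitch class profile from a list of detected f0 values (Hz).
    // Each estimate votes into the bin of its pitch class; bin 0 = C, ..., 11 = B.
    static double[] buildPcp(List<Double> f0s) {
        double[] pcp = new double[12];
        for (double f : f0s) {
            if (f <= 0) continue;  // skip unvoiced/invalid frames
            int semitone = (int) Math.round(12 * Math.log(f / 440.0) / Math.log(2)) + 57;
            int bin = ((semitone % 12) + 12) % 12;
            pcp[bin] += 1.0;
        }
        return pcp;
    }

    public static void main(String[] args) {
        // Two C frames, one E frame, one G frame.
        double[] pcp = buildPcp(List.of(261.6, 262.0, 329.6, 392.0));
        System.out.println(java.util.Arrays.toString(pcp));
    }
}
```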

4.3.2 PCP comparison

The generated PCP is compared with the 24 standard PCPs of the 24 keys to find the closest one. In this project, we used the linear comparison algorithm, which several related papers have shown to give the closest result.

The base key profiles used are the ones derived by Krumhansl and Kessler in 1982 [4].


Figure 4.3. Example: C minor key profile of Krumhansl and Kessler (intensity of each pitch class from C to B).

4.3.3 JAVA implementation/Android application

Before moving on to Android, a test version was developed in JAVA. The key detection part of the program runs as follows:

• After obtaining the pitch array from the buffers created, the output is put through the function intensityNote() to generate a PCP vector (an array of Note objects) for the whole song:

    public Note[] intensityNote(List<Note> noteList) {
        Note[] notes = Note.copy(Note.NOTES);
        for (int i = 0; i < notes.length; i++) {
            double intensity = 0;
            for (Note item : noteList) {
                if (notes[i].equals(item)) {
                    intensity += item.getIntensity();
                }
            }
            notes[i].setIntensity(intensity);
        }
        return notes;
    }

• This "raw" PCP is normalised to the range 0.0 to 1.0 so that it is compatible with the subsequent comparison process. The "loudest" note (the note with the highest intensity) is set to 1, and vice versa:

    public void normalize(Note[] notes) {
        double max = notes[0].getIntensity();
        double min = notes[0].getIntensity();
        for (Note note : notes) {
            if (max < note.getIntensity()) max = note.getIntensity();
            if (min > note.getIntensity()) min = note.getIntensity();
        }
        for (Note note : notes) {
            double intense = 1 - ((max - note.getIntensity()) / (max - min));
            note.setIntensity(intense);
        }
    }

After that, we obtain the PCP, in which each key has a unique range (for example...).

    for (Note note : notes) {
        profile[i] = note.getIntensity();
        i++;
    }

• Using findKey(profile) to compare with the key database, we get the key output. In this project, we utilised the linear comparison algorithm: the key whose vector has the smallest distance to the generated PCP is assigned as the main key of the song:

    public static Key findKey(double[] notes) {
        Key key = new Key();
        double min_error = Double.MAX_VALUE;
        for (Key k : Key.KEYS) {
            double distance = 0;
            for (int i = 0; i < notes.length; i++) {
                distance += (notes[i] - k.getSequence()[i].getIntensity())
                          * (notes[i] - k.getSequence()[i].getIntensity());
            }
            if (distance < min_error) {
                min_error = distance;
                key = k;
            }
            System.out.println(k.getName() + ": " + distance);
        }
        return key;
    }

The idea of the Android application is to let users sing and record a sequence of notes from a song, analyze the input and return the appropriate key. The user can choose which of the three algorithms to use for pitch detection.

The application was tested on several mobile phones with different OS versions and hardware specifications. Some of them are:

• Asus ZenFone 5 501CG, CPU x86 Intel, OS version: 4.3/5.0


• Xiaomi Redmi Note 2, CPU ARM Mediatek, OS version: 5.0

• Vega Sky A850, CPU ARM Snapdragon, OS version: 4.1.2/4.4.4/5.0

5 Results and discussion

In this section we present the results obtained from the experiments described in Section 4, along with our discussion and comments.

5.1 Pitch detection

Results from using the YIN estimator, cepstrum analysis and SIFT on 16 cleanly pitched sound samples are presented in Figure 5.1 and Table 5.2.

Figure 5.1. Comparison of pitch estimates (Hz) from the three PDAs against the true pitch over the 16 cases. The true pitch curve indicates the ideal pitches from G♯3 to B3.


Case | Base | YIN | Cepstrum | SIFT
1 | 207.6 | 209.7 | 206.0 | 210.5
2 | 220.0 | 218.7 | 218.0 | 216.2
3 | 233.1 | 234.2 | 234.0 | 235.3
4 | 246.9 | 244.8 | 244.0 | 242.4
5 | 261.6 | 259.2 | 262.0 | 258.0
6 | 277.2 | 276.4 | 278.0 | 275.8
7 | 293.7 | 291.6 | 292.0 | 296.3
8 | 311.1 | 306.7 | 306.0 | 307.7
9 | 329.6 | 326.5 | 326.0 | 320.0
10 | 349.2 | 348.8 | 350.0 | 347.8
11 | 370.0 | 368.5 | 366.0 | 363.6
12 | 392.0 | 389.2 | 390.0 | 381.0
13 | 415.3 | 409.8 | 408.0 | 421.0
14 | 440.0 | 436.4 | 436.0 | 444.4
15 | 466.2 | 460.7 | 458.0 | 470.6
16 | 493.9 | 488.1 | 490.0 | 500.0

Table 5.2. Obtained f0 (Hz) as plotted in Figure 5.1.

The curves produced by the pitch estimates of all three algorithms do not deviate far from the true pitch curve. The higher the pitch, the less accurate the results (from a margin of less than 2 Hz in the first few samples to 7-10 Hz in the last ones). However, we only use the estimates to perform key detection, and the frequency gap between adjacent notes grows as pitch rises, so a margin of up to 10 Hz is good enough.

A plot of the pitch contours of each case showed that SIFT provided the most consistent results over the frames analyzed (even a perfect contour in cases 8 and 16), except for very occasional surges or dips. This may be due to the lack of a voiced/unvoiced decision criterion, since even in voiced speech some frames can be unvoiced and yield unrealistic f0 values.

It is worth noting, however, that due to inaccuracies in human hearing while making the samples, the real pitches of our samples are not exactly the true pitches of the chosen note range. YIN and cepstrum generated almost the same results, with a margin of difference within 2 Hz in 12 of 16 cases, making it probable that their estimates are closer to the real pitches than SIFT's.

We also tested the algorithms on noisy and unvoiced samples to expose any weaknesses. SIFT and cepstrum returned the same results on the 16 samples with additive white Gaussian noise (AWGN) at signal-to-noise ratios from 0.001 dB to 1 dB, but YIN could not detect the pitch at all. SIFT even yielded exactly the same numbers, probably because the AWGN was eliminated during prefiltering. On the other hand, with unvoiced but toned signals, cepstrum performed poorly by its nature, generating irregular, too-high or too-low frequencies.

5.2 Key detection

We successfully developed a working application, albeit with a very simple user interface. As shown in Figure 5.3, the GUI consists of a pair of record/stop buttons for sound recording and options to choose which algorithm the key detection process is based on.

Figure 5.3. Mobile app GUI while recording and after key detection.

To test the accuracy of the application, we recorded 8 melodies in different keys, tried the different algorithms and collected the results shown in Table 5.4.

The majority of key detections are good, with some inaccuracies. This is a reasonable outcome, as accuracy depends on many factors: the quality of the recordings, whether the recorded tones are truly of the expected pitches, whether the base key profiles used are suitable, and so on. The key profile in particular is extremely important, because it was derived from a particular dataset of inputs and can be good or bad depending on the dataset's size and nature. The wrongly detected keys in this experiment all belong to the harmonic scale of the true key and differ little in pitch class intensity, so we can conclude that the algorithms work very close to the expected results.

Case | True key | YIN | SIFT | Cepstrum
1 | C major | C major | F major | C major
2 | C major | G major | C major | G major
3 | E minor | C minor | E minor | E minor
4 | F major | F major | F major | F major
5 | G major | D major | E minor | E minor
6 | G minor | G minor | G minor | G minor
7 | A minor | A minor | E major | C major
8 | A major | A major | A major | A major

Table 5.4. Key detection test results on the mobile application. All algorithms missed 3 out of 8 cases.

6 Conclusion

At the end of this project, we have succeeded in using the three PDAs for pitch detection. While all of them deliver good f0 estimates in general, each has its own pros and cons. Our experiments show that, applied to clean signals, the YIN estimator and cepstrum give closer results than SIFT, but YIN cannot detect the pitch of heavily noisy signals and cepstrum does not work with unvoiced inputs. We did not implement the voiced/unvoiced decision part of SIFT, yet our version worked reasonably well on all tested samples, albeit with slightly less accurate results, and the calculations in SIFT are simple, making it easy to implement on any platform. Our pitch and key detection programs, progressing from MATLAB to JAVA and Android Studio, were developed successfully, though not without a long period spent optimizing the algorithms to shorten runtime and make them more suitable for mobile phones. Overall, we have demonstrated both the abilities and the limitations of PDAs; even though they are not perfect solutions to pitch and key detection, we can still take advantage of them.


Nevertheless, over the course of this project, our group experienced various difficulties. One of the most significant problems was the lack of recent research papers and theoretical resources, as related documents are usually not publicly available, and those that are tend to be intended for experienced readers; we therefore had to rely mostly on the original research on the PDAs along with the supplements they provided. Another difficulty was the time constraint, as we had to absorb a large amount of new information and skills beyond our understanding at the start of the project, which we could not have managed without the advice of our supervisor. Task division and collaboration between group members also proved problematic at first, but we improved our teamwork skills over time and were able to overcome this obstacle.

Although the project has come to an end, from the knowledge gained during it we are aware that the application has the potential to become a fully usable and marketable product. We look forward to further studies on signal and music processing in order to improve it into a more complete version.

Acknowledgement

We would like to express our heartfelt appreciation to Dr. TRAN Hoang Tung for the patience and enthusiasm with which he guided us during the course of this project. Without his help, we would not have been able to complete our work successfully.

Our gratitude also goes to the staff of the Information and Communication Technology Department and ICT Lab for their valuable assistance.


References

[1] Bogert, B. P., Healy, M. J. R., and Tukey, J. W. (1963). The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking. Proceedings of the Symposium on Time Series Analysis, Chapter 15, 209-243.

[2] De Cheveigné, A., and Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111, 1917-1930.

[3] Gerhard, D. (2003). Pitch Extraction and Fundamental Frequency: History and Current Techniques. Technical report, Dept. of Computer Science, University of Regina.

[4] Krumhansl, C. L., and Kessler, E. J. (1982). Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychological Review, 89-4, 334-368.

[5] Markel, J. D. (1972). The SIFT algorithm for fundamental frequency estimation. IEEE Trans. Audio Electroacoust., AU-20, 367-377.

[6] Noll, A. M. (1967). Cepstrum Pitch Determination. Journal of the Acoustical Society of America, 41-2, 293-309.

[7] Temperley, D., and Marvin, E. W. (2007). Pitch-Class Distribution and the Identification of Key. Music Perception, 25-3, 193-212.

[8] Zenz, V. (2007). Automatic Chord Detection in Polyphonic Audio Data. Diploma thesis, Vienna University of Technology.


Appendix A

The formulas involved in the SIFT algorithm, adapted from the paper by Markel [5], are described as follows:

1. The output x_n after passing the input sequence s_n through the low-pass filter (a first-order section followed by a second-order section) is obtained by:

u_n = a_1 s_n + a_2 u_{n-1}
x_n = a_3 u_n + a_4 x_{n-1} + a_5 x_{n-2}

where x_n = 0 and u_n = 0 if n < 0, and

a_1 = 1 - e^{-\alpha_1 T}
a_2 = e^{-\alpha_1 T}
a_3 = 1 - 2 e^{-\alpha_2 T} \cos(\beta_2 T) + e^{-2\alpha_2 T}
a_4 = 2 e^{-\alpha_2 T} \cos(\beta_2 T)
a_5 = -e^{-2\alpha_2 T}
\alpha_1 = (0.3572) 2\pi f_c
\alpha_2 = (0.1786) 2\pi f_c
\beta_2 = (0.8938) \pi f_c
f_c = 0.8 kHz
T = 0.1 ms

2. Assuming x_n is at a 10 kHz sampling rate, the downsampled sequence w_n at 2 kHz is created by taking every fifth sample of x_n.

Note: From step 3 onwards, w_n is analyzed frame by frame; each frame has 64 samples and each frame shift advances 32 samples. After the analysis we obtain a sequence of f0 values consistent with the pitch contour of s_n. It is therefore assumed in steps 3-5 that the input is a 64-sample sequence and that the final output is collected into a list of pitches.

3. The coefficients to the 4th inverse filter is computed by the autocorrelation equations

4∑i=1

aipi−j = −pj

with the coefficients calculated by pi =∑N−1−j

n=0 wnwn+j with j = 0, 1, . . . , 4. The filter only

Page 20 of 21

Page 22: ICT-GroupProject-Report2-NguyenDangHoa_2

GROUP PROJECT REPORT Pitch detection algorithms

uses 4 coefficients, so it is possible to obtain a solution for ai from the set of equations

a1p0 + a2p1 + a3p2 + a4p3 = −p1a1p1 + a2p0 + a3p1 + a4p2 = −p2a1p2 + a2p1 + a3p0 + a4p1 = −p3a1p3 + a2p2 + a3p1 + a4p0 = −p4

The inverse filter output yn is then calculated as yn = wn +∑4

i=1 aiwn−i where wn = 0 when

n < 0.
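With only four unknowns, the system above can be solved directly. The following sketch (our own illustration, not Markel's or the project's code) builds the matrix exactly as written and applies Gaussian elimination with partial pivoting:

```java
public class LpcSolveSketch {
    // Solve the 4x4 linear system A a = b by Gaussian elimination with partial pivoting.
    static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int row = col + 1; row < n; row++)
                if (Math.abs(A[row][col]) > Math.abs(A[piv][col])) piv = row;
            double[] tmpRow = A[col]; A[col] = A[piv]; A[piv] = tmpRow;
            double tmp = b[col]; b[col] = b[piv]; b[piv] = tmp;
            for (int row = col + 1; row < n; row++) {
                double f = A[row][col] / A[col][col];
                for (int k = col; k < n; k++) A[row][k] -= f * A[col][k];
                b[row] -= f * b[col];
            }
        }
        double[] a = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double s = b[row];
            for (int k = row + 1; k < n; k++) s -= A[row][k] * a[k];
            a[row] = s / A[row][row];
        }
        return a;
    }

    // Assemble the symmetric matrix of the normal equations from p_0..p_4
    // exactly as in the system above and solve for a_1..a_4.
    static double[] lpcCoefficients(double[] p) {
        double[][] A = {
            {p[0], p[1], p[2], p[3]},
            {p[1], p[0], p[1], p[2]},
            {p[2], p[1], p[0], p[1]},
            {p[3], p[2], p[1], p[0]}
        };
        double[] b = {-p[1], -p[2], -p[3], -p[4]};
        return solve(A, b);
    }

    public static void main(String[] args) {
        // Illustrative autocorrelation values p_0..p_4, not measured data.
        double[] a = lpcCoefficients(new double[]{2, 1, 0, 0, 0});
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

For the illustrative values above the solution works out to a = (-0.8, 0.6, -0.4, 0.2); in the real algorithm the p_j come from the downsampled frame w_n.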

4. f0 can be obtained from 64 samples of the autocorrelation sequence of the inverse filter output, defined as r_n = \sum_{j=0}^{N-1-n} y_j y_{j+n}, with n = 0, 1, ..., 63. The estimated pitch period is the distance (in ms) from r_0 to the first major peak of r.

5. For a more accurate estimate, the area around the peak is interpolated at a ratio of 4 to 1. Say the peak is at position N; define \gamma_a, a = 0, 1, ..., 8, as the interpolated sequence around r_N, with \gamma_0 = r_{N-1}, \gamma_4 = r_N and \gamma_8 = r_{N+1}. The remaining \gamma_a can be computed with the simplified interpolation equations:

[\gamma_{\pm 3/4}, \gamma_{\pm 1/2}, \gamma_{\pm 1/4}]^T =
[ 0.879124  0.321662  -0.150534 ]
[ 0.637643  0.636110  -0.212208 ]  [\gamma_{\pm 1}, \gamma_0, \gamma_{\mp 1}]^T
[ 0.322745  0.878039  -0.158147 ]

We re-examine \gamma_a to find the precise peak at index a. The pitch period of the frame in question is finally obtained as P = (N + (a - 4)/4)/2 ms, and f0 in kHz is F_0 = 1/P.
