
University of Miami

An Adaptive Time-Frequency Representation with Re-Synthesis Using Spectral Interpolation

By

Abhijeet Tambe

A Research Project

Submitted to the faculty of the University of Miami in partial fulfillment of the requirements for the degree of Master of Science in Music Engineering

Coral Gables, Florida
April 2000


Abhijeet Tambe (Master of Science, Music Engineering Technology)

An Adaptive Time-Frequency Representation with Re-Synthesis Using Spectral Interpolation

Abstract of Master's Research Project at the University of Miami

Research project supervised by William Pirkle

Abstract: The conventional method for storing and transmitting digital audio is as PCM samples in the time domain. In this study, an adaptive time-frequency analysis is performed on the audio signal, which is then stored in the form of a 3-D matrix containing time, frequency, and magnitude information. The validity of this method is tested by using a spectral interpolation algorithm to re-synthesize the signal so that the synthesized signal can be compared to the original. Among the many applications for such a representation, the possibility of audio compression is explored. The results of applying this procedure to a few audio signals are plotted, along with the results of listening tests.


Acknowledgements

I would like to thank the members of my thesis committee for their encouragement and support. Will Pirkle in particular, who was my main advisor, was patient with my repeated changes of focus in this study and always had great ideas to offer. Thank you all for everything.

I would like to acknowledge and thank my family members, who were with me every step of the way. Ashwini, thanks for the midday phone calls, and Baba, thanks for keeping the faith. Shankar, thanks machcha. Mama, Mami, Azoba, Aparna, Atul, Kaka, Kaku, Atya, Shruti, Ganesh and of course Vaini always encouraged my scholarly efforts. Thank you all for the support.

The Dreggs are the next group of people to thank on my list. Ajay, your wisdom is beyond your years. Archana, your dissertation was the guiding light for this work. Jatin, you are The One. Ex-commander Vishal, I hope your commandership is restored to you someday. Ambu, your emails are the best. Srinath's psychotic work schedule was a constant reminder of how my life is o.k. Santosh, I hope you find what you are looking for. Seetha, I wish you happiness. Bansi, I hope you're sending it. Raj, I don't know what to say.

Last and definitely not least are my friends in Miami. Kimbo, I won't forget your contribution to society and Lausims in general. Ralph, we did it man, we finished our theses! Ryan, you are the best, homeboy. Mathilde, I'm freezing. Abby, I'm hooooooome! Nele, I hope you swim with the dolphins. Ashwanth, you have to 'SENNNND IT'. Siddarth, you shall execute the perfect cover drive. Karthik, you are also the best, homeboy. Alex, you're crazy man, don't change. Ali, thanks for everything. Mike and Amy, you guys are so cool. James, thanks for introducing me to club bridge. Jay, you are my mentor for life. Daisuke, you shall design the ultimate loudspeaker. Margarita, thanks for introducing me to Colombian food. Anna Maria, thanks for attempting to teach me how to dance salsa. Anu, thanks for the monitor. Sylvia, may you be Dr. Sylvia very soon. Jasmin, you already know everything; I don't know what to say. Thanks, Paul Griffith, for introducing me to concert recording. Thanks, Kai, for doing the electronic parts in that cool song we recorded. Chris, you're a great drummer; I hope you shine. Naseer, thanks for the Isuzu Trooper. Lorna and Shayne, I'm glad I met you. Thank you, Titanic, for some memorable Wednesday nights. Peggy, you are the best. Zammy Zammy, Benny-looking-maan. Thanks, all of you; I hope we stay in touch.


Table of Contents

Chapter 1 – Introduction

Chapter 2 – Analysis/Synthesis Techniques
2.1 – Digital signals and the sampling theorem
2.2 – Frequency domain transformation
2.3 – Time/Frequency analysis
2.4 – Filter banks
2.5 – Synthesis techniques
2.6 – Classical and modern studies of timbre
2.7 – The Proposed Scheme

Chapter 3 – Psychoacoustics
3.1 – Auditory analysis
3.2 – Frequency resolution of the human ear
3.3 – Masking
3.4 – Critical bands

Chapter 4 – Non-orthogonal Signal Decomposition
4.1 – Theory and Computation
4.2 – Summary

Chapter 5 – Adaptive Time/Frequency Analysis
5.1 – Why Adaptation?
5.2 – Adaptation
5.3 – Improved Adaptation Algorithm
5.4 – Time/Frequency Representation

Chapter 6 – Synthesis Using Spectral Interpolation
6.1 – Why Spectral Interpolation?
6.2 – Frame to Frame Peak Matching
6.3 – Synthesis Using Cubic Phase Interpolation

Chapter 7 – Results and Conclusions
7.1 – Results for Classical Music
7.2 – Results for Guitar Chord
7.3 – Results for Clarinet Patch
7.4 – Results for Speech
7.5 – Listening Tests
7.6 – Conclusions
7.7 – Improvements
7.8 – Summary


Chapter 1 – Introduction

Digital audio data is generally stored in the form of digital samples in time. These samples are obtained by sampling the continuous analog signal f(t) at equal time intervals T, to obtain the digitized version of the signal, f(nT). The digital signal is stored as a sequence of samples in time and can be considered a discrete-time signal. The stored signal is a function of time and is said to be in the time domain. It is also possible to view the signal in some other domain by applying an appropriate transform to it. As an example, the Fourier transform, applied to a function of time, results in a function of frequency, thereby enabling a frequency domain view of the signal. In general, if the transform is chosen properly, the representation of the signal in the other domain is complete and no information is lost. The advantage of representing the signal in some other domain is that it reveals information inherent in the signal that might not be obvious in the time domain.

To view the signal in some other domain, an analysis has to be performed on the signal to extract the critical components required for the alternate representation. After this is done, the necessary signal processing functions are performed on the transformed signal. In order to convert the signal back into the time domain, a re-synthesis has to be performed; this process is generally an inverse of the analysis procedure. This sequence is shown in figure 1.1.

f(n) → [Analysis] → F(m) → [Signal Processing Functions] → F_n(m) → [Synthesis] → f_n(n)

Figure 1.1 – General analysis/synthesis model for audio signals

Examples of this analysis/synthesis method include compression techniques such as MPEG (Moving Picture Experts Group), which achieve data reduction by finding and discarding redundancies. With respect to figure 1.1, in the case of some MPEG layers, the analysis block involves using filter banks and later transforming the signal into the frequency domain. The signal processing function is the process of finding redundancies in the frequency domain and discarding them to reduce data. This work is usually done on the encoder side, and at the end of this process the signal is represented at a much-reduced bit rate. The synthesis involves using an inverse transform to reconstruct the signal in the time domain. This is done by the decoder, which simply reconstructs the signal from its coded format and plays it back.

Any audio signal consists of frequency components whose magnitudes and actual frequencies vary dynamically with time. An alternate representation of the same signal, which tracks these frequency components and shows how they vary with time, could be very useful. This could be used for applications ranging from audio compression to pitch shifting, as well as time compression and expansion. The motivation for this study is to develop such an alternate time-frequency representation for audio signals. The validity of this method depends on whether it encodes enough information to reconstruct a synthesized signal that is very similar in quality to the original signal. A synthesis method is developed to reconstruct the time signal based on the new representation, so that the original signal and the synthesized version can be compared. From an applications standpoint, the possibilities for audio compression using this method are explored.

Broadly, this study can be classified into four sections. The first is the theory section, presented in chapters 2 and 3; these chapters develop the relevant theory that is used in later chapters. The second section deals with the analysis and time-frequency representation of the signal, and is covered in chapters 4 and 5. The third section deals with the re-synthesis of the time signal from this new representation, and is described in chapter 6. The fourth and final section contains the results and conclusions, which are documented in chapter 7.


Chapter 2 – Analysis/Synthesis Techniques

This chapter starts with a review of digital audio signals and the Fourier analysis method for transforming signals into the frequency domain. Current time-frequency analysis methods are described along with their limitations. Various synthesis techniques based on analysis are then discussed, followed by a review of classical and modern studies of timbre, which clarifies why the proposed scheme is necessary. The chapter ends with a brief description of the proposed scheme.

2.1 DIGITAL SIGNALS AND THE SAMPLING THEOREM

Digitizing a signal is the process of converting a continuous (analog) signal f(t) into a discrete-time signal f(nT) by sampling the continuous signal at equal intervals of time T. The sampling frequency F_s is the inverse of the sampling period T. According to the sampling theorem, no information is lost in this conversion as long as the continuous signal is band-limited and the sampling frequency F_s is at least twice the signal bandwidth. The term band refers to a band of frequencies; band-limited implies that the range of frequencies is limited; and bandwidth refers to the actual range of frequencies present in the signal. If this condition is not met, an error known as aliasing occurs. After this conversion, the signal can be represented and stored as a sequence of sample values, as shown in figure 2.1.1. The number of samples stored per second is the same as the sampling rate. Quantization is a process where each sample is given a discrete amplitude value and is represented by a digital "word," which has a certain number of bits. The higher the number of bits used, the lower the noise in the signal.


Figure 2.1.1 Analog signal – Digital signal conversion

Sampling rates in audio are measured in samples per second, or Hertz. Common sampling rates in audio are 8 kHz, 11.025 kHz, 16 kHz, 20 kHz, 32 kHz, 44.1 kHz, 48 kHz, 96 kHz, and 192 kHz. The audio signal must be band-limited to half the sampling frequency before sampling to prevent aliasing distortion. Some common word lengths in digital audio are 8 bits, 16 bits, 20 bits, and 24 bits. Reducing the sampling rate or the number of bits reduces the amount of data required to store the signal, but also implies that the digital representation will not be as accurate.
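To make the sampling and quantization steps concrete, here is a minimal sketch (an illustration added for this write-up, not code from the thesis; the 440 Hz tone, the 8 kHz rate, and the word lengths are arbitrary choices):

```python
import numpy as np

Fs = 8000          # sampling rate in Hz (assumed for illustration)
T = 1.0 / Fs       # sampling period
bits = 16          # word length

# Sample the continuous signal f(t) = sin(2*pi*440*t) at t = nT
n = np.arange(Fs)                       # one second of samples
f_nT = np.sin(2 * np.pi * 440 * n * T)

# Quantize each sample to a 16-bit word (values -32768..32767)
levels = 2 ** (bits - 1)
words = np.round(f_nT * (levels - 1)).astype(np.int16)

# Fewer bits -> coarser amplitude steps -> more quantization noise
coarse = np.round(f_nT * (2 ** 7 - 1)).astype(np.int8)   # 8-bit version
```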

2.2 FREQUENCY DOMAIN TRANSFORMATION

In order to analyze the behavior of a signal, it is often necessary to view it in the frequency domain. The best-known frequency domain transformation is the Fourier transform. The Fourier transformation and its properties in a purely analog sense are discussed first, followed by a discussion of the implications of using it in the digital domain. The Fourier transformation is defined as:

F(jw) = \int_{-\infty}^{+\infty} f(t)\, e^{-jwt}\, dt \qquad (2.2.1)

Here f(t) is the signal as a function of time in the time domain and F(jw) is the signal as a function of complex frequency in the frequency domain. Since F(jw) is complex, it has both magnitude and phase, which can be plotted separately. F(jw) is also known as the spectrum of the signal, since it reveals information about the frequency characteristics of the signal. The Fourier transform shown in equation 2.2.1 has limits between -∞ and +∞, which implies that the signal goes on for all time. In practice, this is neither possible nor convenient for normal calculations, which involve analyzing finite-length signals. In fact, small portions of the signal during a certain frame in time are usually extracted for analysis. This means that the signal being analyzed is time-limited, and the limits of integration can be changed from ±∞ to the time limits of the window being analyzed. This signal truncation causes a smearing or spreading effect in the spectrum of the windowed signal; this is called the windowing effect. If a small part of the signal is extracted over some frame in time, the extraction is equivalent to multiplying the actual signal by a "rectangular window" function whose value is 1 during the portion of time corresponding to the extracted signal and zero at all other times. In general, the signal can be multiplied by a number of different window functions, as given by the following equations:

f_w(t) = f(t)\, w(t) \qquad (2.2.2)

F_w(jw) = F(jw) * W(jw) \qquad (2.2.3)

Here, f(t) is the original signal, w(t) is the window function, and f_w(t) is the windowed signal. It can be shown that multiplication in the time domain is equivalent to convolution in the frequency domain, and vice versa. Using this result, it can be seen that the spectrum of the windowed signal is equal to the spectrum of the original signal convolved with the spectrum of the window function. This is significant because it implies that the spectrum of the window function plays a very important role in the spectrum of the windowed signal. In order to analyze the effects of the term W(jw) in equation 2.2.3, consider the case of a rectangular window:

w(t) = U_{-1}(t + T_w/2) - U_{-1}(t - T_w/2) \qquad (2.2.4)


Here U_{-1}(t + x) is a unit step signal with a step at time t = -x, and T_w is the width of the window in time. If it is assumed that the window is centered around t = 0,

W(jw) = \mathcal{F}\{w(t)\} = T_w\, \frac{\sin(wT_w/2)}{wT_w/2} \qquad (2.2.5)

The spectrum W(jw) of the rectangular window can be shown to be of the form sin(x)/x. This is otherwise known as the sinc function and is shown in figure 2.2.1a. The largest lobe is around the center and is known as the main lobe, and the smaller lobes on both sides of it are known as side lobes. There are various windows that can be used other than the rectangular window, such as the Hamming, Hann, and Blackman windows. Whereas the rectangular window puts an equal weight on all parts of the signal in the window (with its rectangular shape), these other windows apply different weights at different points in time during the window. Typically, the shape of a window is based on some commonly known function such as cos or cos². The spectra of all the window functions tend to resemble the sinc function; they differ mainly in the width of the main lobe and the suppression of the side lobes.

Ideally, it is desirable to have a window whose spectrum is as close to an impulse as possible, since convolving with an impulse leaves the original spectrum unchanged. Unfortunately, a window with an impulse spectrum in the frequency domain is equivalent to a window with value 1 for all time in the time domain, which in turn is equivalent to considering the signal for all time, which cannot be done. The moment a finite-length signal is considered for analysis, it means implicitly that the window is of finite length and therefore its spectrum is non-impulsive. This in turn means that the "true" spectrum of the signal is never seen, because the true spectrum is convolved with the non-impulsive window spectrum. The true spectrum refers to a spectrum where each frequency present in the signal is seen as an individual impulse in the frequency domain. The convolution operation centers the window spectrum at each of these frequencies, and the result of the operation is the sum of these centered window spectra.


In the case of a rectangular window, the window spectrum is a sinc function, as demonstrated earlier. Since the window spectrum has a non-zero bandwidth, there can be some overlap between these window spectra centered at various frequencies. This results in addition and cancellation of adjacent spectra, thus giving rise to a distorted version of the original spectrum.

In many applications, it is important to be able to extract the actual frequencies present in the signal during any given time frame. For a complex audio signal with multiple frequencies at any given time, windowing results in a smeared spectrum from which it is difficult to extract the exact frequencies present during that time frame. In this case, these frequencies can be found by extracting the peaks in the frequency spectrum. To do this successfully, it is desirable to have peaks that are as well defined as possible, so that they can be extracted using a peak detection algorithm. Choosing the right window function therefore becomes vital.

There are some factors to be considered before choosing a window function. For a window of length T_w seconds, the frequency separation f_z between the zero crossings of adjacent side lobes is 1/T_w Hz for most windows. It is preferable to have the main lobe width be as small as possible and the side lobes as suppressed as possible, to approximate an impulse spectrum. The width of the main lobe is 2·f_z for a rectangular window, which is relatively small, but the side lobe suppression is very poor: the maximum of the first side lobe is only 13 dB (decibels) below the maximum of the main lobe. Better side lobe suppression can be achieved at the cost of a wider main lobe by using other types of windows. For example, a Hamming window has side lobe suppression of 43 dB or more, but its main lobe is twice as wide as that of a rectangular window. Choosing a suitable window becomes a trade-off between the width of the main lobe and the suppression of the side lobes, and depends on the application. In general, the Hamming window tends to be popular due to its high side lobe suppression and relatively narrow main lobe.

Consider what happens when the Fourier transform is used on a digitized signal. A digital signal consists of a sequence of discrete values, which result from sampling the signal as discussed in chapter 2.1. Equation 2.2.1 refers to a continuous signal. To apply a similar transform to a discrete signal f(nT):

F(jw) = \sum_{n=-N/2}^{N/2-1} f(nT)\, e^{-jwnT} \qquad (2.2.6)

Equation 2.2.6 is used for decomposing an N-point signal into its frequency components. Since it is complex, it provides magnitude as well as phase information at each frequency. It can be seen from equation 2.2.6 (and equation 2.2.1, for that matter) that the Fourier transform is basically a correlation function, which correlates the signal f(nT) (or f(t)) with a sinusoid of a given frequency w by multiplying the two signals and finding the total energy in the output signal. When this is done for a set of frequencies, the F(jw) corresponding to each of those frequencies can be found using equation 2.2.6 by substituting each frequency for w. If an infinite number of frequencies are used to decompose a time-limited signal, the Discrete Time Fourier Transform (DTFT) is obtained, which is a continuous function, as shown in figure 2.2.1.
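The correlation view of equation 2.2.6 can be written out directly (a sketch for illustration, not thesis code; the 1 kHz test tone and the frequency grid are arbitrary choices):

```python
import numpy as np

Fs = 8000
T = 1.0 / Fs
N = 512
n = np.arange(-N // 2, N // 2)          # n = -N/2 ... N/2 - 1
f = np.sin(2 * np.pi * 1000 * n * T)    # discrete signal f(nT)

def F_of(w):
    # Equation 2.2.6: correlate f(nT) with a complex sinusoid at w
    return np.sum(f * np.exp(-1j * w * n * T))

# Evaluate the DTFT on a dense grid of frequencies up to Fs/2;
# the magnitude peaks near the 1 kHz tone
freqs_hz = np.linspace(0, Fs / 2, 2000)
dtft = np.array([F_of(2 * np.pi * fr) for fr in freqs_hz])
magnitude = np.abs(dtft)
```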

In the digital domain, there are no continuous functions; only discrete samples can be stored. So this spectrum (the DTFT) needs to be sampled at certain frequency values. The question is how often this spectrum needs to be sampled so that the signal can be fully represented. It can be shown that for an N-point signal, this spectrum needs to be sampled only at N equally spaced frequencies between 0 Hz and F_s Hz to fully represent the signal and satisfy the condition known as perfect reconstruction (Rabiner & Schafer, 1978). When N equally spaced frequencies between 0 and F_s are used to decompose an N-point signal, the Discrete Fourier Transform (DFT) is obtained, which is the sampled version of the DTFT. The frequencies at which the spectrum is sampled are known as frequency bins. Any time-limited signal that is a pure sinusoid has a spectrum of a sinc function nature, with its main lobe centered at its actual frequency. If an N-point DFT analysis is performed on this signal, the frequency bins are fixed at F_s·n/N, n = 0, …, N-1. Varying spectra from the DFT analysis can be seen, depending on how close the actual frequency of the signal is to a frequency bin.

Figure 2.2.1 – DTFT of a rectangular pulse

(Taken from Proakis & Manolakis, 1988)

As an example to demonstrate this, consider figure 2.2.2. All three plots show the continuous DTFT function as well as the sampled points, which give the DFT of a pure sine wave with a rectangular window applied to it. All three plots show the sinc characteristics of the spectrum (DTFT). In the discrete world, where the spectrum is calculated only for certain frequency bins, an interesting thing happens. If the frequency of the sinusoid happens to line up exactly with a frequency bin, all the zero crossings of the sinc function line up exactly with frequency bins. The result is that significant energy is seen at the bin corresponding to the frequency of the signal, and all other bins appear to have zero energy. This is shown in figure 2.2.2a. On the other hand, if the frequency does not line up with a bin, the energy seen in the other bins is as shown in figure 2.2.2b. This phenomenon is known as spectral spreading, and figure 2.2.2c shows the worst case of spectral spreading, where the frequency of the sine wave is exactly halfway between two bins.


Though in all three cases the discrete spectrum is accurate, in the sense that the condition for perfect reconstruction is met, for the purposes of analysis the case shown in figure 2.2.2a is the most useful, since it appears as though there is only one frequency present in the signal. Additionally, the peak frequency bin in this case corresponds exactly to the frequency of the signal. The case shown in figure 2.2.2c can lead to confusion during analysis, since it appears that there are other frequencies present in the spectrum. Besides this, the frequency bin corresponding to the highest energy does not correspond exactly to the actual frequency of the signal, because the actual frequency is somewhere between two adjacent bins.

Figure 2.2.2 (a) DFT of a windowed signal with its frequency centered on a bin, (b) DFT of a windowed signal with its frequency not on a bin, and (c) DFT of a windowed signal with its frequency halfway between two bins

(Taken from Lindquist, 1989)

The issue of breaking up a signal into its sinusoidal components using the DFT has been discussed above. In this case, the basis functions (the types of components that the signal is being broken up into) are sinusoidal. There are other discrete transforms that can be used, which have a different set of basis functions (e.g., square waves) and break the signal up into a different set of fundamental components. In general, discrete transforms have the form

F(m) = \sum_{n=1}^{N} K(m, n)\, f(n), \quad m = 1, \ldots, M \qquad (2.2.7)

f(n) = \sum_{m=1}^{M} K^{-1}(n, m)\, F(m), \quad n = 1, \ldots, N

(Taken from Lindquist, 1989)

Here, K(m, n) is known as the kernel and K^{-1}(n, m) is known as the inverse kernel, or the set of basis functions which constitute the transform. If the kernel set is complete (for perfect reconstruction), any f(n) can be expressed as the linear sum of the weighted basis functions. This is shown in matrix form in equation 2.2.8, where the K and J matrices are inverses of each other; equation 2.2.7 and equation 2.2.8 are equivalent.

\mathbf{F} = \mathbf{K}\, \mathbf{f}, \qquad \mathbf{f} = \mathbf{J}\, \mathbf{F}, \qquad \mathbf{J} = \mathbf{K}^{-1} \qquad (2.2.8)

In the case of the DFT, the kernel set is complete if an N-point signal is decomposed with a set of at least N frequencies equally spaced between DC and F_s. The DFT can otherwise be expressed as:

F(m) = \sum_{n=-N/2}^{N/2-1} f(n)\, e^{-j2\pi nm/N} = \sum_{n=-N/2}^{N/2-1} f(n)\, W_N^{nm} \qquad (2.2.9)

where W_N = e^{-j2\pi/N}.

This can be shown in matrix form in equation 2.2.10; only the power (the exponent nm of W_N^{nm}) varies from entry to entry of the transformation matrix, or kernel.

\mathbf{F} = \left[\, W_N^{nm} \,\right] \mathbf{f} \qquad (2.2.10)

It can be seen that each row of the transformation matrix corresponds to a particular frequency. Also, to decompose an N-point signal into N frequencies, N² complex multiplies are required. Because of certain symmetries that the DFT matrix exhibits when N equally spaced frequencies are chosen, the resulting transformation matrix has many redundancies, and by using decimation techniques the number of complex multiplies can be reduced to N log₂ N. This implementation is known as the Fast Fourier Transform (FFT) and is usually the method used to calculate the DFT of a signal in commonly available software. The FFT can be calculated relatively quickly, in real time, on today's digital signal processing hardware.
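The matrix view can also be illustrated in code (a sketch added for this write-up; note that numpy indexes n and m from 0 to N-1 rather than the symmetric limits used above):

```python
import numpy as np

N = 8
n = np.arange(N)
W = np.exp(-2j * np.pi / N)         # W_N = e^{-j 2 pi / N}

# Kernel matrix: entry (m, n) is W_N^{nm}; one row per frequency bin
K = W ** np.outer(n, n)

f = np.random.randn(N)              # arbitrary N-point signal
F_matrix = K @ f                    # direct form: N^2 complex multiplies
F_fft = np.fft.fft(f)               # FFT: on the order of N log2 N

assert np.allclose(F_matrix, F_fft)
```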

The FFT, however, has the same disadvantages as the DFT. The sinusoids used to decompose the signal are a function of the sampling frequency as well as the number of frequency bins desired; without modifying these two, the frequencies are not selectable. A simple sine wave whose frequency does not fall on one of the frequency bins will produce a spectrum with energy spread across many frequencies. This phenomenon, known as spectral spreading, was discussed earlier. Figure 2.2.3 illustrates this point. Figure 2.2.3a shows the 1024-point FFT of a 500 Hz sinusoid sampled at 8000 Hz. Since one of the frequency bins used in this decomposition is exactly at 500 Hz, the FFT shows almost all its energy concentrated at that one bin. This is because all the zero crossings of the sinc function in the spectrum line up exactly with frequency bins. On the other hand, figure 2.2.3b shows the FFT of a sine wave with frequency 503.9 Hz. This frequency happens to be exactly halfway between two frequency bins (500 Hz and 507.8125 Hz), with the result that the FFT shows significant energy present in many other bins. In a strict sense this is correct, because if sinusoids with the magnitudes and phases indicated by the FFT are generated and added up, the original signal is obtained again. But it would be much more useful to know that there is a single frequency component at 503.9 Hz rather than many frequency components at different bins, as indicated by the FFT.

Figure 2.2.3 (a) FFT of a 500 Hz sinusoid and (b) FFT of a 503.9 Hz sinusoid
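The bin-alignment effect is easy to reproduce (a sketch using the same parameters the text quotes: 1024 points at 8000 Hz, so the bins fall at multiples of 7.8125 Hz):

```python
import numpy as np

Fs = 8000
N = 1024
t = np.arange(N) / Fs

on_bin = np.sin(2 * np.pi * 500.0 * t)       # 500 Hz = bin 64 exactly
off_bin = np.sin(2 * np.pi * 503.90625 * t)  # halfway between bins 64 and 65

S_on = np.abs(np.fft.rfft(on_bin))
S_off = np.abs(np.fft.rfft(off_bin))

def leakage(S):
    # Fraction of spectral energy that falls outside the peak bin
    k = np.argmax(S)
    return 1.0 - S[k] ** 2 / np.sum(S ** 2)

# Essentially zero for the on-bin tone; substantial for the off-bin tone
print(leakage(S_on), leakage(S_off))
```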

Also, the FFT produces good results only for signals that tend to stay stationary. For signals that vary rapidly with time, the FFT is not able to show the finer details of what is happening within the analysis time frame. Figure 2.2.4a shows a typical audio signal in the time domain and figure 2.2.4b shows its FFT. Although the FFT accurately shows that most of the energy in the signal is present at lower frequencies, there are many finer details in the time domain which it is unable to represent. The FFT only represents the average behavior of each frequency in this time frame or window. For this reason, it is common to study an audio signal using a series of short windows in time, performing Fourier analysis on each of them. This is known as the Short Term Fourier Transform (STFT).

Figure 2.2.4 (a) Typical audio signal in time and (b) its FFT

The difference between using a long window and a short window is a trade-off between good spectral resolution and good time resolution. This is explained by the Uncertainty Principle (Cohen, 1995), which states the well-known mathematical fact that a narrow waveform yields a wide spectrum and a long waveform yields a narrow spectrum. Ackroyd states, "if the effective bandwidth of a signal is W then the effective duration cannot be less than about 1/W (and conversely)…." In fact, the time-bandwidth product is a constant.

What this means in signal processing terms is that the smaller the duration of the window used to analyze the signal, the larger the effective bandwidth of the section of the signal under analysis. As an example, if a windowed portion of a pure sinusoid is under analysis, it was shown earlier that its spectrum has a sinc function nature with the main lobe centered at the frequency of the sinusoid. The widths of the main lobe and the side lobes are inversely proportional to the length of the window: the longer the analysis window in time, the narrower the lobes become, and conversely, the shorter the analysis window, the wider the lobes become. This is especially significant with respect to audio signals, which have multiple frequencies present in a given time frame. The longer the analysis window, the narrower the lobes for each separate frequency become, thereby decreasing the interaction between adjacent frequency peaks. So, with long analysis windows, good spectral resolution is obtained with well-defined frequency peaks, but the time resolution is poor, since the spectrum tends to reflect the averaged value over a long period of time for each frequency. When short analysis windows are used to increase time resolution, the spectral resolution is decreased due to the larger lobe width, leading to increased interaction. In speech and audio processing, it is common to use window lengths between 5 ms and 30 ms. The choice of the type of window used depends on how important side lobe suppression is compared to a narrow main lobe width, since this is where the trade-off lies for different types of windows.

2.3 TIME-FREQUENCY ANALYSIS

Audio signals typically vary with both time and frequency. In order to accurately assess what the frequency components of the signal are, and also how they vary with time, it is necessary to perform an analysis that shows a time-frequency distribution.

As seen in the previous section, the FFT of a relatively long signal in time tends to average out the effects of various frequencies within that block. In other words, very little information is obtained about the effect of frequency components that are changing rapidly within that block of time. To reduce this effect, the signal is divided into smaller blocks in time and each of these is analyzed separately. The idea is to reduce the size of the block to a small enough length in time that the frequency information within the block is almost stationary. Once each block is analyzed in this way, the Fourier transformations of the blocks can be lined up in time to get a better representation of the time-frequency behavior of the signal. To demonstrate the advantage of this method, consider figure 2.3.1. Figure 2.3.1a shows a chirp signal, which is a sinusoid whose frequency increases linearly in time. If a Fourier transform is performed on this signal, the result is the spectrum shown in figure 2.3.1b.


Figure 2.3.1 (a) Chirp signal in time and (b) its FFT

This shows a nearly flat spectrum, which is the expected behavior for the Fourier transform, since it indicates that all frequencies have equal energy content within the signal. But it does not indicate that the signal has a frequency increasing linearly in time. The spectrum is actually similar to that of white noise, as well as that of an impulse. In order to understand how the analyzed signal is different from white noise or an impulse, a time-frequency distribution as shown in figure 2.3.2 is desirable.

The distribution shown in figure 2.3.2 shows frequency plotted as a function of time, with intensity represented in gray scale. This figure shows the ideal time-frequency distribution of the chirp signal, because it shows frequency increasing linearly with time with equal intensity at all times. It is not possible to obtain such an ideal time-frequency distribution using current analysis techniques, because the distribution shown implies perfect resolution in both time and frequency, which violates the uncertainty principle (Cohen, 1995). When using Short Time Fourier Transforms to calculate the time-frequency distribution of a signal, it is common to use a series of windows to analyze the signal. The choice of how short or long each window is, the type of window, and the amount of overlap used depends on the type of signal analyzed (e.g., voice, music), the application (e.g., real time, non-real time), and the computational power available (e.g., memory, MIPS).

Figure 2.3.2 Contour plot for the ideal time-frequency distribution of a chirp signal

There are many existing time-frequency distributions. Although it is not possible to obtain the perfect time-frequency distribution of figure 2.3.2, these distributions approach the ideal case. The spectrogram is the most basic time-frequency distribution. It is calculated by performing Short Term Fourier Transforms on consecutive windows and combining the results to give a plot similar to the one in figure 2.3.3, which shows the spectrogram of the chirp signal discussed earlier. The domain is the time axis, the image is the frequency axis, and the magnitude is grayscale. It provides a much better analysis of the chirp signal than the FFT performed over its entire length (shown in figure 2.3.1b). The spectrogram accurately reveals that the frequency of the signal increases linearly with time, but it also shows a lot of spectral spreading at any given point in time, due to the windowing effects discussed in the previous section.

There are other methods, such as the Wigner distribution, the Choi-Williams distribution, and wavelet analysis, which can be used to produce different types of time-frequency distributions. These are not discussed, since they are not relevant to this study.


Figure 2.3.3 Spectrogram of Chirp Signal
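A spectrogram of a chirp along these lines can be sketched with scipy (an illustration for this section, not the thesis code; the chirp range and the 25 ms Hamming window with 50% overlap are arbitrary choices):

```python
import numpy as np
from scipy import signal

Fs = 8000
t = np.arange(0, 2.0, 1 / Fs)
x = signal.chirp(t, f0=0, t1=2.0, f1=3000)   # frequency rises 0 -> 3 kHz

# STFT on consecutive windowed blocks, combined into a spectrogram
f, tau, Sxx = signal.spectrogram(
    x, fs=Fs,
    window='hamming',
    nperseg=200,         # 25 ms window
    noverlap=100,        # 50% overlap
)
# Sxx[k, m] is the energy at frequency f[k] during frame tau[m]; the
# ridge of maximum energy traces the linearly rising chirp frequency.
```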

2.4 FILTER BANKS

Filter banks are used to break up a signal into a desired number of frequency bands. By doing this, the flexibility to adjust certain parameters used in the analysis is retained, depending on which band is being analyzed. MPEG algorithms use filter banks to split the signal into bands that resemble the human ear's critical bands (chapter 3.4) and then apply a psychoacoustic model to each band to detect and discard redundant data.

Filter banks are not simply banks of band-pass filters that break the signal up into the required bands. If this were the case, redundancy would be added to the data. As an example, consider an N-point signal sampled at F_s, which is to be broken up into 10 bands of equal width. If a bank of simple band-pass filters is used to do the job, the result is 10 N-point signals, each sampled at F_s. This would mean a total of 10×N points but no more information than before, because each band uses only 1/10th of the total spectral bandwidth available to represent it. Consider the lowest frequency band, which is being highly over-sampled at F_s: it only has information between 0 and π/10. According to the Nyquist theorem, it can easily be sampled at a rate of F_s/10 without any aliasing distortion, thereby reducing the data by a factor of 10 for that band. Similarly, all the other bands can also be down-sampled by a factor of 10 without losing any information, resulting in the total of N points that was started with. How this is achieved using filter banks is now explained.

Filter banks are multirate systems that not only divide the signal into the required number of frequency bands, but also change the sampling rate of each band depending on its frequency bandwidth (Vaidyanathan, 1993). Filter banks use decimation (or down-sampling) during analysis, and expansion (or up-sampling, or interpolation) during re-synthesis. An M-fold decimator reduces the sampling rate by a factor of M and an L-fold expander increases the sampling rate by a factor of L. These are shown in figures 2.4.1(a) and (b).

Figure 2.4.1 (a) – M-fold decimator (b) – L-fold expander

(Taken from Vaidyanathan, 1993)

In this study, the filter bank is only used for analysis and not reconstruction, so the main concern is decimation. The M-fold decimator simply outputs every Mth sample of the signal and discards all samples in between. This is shown in figure 2.4.2 and is equivalent to reducing the sampling rate by a factor of M. Mathematically, this is represented as shown in equation 2.4.1.


Figure 2.4.2 – Demonstration of decimation for M=2

(Taken from Vaidyanathan, 1993)

y_D(n) = x(Mn) \qquad (2.4.1)

The decimator not only changes the sampling rate of the signal but also its spectrum. The relation between the spectrum of the decimated signal y_D(n) and that of the original signal x(n) is as follows:

Y_D(e^{jw}) = \frac{1}{M} \sum_{k=0}^{M-1} X\!\left(e^{j(w - 2\pi k)/M}\right) \qquad (2.4.2)

(Taken from Vaidyanathan, 1993)

This can be graphically interpreted as follows: (a) stretch X(e^{jw}) by a factor of M to obtain X(e^{jw/M}); (b) create M-1 copies of this stretched version by shifting it uniformly in successive amounts of 2π; and (c) add all these shifted and stretched versions to the unshifted version X(e^{jw/M}), and divide by M. This is shown in figure 2.4.3.


Figure 2.4.3 Demonstrating the frequency domain effect of decimation with M=3

(Taken from Vaidyanathan, 1993)
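Equation 2.4.1 and the aliasing condition can be demonstrated numerically (a sketch added for illustration; the two-tone test signal is an arbitrary choice):

```python
import numpy as np

Fs = 8000
M = 2
n = np.arange(1024)
# Two tones: 1 kHz (below the new Nyquist of 2 kHz) and 3 kHz (above it)
x = np.sin(2 * np.pi * 1000 * n / Fs) + np.sin(2 * np.pi * 3000 * n / Fs)

y_D = x[::M]                     # equation 2.4.1: y_D(n) = x(Mn)

Y = np.abs(np.fft.rfft(y_D))
freqs = np.fft.rfftfreq(len(y_D), d=M / Fs)

# After decimation the sampling rate is 4 kHz. The 1 kHz tone stays at
# 1 kHz, but the 3 kHz tone aliases down to 4000 - 3000 = 1000 Hz and
# lands on top of it -- which is why each band must be band-limited
# (low- or high-pass filtered) before it is down-sampled.
print(freqs[np.argmax(Y)])       # ~1000 Hz
```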

This means that a signal can be down-sampled by any required factor, but certain conditions have to be fulfilled to avoid the aliasing effects which occur due to this stretching and adding of spectra. Consider the case where a signal of sampling rate F_s is low-pass filtered and high-pass filtered at π/2 to yield two signals, both of which retain the original sampling rate of F_s. The effects of down-sampling both signals by a factor of M=2 are now analyzed. Down-sampling by a factor of 2 is equivalent to reducing the sampling rate by a factor of 2. The low frequency band signal must have a band limit at π/2, because its spectrum gets stretched by a factor of 2 and added to shifted versions of itself centered at multiples of 2π. If the original low-passed signal has a bandwidth of more than π/2, the new signal will stretch beyond π and get added to the shifted versions, thereby resulting in aliasing. Even the high-passed signal can be down-sampled by a factor of 2, as long as its lower frequency limit exceeds π/2. Again in this case, its spectrum gets stretched and added to shifted versions of itself. As long as the original signal is band-limited to between π/2 and π, no aliasing occurs. All that happens is that its spectrum in the high frequency region between π/2 and π gets mirrored between 0 and π/2 and stretched, so that it lies between 0 and π in the new signal. As an example, if a chirp signal rising from 2 kHz to 4 kHz, originally sampled at 8 kHz, is down-sampled by a factor of 2, the result is a chirp signal with frequency falling from 2 kHz to DC. There is no aliasing or information loss, but the resulting signal does not sound at all like the original. The original signal can be recovered, though, by using a system for conversion, which is discussed in later chapters.

Figure 2.4.4 – Analysis filter bank

(Taken from Kotvis, 1997)

A typical filter bank uses the technique described above to divide the signal into the required number of bands. Figure 2.4.4 shows an analysis filter bank implementation where the signal is divided into 10 octaves. Assume that the signal x(n) has N points, is sampled at F_s, and is band-limited to F_s/2. Here H_1 and H_0 are complementary high-pass and low-pass filters, both of which have cutoff frequencies at π/2, and both are followed by a down-sampling operator. At the first stage, the cutoff frequency of π/2 corresponds to F_s/4. The high-passed version of the signal, which represents content between F_s/4 and F_s/2, is down-sampled by a factor of 2 to yield y_10(n), which contains only N/2 points after the down-sampling operation. The low-passed signal is also down-sampled; it contains frequency information between 0 and F_s/4 and has N/2 points, with an effective sampling rate of F_s/2. This signal is again filtered using the same pair of filters as before, but in this case the cutoff frequency of π/2 corresponds to F_s/8, since the signal has already been down-sampled once. Now the high-passed version of this signal, containing frequency information between F_s/8 and F_s/4, is again down-sampled to yield y_9(n), which contains only N/4 points. This process is continued until the signal is successfully divided into 10 bands. Each band contains only half the number of points of the previous band due to down-sampling (except the last band, y_1). The total number of points added up over all the signals y_1 to y_10 is N.
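A minimal version of such an octave analysis bank might look as follows (a sketch only, not the thesis implementation; the half-band FIR pair designed with scipy's firwin is an assumption, and filter edge effects are ignored):

```python
import numpy as np
from scipy import signal

def octave_analysis(x, num_bands=10, taps=101):
    """Split x into num_bands octave bands, highest band first."""
    # Low-pass prototype with cutoff at pi/2, and its high-pass
    # complement obtained by modulating with (-1)^n
    h0 = signal.firwin(taps, 0.5)
    h1 = h0 * np.cos(np.pi * np.arange(taps))
    bands = []
    low = x
    for _ in range(num_bands - 1):
        high = signal.lfilter(h1, 1.0, low)[::2]  # filter, decimate by 2
        low = signal.lfilter(h0, 1.0, low)[::2]
        bands.append(high)                        # y_10, y_9, ...
    bands.append(low)                             # final low band y_1
    return bands

x = np.random.randn(2 ** 12)
ys = octave_analysis(x)
print([len(y) for y in ys])   # N/2, N/4, ..., last two bands equal length
```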

It is important to make a proper choice of the pair of filters H_1 and H_0. They need to have narrow transition bands and cutoff frequencies adjusted to minimize aliasing effects. In chapter 3, a similar filter bank to the one shown above is used, and the exact parameters of the filter bank and the filters are described.

2.5 SYNTHESIS TECHNIQUES

Sound synthesis is the generation of a signal that creates a desired acoustic sensation (Dodge & Jerse, 1997). In computer music, there are various synthesis techniques used for generating sounds that imitate real instruments. In fact, in computer music the term 'instrument' refers to an algorithm that realizes or performs a musical event. In this study, analysis-based synthesis techniques are used. Since the aim is to obtain certain parameters from the analysis and then re-synthesize the music based on these parameters, a brief review of some synthesis techniques is presented in the following sections.

The Oscillator

The unit generator that is fundamental to almost all computer sound synthesis is called the oscillator. An oscillator generates a periodic waveform that can be of various types, such as sinusoidal, square, or sawtooth. The controls applied to an oscillator determine the amplitude, frequency, and phase of the waveform it produces. A flowchart symbol for an oscillator with its various controls is shown in figure 2.5.1.

Figure 2.5.1 Flow chart symbol for an Oscillator

(Taken from Dodge & Jerse, 1997)

Additive Synthesis

For a certain tone, if the spectral components and their magnitudes are known, then each of these components can be modeled using an oscillator with the appropriate frequency and phase functions as well as an amplitude envelope. In this method, each partial of the desired tone is represented by one oscillator with the above-mentioned parameters; adding up the output from each oscillator can then generate the desired tone. This is known as additive synthesis and is shown in figure 2.5.3. The amplitude and frequency parameters of the real tone can be obtained by using some of the (time-frequency) techniques mentioned earlier in this chapter. The name Fourier re-composition is sometimes used to describe synthesis from analysis, because it can be thought of as the reconstitution of the time-varying Fourier components of the sound. Additive synthesis has proved capable of generating tones that are virtually indistinguishable from the original tone, even by trained musicians. The only problem is that sometimes a large number of oscillators is required to generate a given tone.


Figure 2.5.3 Basic configuration for additive synthesis

(Taken from Dodge & Jerse, 1997)
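A bare-bones oscillator bank for additive synthesis might look like this (a sketch; the three-partial specification and the ad hoc attack/decay envelopes are invented for illustration, not taken from any analysis in this thesis):

```python
import numpy as np

Fs = 44100
dur = 1.0
t = np.arange(int(Fs * dur)) / Fs

def envelope(attack, decay):
    # Simple linear attack followed by exponential decay
    env = np.minimum(t / attack, 1.0)
    return env * np.exp(-decay * t)

# One (frequency, amplitude, envelope) triple per partial -- one
# oscillator for each partial of the desired tone
partials = [
    (440.0, 1.00, envelope(0.01, 3.0)),
    (880.0, 0.50, envelope(0.02, 4.0)),
    (1320.0, 0.25, envelope(0.03, 5.0)),
]

# Sum the oscillator outputs to form the tone
tone = sum(a * env * np.sin(2 * np.pi * f * t) for f, a, env in partials)
```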

Synthesis Using Spectral Interpolation

The techniques described above can be used to recreate very natural sounding tones. The parameters of any tone being analyzed, though, are constantly changing, and therefore the parameters of each oscillator need to keep changing with each frame in time. If the signal is synthesized frame by frame, based on the parameters for each frame, and the synthesized frames are then simply arranged in the correct order, there may be discontinuities in the synthesized signal at frame boundaries. One way to solve this problem is to cross-fade between frames. Another possible method is to use simple linear interpolation for each parameter from frame to frame. Linear interpolation works well for amplitude envelopes, but can create discontinuities in the signal if used directly for frequency. A solution for the synthesis problem is presented in later chapters.


2.6 CLASSICAL AND MODERN STUDIES OF TIMBRE

The classical theory of timbre is based on the Helmholtz model. Hermann von Helmholtz laid the foundation for the modern theory of timbre in his 19th-century work, On the Sensations of Tone. He characterized tones as consisting of several waveforms of different frequencies enclosed in an amplitude envelope consisting of three parts: the attack, steady state, and decay portions. An interesting part of this study was his conclusion that all the partials (spectral peaks) in a tone have the same attack, steady state, and decay times. Modern studies show that this is usually not the case.

Jean-Claude Risset, in his 1966 work, Computer Study of Trumpet Tones, employed an FFT-based algorithm to gain information about the spectral characteristics of a trumpet tone (Dodge & Jerse, 1997). Whereas Helmholtz and other researchers before him had applied a Fourier transform to the steady-state section of the tone, Risset applied a series of windows to his data and performed the FFT on each window. Thus, he was able to get more accurate time-frequency information. He used windows between 5 and 50 ms for his analysis, and what he found was very different from what Helmholtz had concluded about the fundamental frequency and all its partials having the same amplitude envelope. He found that the partials did not have the same amplitude envelope as the fundamental, and that even their frequencies varied with time. Figure 2.6.1 shows the amplitude progressions of the partials of a trumpet tone.

It can be seen from this figure that the higher harmonics attack last and decay first. Additionally, each harmonic has fluctuations in frequency during the course of the tone (especially erratic during the attack), quite similar to a vibrato effect, and re-synthesis of the tone without these fluctuations produces a discernible change in the character of the tone. John Chowning and Michael McNabb have demonstrated the importance of synthesizing the fluctuations in frequency of the various partials for the output to be perceived as a fused single tone.


Figure 2.6.1 Amplitude progressions of the partials of a trumpet tone

(Taken from Dodge & Jerse, 1997)

2.7 THE PROPOSED SCHEME

The aim of this study is to develop an alternate representation for audio signals in the form of a time-frequency matrix. Among the many applications discussed earlier, the one explored in this study is audio compression. In most cases, audio compression is achieved by removing irrelevant or redundant data from the music. Usually the original music is preserved in some way, but some model is used to provide variable bit allocation. Since the objective of this study is to form a time-frequency matrix, the signal first has to be divided into segments of time which are so small that the signal can be assumed to be almost stationary within each segment or frame. A frequency analysis is then performed on each frame, and the frequencies present in that frame, along with their corresponding magnitudes, are extracted. If a similar analysis is performed on all the frames that the signal is divided into, the spectral characteristics of each frame in time can be obtained. This information can be arranged in two matrices of the same size, in each of which every column represents a frame in time. The first is the "frequency" matrix and the second is the "magnitude" matrix. Each column of the frequency matrix represents a frame and contains the frequencies that were found in that frame; the nth column of the first matrix contains the frequencies of the nth frame. The elements of the magnitude matrix are the magnitudes of the corresponding elements (frequencies) in the frequency matrix. Together, the two matrices form a sort of 3-D time-frequency representation of the signal, which contains a greatly reduced amount of data compared to the signal stored in the form of samples. The signal can later be re-synthesized using the data in these matrices.
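In code, the pair of matrices might be assembled as follows (a sketch of the general idea only; the thesis uses the non-orthogonal, adaptive analysis of chapters 4 and 5, whereas this illustration substitutes a plain windowed FFT with scipy peak picking):

```python
import numpy as np
from scipy.signal import find_peaks

def time_frequency_matrices(x, Fs, frame_len=512, peaks_per_frame=10):
    """Return (freq_matrix, mag_matrix); column m describes frame m."""
    num_frames = len(x) // frame_len
    freq_m = np.zeros((peaks_per_frame, num_frames))
    mag_m = np.zeros((peaks_per_frame, num_frames))
    bin_hz = Fs / frame_len
    for m in range(num_frames):
        frame = x[m * frame_len:(m + 1) * frame_len] * np.hamming(frame_len)
        spec = np.abs(np.fft.rfft(frame))
        peaks, _ = find_peaks(spec)
        top = peaks[np.argsort(spec[peaks])[-peaks_per_frame:]]
        k = len(top)
        freq_m[:k, m] = top * bin_hz        # frequencies found in frame m
        mag_m[:k, m] = spec[top]            # their magnitudes
    return freq_m, mag_m
```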

For reasons that are described more specifically later, non-orthogonal signal decomposition is used for performing the spectral analysis on each frame; this is described in chapter 4. This method is only the basis for developing an adaptive algorithm that detects the frequency peaks and their magnitudes in each frame and stores them in the form of a time-frequency matrix as described above. This algorithm is discussed in chapter 5. The method is flexible enough to allow a psychoacoustic model to be used; some fundamental psychoacoustic phenomena are described in chapter 3. The second part of this study comprises re-synthesizing the music using the extracted parameters. Since conventional methods prove to be inadequate, a frequency interpolation algorithm based on a cubic equation is used for this re-synthesis. This is described in chapter 6. Chapter 7 presents the results of this experiment along with the conclusions.


Chapter 3 – Psychoacoustics

Introduction

Psychoacoustics is a branch of psychophysics, which deals with the relationship between

acoustic stimuli and auditory sensation. It addresses the question “why we hear what we

hear” when we are exposed to a given acoustic stimulus. There are certain psychoacoustic

phenomena that occur when human beings hear sound that have been studied intensively and form the basis of the increasingly popular “perceptual coders.” Most of these coders have psychoacoustic models that analyze the signal and detect those parts of it that psychoacoustic research over the years has shown the human ear cannot perceive.

These coders then either discard this irrelevant data or code it at a very low bit rate,

thereby achieving audio compression. This chapter starts with a review of auditory

analysis in general and then presents some relevant psychoacoustic phenomena.

3.1 AUDITORY ANALYSIS

A comprehensive theory of auditory analysis did not exist before Helmholtz’s work. The

second chapter of Sensations of Tone (Helmholtz, 1863) contains a summary of what

Helmholtz considered to be the main problems of auditory analysis. He pointed out that

people have no difficulty in following the individual instruments at concerts or directing

their attention at will to the words of a speaker. It follows that different trains of sound

can be propagated without mutual disturbance and that the ear can break down a complex

sound into its constituent elements. For an explanation of how this is done, he borrowed

from Ohm’s law on hearing, which states that the ear separates a complex sound into its sinusoidal components, much as mathematical analysis does. Helmholtz made the


important addition that other components such as difference tones, which are not

physically present in the stimulus, are the result of nonlinearity in the ear. He added that

the sensation of sound was due to the stimulation of nerves in the ear and every

discriminable pitch corresponded to a particular nerve or set of nerves.

Modern day research shows that much of what Helmholtz had assumed was true, but

there are inconsistencies that are yet to be resolved. The ear is divided into the outer ear,

the middle ear and the inner ear. The outer ear consists of the pinna and the external

canal, the middle ear contains the eardrum, which leads to the entrance (oval window) of

the inner ear. The inner ear consists of the spiraling, marble sized cochlea, whose

boundaries form the basilar membrane. Vibrations on the basilar membrane are picked up

by hair cells, which form a lining on it and are then transmitted to neurons, which pass

the signal to the brain. The basilar membrane has a stiffness coefficient that varies from

very high near the entrance of the inner ear to very low near the end. This results in

regions of resonance depending on the stiffness, which are stimulated by their respective

resonant frequencies. Thus, depending on the frequencies present in an audio signal,

different regions of the basilar membrane may be stimulated, each giving rise to the

sensation of a particular pitch. These facts are well known now and can explain how the

sensation of the pitch of a pure tone occurs. However they do not satisfactorily explain

phenomena such as difference tones.

3.2 FREQUENCY RESOLUTION OF THE HUMAN EAR

There is a natural limit to an individual’s ability to establish a relative order of pitch when

two tones (of the same intensity) are presented one after another. When the difference in

frequency between the two tones is very small, both tones are judged as having the same

pitch. This “difference limen” is known as the just noticeable difference (JND) in

frequency (Roederer, 1995). If the variation between the two tones exceeds the JND, a

change of pitch is detected. The degree of sensitivity to pitch changes or “frequency

resolution capability” depends on the frequency, intensity and duration of the tone in


question and on the suddenness of frequency change. Figure 3.2.1 shows the JND in

frequency plotted against the frequency of the pure tone in question for a typical human

subject, when the pure tone is varied slowly.

Figure 3.2.1 – JND in frequency of a pure tone

(Taken from Roederer, 1995)

It is interesting to note that below 500Hz, the resolution is constant at around 3 Hz and

starts rising only at higher frequencies. The dotted lines indicate the JND in frequency as

a percentage of the frequency of the tone in question, and it can be seen that in this

respect the JND is large at low frequencies. This is why bass guitarists prefer to tune to the harmonics on their bass guitars: the frequency of the fundamental note is too low for good frequency resolution in the ear.


3.3 MASKING

The phenomenon of “masking,” which occurs in human hearing, is commonly used by

perceptual coders such as MPEG. Masking is the process by which one sound at a lower

sound pressure level (known as the maskee), is rendered inaudible to the human ear due

to the presence of another sound at a higher sound pressure level (known as the masker),

which is presented either simultaneously or offset by a small time difference.

Extensive studies have been done to find out masking thresholds and to find out how the

frequency difference between the two sounds affects the masking thresholds. The

phenomenon of simultaneous masking is discussed in this chapter since it is relevant to

the study.

Figure 3.3.1 – Curves of equal loudness

(Taken from Roederer, 1995)

It is interesting to note that human beings are generally more sensitive to certain

frequency ranges than other frequency ranges. Tones of equal sound pressure level (SPL)

but different frequencies are judged as having different loudnesses. Thus SPL is not a


good measure of loudness when comparing tones of different frequencies. Experiments

have been performed to establish curves of equal loudness, taking the SPL at 1000Hz as

the reference quantity. These are shown in figure 3.3.1 and it can be seen that except at

very high sound pressure levels, human beings are typically more sensitive to frequencies

in the 1 kHz - 4 kHz range.

Figure 3.3.2 – Masking curves for a 415 Hz masker at different levels.

(Taken from Roederer, 1995)

When presented with a pure tone at a given SPL, there is a certain minimum change in

SPL that is required to give rise to a change in loudness sensation. This is known as the

just noticeable difference (JND) in sound level and is roughly constant at the order of 0.2

– 0.4dB for the musically relevant range of pitch and loudness. Equivalently, it is the

minimum intensity that a second tone of the same frequency and phase must have, to be

noticed in the presence of the first tone, whose intensity is kept constant. This minimum

intensity is known as the threshold of masking. This is for the case of two tones of equal

frequency. Masking also takes place when two tones of different frequencies are

presented together. The masking level is determined as the minimum intensity level that

the masked tone must exceed in order for it to be “singled out” and heard in the presence

of a masking tone. This masking threshold depends heavily on the frequency difference

between the two tones. It has been found that masking effects are more predominant


(masking threshold is high) when frequency separation between the masker and maskee

is small. Also, in general, lower frequency tones mask higher frequency tones more

effectively. Figure 3.3.2 shows the masking curves for a 415 Hz masker. These are

basically plots of masking thresholds vs frequency for a maskee in the presence of a

given masker. Here each curve represents the thresholds given a 415 Hz masker

presented at the loudness level indicated within each curve.

The masking thresholds are quite high for maskees whose frequencies are in the vicinity

of the masker. It can be concluded that masking is effective only in a particular band of

frequencies around the frequency of the masker and masking effects for tones that are far

away in frequency can be ignored except at extremely high sound pressure levels. So far,

the phenomenon of masking has been discussed without discussing why it occurs. In this

respect, the critical band concept is commonly used to explain it and is the topic of the

next section.

3.4 CRITICAL BANDS

Fletcher (1940) proposed the critical band concept to account for many of the phenomena

of masking. He suggested that different frequencies produce their maximal effects at

different locations along the basilar membrane, and that each of these locations responds

to a limited range of frequencies. The range of frequencies to which a particular segment

responds is its critical band. In this respect, it is useful to view the basilar membrane as a

series of band-pass filters with a certain bandwidth corresponding to the critical band

(Tobias, 1970). When a tone whose frequency corresponds to a certain segment on the

basilar membrane is masked by wide band noise, only the frequencies of the noise that

fall within the bandwidth of that section (its critical band) are effective in masking the

tone. According to Fletcher, the tone is just detectable when its energy is equal to the

energy of the noise that affects that critical band. Fletcher says “When the ear is

stimulated by a sound, particular nerve fibers terminating in the basilar membrane are


caused to discharge their unit loads. Such nerve fibers can then no longer be used to carry

any other message to the brain by being stimulated by any other source of sound.”

Experiments have shown that when two tones whose difference in frequency is below a

certain limit are presented together, since they both stimulate the same region on the

basilar membrane, it is not possible to resolve two separate tones. Instead, a beating

sensation is heard as a result of the amplitude modulation that takes place. If the

frequencies are separated further beyond that limit but within the critical band, two tones

can be perceived but a sensation of roughness persists. It is only when the frequency

separation exceeds the critical band that the sensation of roughness disappears and both

tones sound smooth and pleasing. The critical band concept has been used to explain

many other phenomena, including frequency-dependent loudness summation of

multiple tones, and it is one of the most useful discoveries in the field of psychoacoustics

and music theory.


Chapter 4 – Non Orthogonal Signal Decomposition

Introduction

The first step in performing a time-frequency analysis is dividing the signal into small

frames and performing a spectral analysis on each frame. Since finite length windows are

used, the spectrum (or the DTFT) will have the smearing effect described in chapter 2.2.

In particular, for the case of a rectangular window, the DTFT of the windowed signal can

be obtained by convolving the long-term spectrum of the signal with the sinc function

corresponding to the spectrum of the window function. If the frequencies actually present

during that frame are to be extracted, they could be closely approximated by extracting

the peaks in the DTFT. To detect the frequency peaks in the DTFT, a DFT analysis

(using an FFT algorithm) could possibly be performed on the signal and peak frequency

bins in the DFT could be located. But the peak frequency bins in the DFT are not

necessarily equal to the peak frequencies in the DTFT (except for the case where the

signal contains a single frequency, which lines up exactly with a frequency bin). This is

because the bins in the DFT are already fixed regardless of the nature of the signal. The

peaks in the DTFT though, depend purely on the nature of the signal.

The aim now is to find the exact peaks in the DTFT. The DFT is not well suited to this

because of the fact that it has fixed frequency bins at equal spacing. An adaptive

algorithm is developed, which adapts to the signal and locates the exact peaks in the

DTFT. To accomplish this, a non-orthogonal signal decomposition method is used, by

which the DTFT can be sampled at any desired frequency. The following section reviews

the non-orthogonal signal decomposition method.

4.1 THEORY AND COMPUTATION

The fundamental operation in any transformation of a time domain signal into its

frequency equivalent is the decomposition of the signal into a predetermined set of


sinusoidal functions. These transformations offer an alternative description of the time

domain signal as a linear combination of a set of basis functions such as sinusoids,

exponential sinusoids and pure exponentials. The DFT decomposes the signal into a set

of frequencies, which are equally spaced between DC and the Nyquist frequency. It gives

a set of magnitude and phase coefficients corresponding to each frequency. These can be

converted to a set of a and b coefficients corresponding to each frequency bin such that

the signal can be reconstructed as the linear sum of each frequency multiplied by the

coefficient that was calculated. This is shown in equation 4.1.1. Based on DFT principles,

the signal at any time instant n, n=1,…, N is given by the following sum of weighted

sines and cosines:

$$ s(n) = \sum_{i=0}^{N/2} \Big[ a_i \sin\!\big(2\pi i(n-1)/N\big) + b_i \cos\!\big(2\pi i(n-1)/N\big) \Big], \qquad n = 1, \ldots, N \qquad (4.1.1) $$

This relationship is used in the new decomposition technique. The basic theory behind

signal decomposition using non-orthogonal sinusoidal bases is discussed by Dologlou et

al., 1996. This study demonstrates that any signal can be decomposed using virtually any

group of sinusoids. If non-orthogonal bases are used for decomposition, any set of

frequencies can be chosen to decompose the signal. In other words, the frequencies where

the DTFT of the signal is sampled can be chosen.

The computation is fairly simple. First, N frequencies that are desired for decomposition

are selected between 0 and half the sampling rate F_s. These frequencies are converted into their digital equivalents between 0 and π using θ_n = 2π f_n / F_s, n = 1, 2, …, N, where f_n is the frequency being converted and θ_n is its digital equivalent. Next, a matrix of size 2N×2N is formed using sine and cosine vectors, as shown in equation 4.1.2


$$ A = \begin{bmatrix} \sin\theta_1 & \cos\theta_1 & \sin\theta_2 & \cos\theta_2 & \cdots & \sin\theta_N & \cos\theta_N \\ \sin 2\theta_1 & \cos 2\theta_1 & \sin 2\theta_2 & \cos 2\theta_2 & \cdots & \sin 2\theta_N & \cos 2\theta_N \\ \sin 3\theta_1 & \cos 3\theta_1 & \sin 3\theta_2 & \cos 3\theta_2 & \cdots & \sin 3\theta_N & \cos 3\theta_N \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ \sin 2N\theta_1 & \cos 2N\theta_1 & \sin 2N\theta_2 & \cos 2N\theta_2 & \cdots & \sin 2N\theta_N & \cos 2N\theta_N \end{bmatrix} \qquad (4.1.2) $$

The matrix A contains alternate sine and cosine vectors in each column. Each column has

the first 2N samples in time corresponding to that sine or cosine vector. The last step is the inversion of the matrix, $B = A^{-1}$. The matrix B decomposes any signal into the chosen frequency components. If the signal is given by x(n) and it is decomposed into the chosen frequencies to give X(f), the relation between the two is:

$$ X = B \cdot x \quad \text{(analysis)} \qquad (4.1.3) $$
$$ x = A \cdot X \quad \text{(synthesis)} $$

Here both the A and B matrices are of order 2N×2N, and the matrices x and X are of order 2N×1. The decomposition is in the form of sine and cosine coefficients for each frequency. The magnitude and phase of each frequency can be calculated using:

$$ |F(n)| = \sqrt{X(2n-1)^2 + X(2n)^2} \qquad (4.1.4) $$
$$ \mathrm{phase}(F(n)) = \tan^{-1}\!\big( X(2n-1) / X(2n) \big) \qquad (4.1.5) $$

In principle, this transform is similar to the DFT, but there are some basic differences.

First and most important is the fact that the frequencies into which the signal is to be

decomposed can be chosen whereas a DFT has fixed bins. Second, the DFT has


coefficients for both positive and negative frequencies whereas this transform works for

only positive frequencies. Third, the DFT can operate on complex signals, but this

transform works only on real valued signals. Finally, the DFT forces one of the bins to be

located at DC and another one to be located at the Nyquist frequency regardless of the

number of points used. These two frequencies need not be used in the new transform.
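A minimal numpy sketch of this analysis path, equations 4.1.1–4.1.4, is given below. The time index 1, …, 2N used to build the basis columns is my reading of equation 4.1.2, and the frequency grid and test tone are chosen only for the example:

```python
import numpy as np

def make_basis(freqs_hz, fs):
    """Basis matrix A of eq. 4.1.2: alternating sine and cosine columns,
    each evaluated at the first 2N time samples (assumed index 1..2N)."""
    theta = 2 * np.pi * np.asarray(freqs_hz) / fs   # digital frequencies
    t = np.arange(1, 2 * len(freqs_hz) + 1)
    cols = []
    for th in theta:
        cols.append(np.sin(t * th))
        cols.append(np.cos(t * th))
    return np.column_stack(cols)

fs, N = 32000.0, 16
freqs = (np.arange(N) + 0.5) * fs / (2 * N)   # 500, 1500, ..., 15500 Hz
A = make_basis(freqs, fs)                     # 32 x 32 basis matrix
B = np.linalg.inv(A)                          # analysis kernel, B = A^-1

t = np.arange(1, 2 * N + 1)
x = np.sin(2 * np.pi * 1500.0 * t / fs)       # 2N-sample test frame
X = B @ x                                     # analysis: X = B.x
mags = np.hypot(X[0::2], X[1::2])             # |F(n)| as in eq. 4.1.4
print(freqs[np.argmax(mags)])                 # -> 1500.0 (synthesis: x = A @ X)
```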

4.2 – SUMMARY

In summary, the most useful property of this transform is that the frequencies used for

decomposition are selectable. The transform itself is not adaptive in any way, but this

property can be used to incorporate an adaptive algorithm discussed in the next chapter. It

is useful to remember that this transform does not give better frequency or time resolution

than the DFT. The only practical difference between the two is that the DFT samples the

DTFT at a fixed set of points whereas the new transform is capable of sampling the

DTFT at any desired frequency. The DTFT itself does not change in any way and

contains all the artifacts that come with sampling and windowing. Each frequency that is

found in a time frame still appears in the spectrum as a sinc function and indeed there is

interaction between the sinc functions of adjacent frequencies. In fact if two components

that are very close in frequency are present in the signal, the interaction between their

respective sinc functions makes it difficult to see two distinct frequencies in the total

spectrum. This is a windowing issue and needs to be dealt with separately.

In the light of this new freedom of selecting the frequencies for decomposition,

frequencies between DC and the Nyquist frequency can either be selected at random or

by using some criterion. In the next chapter the issue of selecting frequencies that are

useful during adaptation is discussed.


Chapter 5 – Adaptive Time/Frequency Analysis

The frequency resolution problem encountered during short-term analysis was discussed

in the introduction of chapter 4. This chapter starts with a discussion on why adaptation is

necessary followed by a review of earlier work using similar algorithms. The issue of

how adaptation is used to get better frequency resolution is then elaborated, and the

chapter ends with an explanation of how the time-frequency matrix is set up.

5.1 – WHY ADAPTATION?

The aim of the procedure is to be able to capture the frequency information in each frame

and store it in a time-frequency matrix. This automatically results in data reduction

because there are only a limited number of frequencies actually present in an audio signal

during a given time frame. As an example, consider the signal to be a pure tone of 400Hz.

At sampling frequency 32kHz, a time frame of 40ms contains 1280 samples of data. This

frame of 1280 samples is shown in figure 5.1.1a. If an FFT is performed on the same

signal using the same number of points, the plot shown in figure 5.1.1b is obtained.

The dotted line in figure 5.1.1b shows the DTFT of the 400 Hz signal. The solid line is

the FFT, which is the sampled version of the DTFT. It is seen that index number 17 of the

FFT has all the energy because there are frequency bins at every 25Hz, and the frequency

of 400Hz exactly lines up with frequency bin number 17. If a peak detector is used in the

frequency domain, it will detect a peak at bin number 17 which means that bin 17 has the

highest amount of energy among all the bins. Instead of storing the 1280 samples

required to represent the signal in the time domain, the only information which needs to

be stored is the fact that the frequency of the signal is 400Hz as was found in the FFT and

the magnitude of this 400 Hz signal as specified by the FFT. This is a large reduction in

data.


Figure 5.1.1a 40ms frame containing 1280 samples of a 400 Hz signal, Fs=32 kHz

The above case is the simplest. Music usually has multiple frequencies and the

frequencies generally do not line up with frequency bins thereby resulting in spectral

spreading. When a 390 Hz signal of the same length is analyzed, since the frequency of

390 Hz is between the frequency bins at 375 Hz and 400 Hz, spectral spreading is seen in

the FFT. This is shown in figure 5.1.2.

The problem here is that if fixed frequency bins are used, it is not possible to find out the

actual frequency of the signal when it doesn’t line up with a bin. In this case, a peak

detector will again find that the highest amount of energy in the FFT is at bin number 17

which corresponds to 400Hz. The fact that the actual frequency is 390Hz cannot be

determined unless some special techniques such as the adaptation technique, which is

described later, are used.
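The sketch below (numpy, illustrative only) checks this numerically: a 40 ms frame of 1280 samples at 32 kHz gives FFT bins every 25 Hz, the 400 Hz tone lands exactly on a bin, and the 390 Hz tone discussed next does not:

```python
import numpy as np

fs, n = 32000, 1280                  # 40 ms frame -> 25 Hz bin spacing
t = np.arange(n) / fs
for f0 in (400.0, 390.0):
    spec = np.abs(np.fft.rfft(np.sin(2 * np.pi * f0 * t)))
    k = int(np.argmax(spec))
    print(f0, "-> peak bin", k, "=", k * fs / n, "Hz")
# Both tones give their peak at the 400 Hz bin (numpy counts bins from 0,
# so this is the thesis's bin number 17): the true 390 Hz cannot be read
# off the fixed grid.
```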


Figure 5.1.1b DTFT and FFT of a 40ms frame of a 400Hz signal

When analyzing real music with multiple frequencies, a peak detector is used to detect all

the peaks in the DFT in order to find the frequencies that are actually present in that

frame. This is followed by the adaptation algorithm, which locates the frequency of the

detected peak more accurately. But even before adaptation, there are some inherent

problems that must be noted. Consider the case of a signal consisting of two closely spaced

frequencies of 390 Hz and 420Hz. If a DTFT analysis is performed on a 20 ms frame of

this signal, the DTFTs of the individual frequencies add up to give the DTFT of the

combination. This is shown in figure 5.1.3.

It is seen from this figure that although the signal is made up of two distinct frequencies

at 390Hz and 420Hz, the DTFT of the combination has only one peak at 404Hz. Even if a

peak detector is used followed by the adaptation algorithm that can exactly locate this

peak in the DTFT, it gives an inaccurate result because it finds only one peak at 404Hz

instead of one at 390Hz and one at 420Hz. This is because the DTFTs of the individual

frequencies are in the form of a sinc function and if the frequencies are very close

together, the side lobes and the main lobes of the two frequencies add up to give a result

that is very different from what is expected.


Figure 5.1.2 DTFT and FFT of a 390Hz signal of length 40ms

The problem gets worse as the separation between the individual frequencies gets

smaller. It also gets worse as the length of the window in time gets smaller. This is

because the smaller the window in time, the wider the individual spectra become and the

more the interaction between frequencies. Therefore, the peaks in a complex audio signal

are ‘moved’ due to frequency interaction and no longer correspond to the exact

frequencies present in the signal. This is purely a result of windowing the data and

analyzing it in small frames in time. It cannot be solved or improved using the adaptation

algorithm because the algorithm is only meant for finding the peaks in the DTFT of the

combined signal. If the DTFT of the combined signal is already ‘distorted’, the algorithm

simply finds the peaks that are slightly shifted in this ‘distorted’ DTFT. The longer the

frames that are used, the more this problem can be reduced. But when longer frames are

used, time resolution and transient information is lost.


Figure 5.1.3 DTFTs of individual tones of 390Hz and 420Hz along with the DTFT of the

combination tone

In general, it is preferable to use windows that are as long as possible without losing too

much time resolution. Another approach to solving this problem is by using special

windows that have reduced side lobes at the cost of wider main lobes. Some special

windows were experimented with, but it was found that the problem became worse due to

the widening of the main lobe. As long as frequencies in the signal are not very closely

spaced this problem is minimal. But at lower frequencies, where frequency separation

between individual notes tends to be small, the problem can be drastic. By dividing the

signal into frequency bands using a filter bank, the lower frequency bands can be

processed using larger frames as shown later.


5.2 – ADAPTATION

The adaptation algorithm used in this study is based on the algorithm developed by

Dologlou et al., 1996. The basic principle of the algorithm is very simple and is based on

a binary search method for zooming in on the frequency peak of the DTFT.

Consider the case of the 390Hz tone spanning over a frame of 40ms discussed earlier. Its

DTFT and 1280-point FFT are shown in figure 5.1.2. As can be seen from the figure,

since there is no frequency bin at 390Hz, the closest frequency bin to 390Hz which is at

400Hz contains the highest energy. If a peak detector algorithm is used, it finds the peak

at 400Hz. The objective is now to refine this estimate and get closer to the actual peak in

the DTFT, which is at 390Hz. First of all, it is important to note that this peak bin is

always found on the main lobe of the sinc function of the DTFT of the actual frequency.

It is now assumed that the DTFT can be ‘sampled’ at any desired frequency (which is

valid since a transform to do this was developed in chapter 4). If the DTFT is sampled at

frequencies just above (e.g. 400.001Hz) and just below 400Hz (399.99Hz), it is found

that the frequency just below 400Hz has slightly more energy than the point above (since

the peak is at 390Hz). This is shown in figure 5.2.1a.

This leads to the conclusion that the peak of the DTFT is below 400Hz (which is true in

this case because the peak is at 390Hz) but above 387.5Hz (since the previous frequency

bin is at 375Hz). It is guaranteed that the peak frequency is at a frequency that is higher

than the midpoint of 375Hz and 400Hz (387.5Hz). This is because if the peak frequency

were less than 387.5Hz, the peak frequency bin would have been at the 375Hz. With this

in mind, the spectrum is sampled at the midpoint between 400Hz and 387.5Hz

(393.75Hz). Here, the direction the slope is headed in is again checked, by sampling the

DTFT at frequencies just above and just below 393.75Hz. This is shown in figure 5.2.1b.

It is found again that higher energy is present in the frequency just below 393.75Hz.

Again, it is concluded that the peak frequency is below 393.75Hz but above 387.5Hz.

The spectrum is again sampled at a frequency half way between these two points and the

procedure is continued for as many iterations as required. At each step, the frequencies


just above and just below the frequency found in that iteration are sampled to find out

whether the peak is at a higher or lower frequency than the present frequency. Using this

method, it is possible to get as accurate an estimate of the peak frequency as required.
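A simplified sketch of this zoom-in is shown below. For brevity the DTFT is sampled directly by correlating the frame with a complex exponential, standing in for the non-orthogonal transform of chapter 4; the probe offset and iteration count are illustrative, and the inward-forcing refinement discussed later is omitted:

```python
import numpy as np

def dtft_mag(x, f, fs):
    """Magnitude of the frame's DTFT sampled at an arbitrary frequency f
    (stands in for the transform of chapter 4)."""
    n = np.arange(len(x))
    return abs(np.sum(x * np.exp(-2j * np.pi * f * n / fs)))

def zoom_peak(x, f_bin, spacing, fs, iters=10, probe=1e-3):
    """Binary-search refinement of a detected peak bin: probe the slope
    just above and just below the current estimate, then move half of
    the remaining interval toward the peak."""
    f, half = f_bin, spacing / 2.0
    for _ in range(iters):
        going_up = dtft_mag(x, f + probe, fs) > dtft_mag(x, f - probe, fs)
        half /= 2.0
        f = f + half if going_up else f - half
    return f

fs, n = 32000, 1280
x = np.sin(2 * np.pi * 390.0 * np.arange(n) / fs)   # true peak near 390 Hz
print(zoom_peak(x, 400.0, 25.0, fs))                # 400 -> 393.75 -> ... ~390 Hz
```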

Figure 5.2.1a Sampling the spectrum at frequencies just above and just below the 400 Hz

peak frequency bin

Figures 5.2.1b Sampling the spectrum at frequencies just above and just below 393.75Hz


The algorithm just described can be implemented only if the DTFT can be sampled at the

required frequencies. This cannot be done using conventional FFTs or DFTs, which have

fixed bins. But if the transform described in chapter 4 is used, frequency bins can be

chosen arbitrarily. This means that the freedom of ‘sampling’ the DTFT at different

frequencies, which was assumed above can be implemented with this new transform.

Selection of Frequencies

The basic procedure for the adaptation algorithm has been described in the previous

section. With this in mind, the initial set of frequencies used for decomposition can be

selected to give the best result for a given number of frequencies. Assume that N

frequencies between 0 and π are to be selected to decompose a signal with 2N points. An

FFT would have N equally spaced frequencies between 0 and π, including 0 and π, which correspond to DC and the Nyquist frequency. If adaptation is being used though, the algorithm can adapt itself to any frequency within half the frequency interval between adjacent bins, on either side of the peak frequency bin. In the case

described in the previous section describing adaptation, the frequency spacing between

bins is 25Hz. If the peak frequency bin is found to be at 400Hz, the algorithm can adapt

itself to any frequency within 12.5Hz on either side of 400Hz. This holds true for all

frequency bins. This also means that frequency bins at DC or at the Nyquist frequency

are not required. Instead, it suffices to have frequency bins located at 12.5Hz above DC

and at 12.5Hz below the Nyquist frequency and spaced at 25Hz everywhere else. A

general spacing of frequencies is shown in figure 5.2.2. Here d2 is the spacing between

adjacent frequency bins and d1 is the spacing between DC and the first bin as well as the

spacing between the Nyquist frequency and the last bin.


Figure 5.2.2 Example of frequency spacing for the non-adaptive transform

The case where d1 = d2/2 is used since, even if there is a frequency component at DC, it is within the adaptation range of the first bin. The same applies with regard to the Nyquist frequency and the last bin.
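Since d1 = d2/2 and the N bins must cover 0 to F_s/2, the spacing works out to d2 = F_s/2N and the k-th bin sits at (k + 1/2)·d2. A one-function sketch (the naming is illustrative):

```python
import numpy as np

def initial_bins(n_freqs, fs):
    """Initial analysis frequencies with the spacing of figure 5.2.2:
    equal gaps d2 between bins and half a gap (d1 = d2/2) between DC and
    the first bin, and between the last bin and the Nyquist frequency."""
    d2 = (fs / 2.0) / n_freqs
    return (np.arange(n_freqs) + 0.5) * d2

print(initial_bins(16, 32000.0))   # 500, 1500, ..., 15500 Hz
```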

Adaptive Algorithm by Dologlou et al., 1996

The algorithm developed by Dologlou et al. is shown in figure 5.2.3. This algorithm is

based on minimizing the energy error due to ‘spectral spreading’ and was implemented

only for the case of a simple pure tone signal. The algorithm starts with some

initializations for variables. The initial set of frequencies to be used for the transform is

then selected. After computing the transform using these frequencies, a peak detection

algorithm is used to detect the peak in the calculated spectrum.


Figure 5.2.3 – Adaptive algorithm by Dologlou et al.


The energy error, which is defined as the sum of the energy in all the bins except the peak bin, is calculated. The peak frequency is replaced by a frequency that is just below it by an

amount MaxDifference. The transform is then recalculated using this new frequency. The

new energy error is calculated using the same criterion as before after applying the new

transform. If it is less than the previous energy error, the peak frequency is replaced by

the mean of the peak frequency bin and the frequency bin just above it. Otherwise, it is

replaced by the mean of the peak frequency bin and the frequency bin just below it. It

must be noted that the frequencies of the peak bin and that of the bins below and above it

are being continuously updated in this process. The process terminates when the peak bin

calculated in the present iteration and the one calculated in the previous iteration differ by

an amount less than maxdiff. It should be noted that this method works well only for

signals which have a single frequency component. The energy error criterion becomes

difficult to use for signals that have multiple frequencies.

Adaptive Algorithm by Matt Kotvis, 1997

Matt Kotvis improved upon the above mentioned algorithm to produce a more efficient

method better suited to typical audio signals. He used filter banks to decompose the

signal into 10 octaves starting at 22.05kHz and downwards. Each octave was analyzed

using a different window length depending on the number of frequencies that were used

in the transform for that octave. The number of frequencies used for decomposition

determined the size of the decomposition matrix and therefore the window length. The table showing the window lengths in milliseconds for 4-, 8- and 16-frequency transforms is shown in figure 5.2.4.

He found that the arrangement that gave the best trade off was using 16 frequencies in the

highest seven octaves, 8 frequencies in the third octave and 4 frequencies in the lowest

two octaves. Since the consecutive octaves were down sampled (described in chapter

2.4), the same transformation matrix containing 16 frequencies between 0 and π could be


used for the top seven octaves followed by an 8 frequency matrix for the third octave and

the same 4 frequency matrix for the first two octaves. Since the same matrix was used for

different octaves, a conversion was required to calculate the actual frequency in that

octave and make up for the down sampling operation.

Octave 4 Frequencies 8 Frequencies 16 Frequencies

10 (10 – 20 kHz) 0.363ms 0.726ms 1.45ms

9 (5 – 10 kHz) 0.726ms 1.45ms 2.9ms

8 (2.5 – 5 kHz) 1.45ms 2.9ms 5.8ms

7 (1.25 – 2.5 kHz) 2.9ms 5.8ms 11.6ms

6 (625 – 1250 Hz) 5.8ms 11.6ms 23.2ms

5 (312 – 625 Hz) 11.6ms 23.2ms 46.4ms

4 (156 – 312 Hz) 23.2ms 46.4ms 92.9ms

3 (78 – 156 Hz) 46.4ms 92.9ms 186ms

2 (39 – 78 Hz) 92.9ms 186ms 372ms

1 (0 – 39 Hz) 186ms 372ms 743ms

Figure 5.2.4 Table showing octaves and analysis window lengths depending on number

of frequencies used in the transform

The flow chart for his algorithm is shown in figure 5.2.5. This algorithm works basically

the same way as the one by Dologlou et al., but with some additional improvements. First

the total energy in each time block is calculated. All blocks with energy below a certain

threshold are ignored during the adaptation process. This is to guard against very quiet

sections which have a very low signal to noise ratio thereby resulting in the detection of

totally irrelevant frequencies during that period. Second, several frequency peaks could

be adapted to in a single block, provided it is established that there is more than one frequency peak in that time block of that octave. Also, no frequency whose energy is more than 10 dB below the maximum energy in its block is adapted to. Additionally, no more than one

quarter of the frequencies in a time block could be adapted to.


Figure 5.2.5 – Flowchart for Matt Kotvis algorithm

(Taken from Kotvis, 1997)


The only purpose of having these constraints is to reduce the computation time by getting

rid of time/frequency points without significant energy. Another improvement in the

algorithm is a condition that forces any time/frequency point to stay within certain

boundaries during adaptation. In the previous algorithm, for certain signals, the

adaptation algorithm would not converge on nearby frequencies but would instead

continuously move to a higher or lower frequency until it was virtually on top of

another frequency. This problem is solved by forcing the adaptation to move ‘inwards’

after the first iteration. This is necessary because the first iteration samples the DTFT at

half the frequency interval between the lower and higher bin. If the algorithm were

allowed to move continuously higher or lower, it would converge on the adjacent bin. So,

if the first iteration is towards the left (lower in frequency), the second iteration is

towards the right. The other criterion that is changed in this improved algorithm is the

criterion for ceasing adaptation. Whereas in the previous case, the criterion for ceasing

adaptation was when the difference between the adapted frequency of the present and the

previous iteration was below a certain threshold, in this case it ceases adaptation after

exactly 10 iterations in all cases.

This algorithm has certain drawbacks. Unlike the non-adaptive case, where one

frequency set is used, the adaptive distribution could have a large number of frequencies

used for decomposition. This could potentially create a different frequency set used for

each and every time-frequency block and this leads to large data size. Another

disadvantage with this algorithm is that it does not know when a signal has been exactly

matched. For example, even if a signal lies exactly on a frequency bin, it tries to adapt to

it by shifting the frequency of the bin away from it and then continuously towards it,

thereby minimizing error after 10 iterations. Also, the time-frequency blocks that are used

are fixed. There are only 3 sets of transforms using 4, 8 or 16 frequencies, which are used

during decomposition and block sizes for each band are fixed. Considering that different

types of audio data have different properties in terms of frequency content and dynamics,

this may not be the best approach. A better approach would be to have flexible block

lengths, a flexible number of frequencies used for decomposing and a flexible number of


iterations for adapting in each block depending on the type of music or audio signal being

processed. Also, the fact that there are only 3 sets of transforms used implies that there

are only certain window lengths in time that can be used. In the case of this algorithm, the

lowest sub-band (discussed later) is processed at 186 ms, which is too large a window

length for music and leads to very poor time resolution. The highest sub-band is

processed at window lengths of 1.45 ms, which is too small and leads to increased

spectral spreading.

5.3 IMPROVED ADAPTIVE ALGORITHM

The algorithms mentioned above have certain weaknesses that are corrected in the

improved algorithm developed in this study. The improvements are as follows:

Improvements

1) The algorithm by Matt Kotvis uses filter banks to divide the signal into frequency

bands but does not make full use of them. Based on the properties of the signal, it is

useful to be able to process different bands of data using different parameters. The

algorithm developed in this study allows for any required window length to be used

for any sub-band. Thus the optimum window length for a given sub-band can be used

while processing it.

2) The algorithm also allows the user to choose the compression level by adjusting the

maximum number of detected peaks. A ceiling can be put on this, so that the data rate

of the new representation never exceeds the required amount.

3) Also, a psychoacoustic model is used to discard redundant information in each

frequency band thereby reducing data. This validates the need for frequency bands so

that the model can be applied to each band separately.

4) Variable frequency resolution is used to reduce the computational power required.

This method uses the JND in frequency of the human ear discussed in chapter 3.

5) The DFT matrix is the inverse of the IDFT matrix. However it can be obtained from

the IDFT matrix without using matrix inversion. This observation is extended to the

algorithm being developed and the process of matrix inversion, which is required in


the algorithm by Dologlou et al. and Kotvis, is skipped. This results in increased

computational speed.

Since the filter bank is critical to all these improvements, this section begins with a

discussion on the filter bank used to separate the signal into bands. This is also the first

step of the algorithm.

The Analysis Filter Bank

When using filter banks, it is common to have an analysis as well as a reconstruction

section. Filter banks are discussed in 2.4 and figure 2.4.3 shows a typical analysis filter

bank which divides the signal into 10 octaves. In this study the signal is divided into six

bands and this is shown in figure 5.3.1. The reasons for dividing the signal into six bands

are discussed later. At each stage, there is a high pass filter (H1) and a low pass filter

(H0), both followed by decimation stages. As a result of decimating the signal in the

manner shown in figure 5.4.3, the same high pass and low pass filters can be used for

each stage though the cutoff frequencies in Hz at each stage are different. This is because

a typical digital filter is just a combination of coefficients that are adjusted such that there

is a cutoff frequency at some frequency between 0 (DC) and π (Nyquist frequency). The

cutoff frequency in Hz depends on the sampling rate of the input signal. As an example,

if a filter has a cutoff frequency at π/2, it means that for an input signal with a sampling

rate of 32kHz, the cutoff frequency is at 8kHz. But if the input signal has a sampling rate

of 8kHz, the cutoff frequency shifts to 2kHz. It turns out that because of this reason and

the fact that the signal is being decimated, the same high pass filter coefficients and low

pass filter coefficients can be used at each stage.

At each stage the signal is divided into a high-passed version that has frequencies from

π/2 to π and a low-passed version that has frequencies between 0 and π/2. Due to aliasing concerns during down sampling, it must be ensured that the attenuation for both these filters is sharp and there is minimum energy leakage. The low-pass filter has to be designed such that the attenuation at frequencies of π/2 and above is very high. The same holds for the high-pass filter with regard to frequencies of π/2 and below. In


the case of a perfect reconstruction filter bank, some aliasing is allowed during analysis

because the synthesis filter bank is designed to cancel it out. Since only the analysis

section is used, such perfect reconstruction filters are not required and higher order filters

with narrow transition bands can be used.

Figure 5.3.1 – Filter bank dividing the signal into six bands

In this study, an eighth order elliptic IIR (Infinite Impulse Response) filter is used for

anti-alias filtering. Also, the zero-phase digital filtering technique is used to avoid phase

distortion, which is inherent in high order IIR filters. Zero-phase digital filtering

(Jeong & Ih, 1999) is conducted by processing the input data in both the forward as

well as reverse directions. After filtering the data in the forward direction, this filtered

sequence is reversed and run through the filter. The resulting sequence will have zero

phase distortion and double the filter order. The elliptic filter is chosen for its narrow

transition-band and high stop-band attenuation. The low-pass filter has a cutoff frequency

f_c = 0.468π, so that at the frequency of 0.5π the attenuation is at least 100dB. The

attenuation of 100dB is chosen because 16-bit quantized signals have a dynamic range of

around 96dB. Similarly an eighth order elliptic high-pass filter is used to perform the

high pass filtering. It has a cutoff frequency at 0.532π . The magnitude responses for both

these filters are shown in figures 5.3.2a and b. The drawback in this method is that if

there are any frequencies that are equal to or in the close vicinity of 0.5π, they are

sharply attenuated. Since the zero-phase filtering technique is used, the phase response of

both filters is ultimately zero and is not shown. The coefficients are given in appendix A.
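A sketch of one analysis stage with these design targets is shown below (Python/scipy). The 0.1 dB passband ripple is an assumed design value, since the exact coefficients live in appendix A, and scipy's sosfiltfilt plays the role of the forward-backward zero-phase filtering described above:

```python
import numpy as np
from scipy import signal

# Eighth-order elliptic low-pass (cutoff 0.468*pi) and high-pass
# (cutoff 0.532*pi); cutoffs are given as a fraction of Nyquist.
sos_lo = signal.ellip(8, 0.1, 100, 0.468, btype='lowpass', output='sos')
sos_hi = signal.ellip(8, 0.1, 100, 0.532, btype='highpass', output='sos')

def analysis_stage(x):
    """Split x into the lower and upper halves of its band, zero-phase
    filtered (forward and backward passes) and decimated by 2."""
    lo = signal.sosfiltfilt(sos_lo, x)[::2]   # 0 .. pi/2 of the input
    hi = signal.sosfiltfilt(sos_hi, x)[::2]   # pi/2 .. pi, spectrum mirrored
    return lo, hi

fs = 32000
x = np.random.randn(fs)            # one second of test audio
rest, y6 = analysis_stage(x)       # y6 ~ the 8-16 kHz band of the input
```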


Figure 5.3.2 (a) – Low-pass filter magnitude response and (b) High-pass filter magnitude

response

The sampling rate of the input signal is chosen to be 32 kHz. The filter bank divides the

signal into six sub-bands between DC and 16 kHz. The six sub-bands are as follows:

1) y6: 8 – 16 kHz

2) y5: 4 – 8 kHz

3) y4: 2 – 4 kHz

4) y3: 1 – 2 kHz

5) y2: 500 – 1000Hz

6) y1: 0 – 500Hz

The signals y1 – y6 are the filtered versions of the input signal and contain frequency

information equivalent to what is shown above. It is important to note that due to the

decimation process these signals do not actually sound like filtered versions of the signal

because their spectrum has been mirrored. As an example, y6 contains information

between 8 – 16 kHz of the original signal but it is present in y6 between 0 and 8 kHz.

Analysis of this signal yields a group of detected peaks between 0 and 8 kHz.

But using a conversion formula, these are converted to their actual values between 8 and

16 kHz in the original signal. The same holds true for all the sub-bands.
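A sketch of this conversion is given below; the mirrored mapping follows the y6 example above, while the function signature and the mirrored flag are mine (whether a deeper band is mirrored depends on how many high-pass stages it has passed through):

```python
def actual_frequency(f_detected, band_low, band_high, mirrored):
    """Map a peak detected in a decimated sub-band (0 .. band width in
    the decimated signal) back to its frequency in the original signal.
    For a mirrored band such as y6 a detected peak d corresponds to
    band_high - d; a non-mirrored band simply shifts by band_low."""
    if mirrored:
        return band_high - f_detected
    return band_low + f_detected

print(actual_frequency(6000.0, 8000.0, 16000.0, True))   # -> 10000.0 Hz
```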


Time-Frequency Blocks

The next step in the algorithm is to set up the time-frequency blocks for each sub-band.

As discussed earlier, the algorithm is flexible enough so that it can be adapted to the type

of audio signal being analyzed. Each sub-band signal is processed separately to give the

time-frequency distribution for that sub-band. The block (or frame or window) lengths

for all the sub-bands are in the region of 10ms – 80ms. In general, the lower frequency

sub-bands are processed using longer block lengths than the higher frequency sub-bands

to reduce the spectral spreading problem discussed earlier. At the beginning of the

algorithm, there is a set of variables (including block length) that are set to certain

optimum values depending on the type of signal being analyzed. For example, the

following are the block lengths used for the six sub-bands of a signal shown in table 5.3.1

if the expected signal is of a “single instrument” type.

Sub-band            Block length chosen   Samples/block   Frequencies used in decomposition   Size of transform matrix
Y1 (0 – 500Hz)      80 ms                 80              40                                  80x80
Y2 (500 – 1000Hz)   40 ms                 40              20                                  40x40
Y3 (1 – 2kHz)       40 ms                 80              40                                  80x80
Y4 (2 – 4kHz)       40 ms                 160             80                                  160x160
Y5 (4 – 8kHz)       40 ms                 320             160                                 320x320
Y6 (8 – 16kHz)      20 ms                 320             160                                 320x320

Table 5.3.1

These block lengths are empirically derived by experimenting with various settings for

various instruments, and using the set that gives the best results. The size of the transform

matrix (the number of frequencies used for decomposition) depends on the block length

and the algorithm calculates the suitable transform matrix depending on the block length

chosen. The only constraint for choosing block length is that it must be a number that is


in the form of 5×2^n milliseconds, where n is a non-negative integer. In other words, 5ms,

10ms, 20ms, 40ms, 80ms etc. are all suitable. This constraint is present because the signal

must be divided into an integer number of blocks and the algorithm is written so that if

these window lengths are used, the signal can be divided into an integer number of

blocks.

Maximum Frequency Peaks

Another set of variables which need to be set at the beginning of the algorithm are the

number of peaks that are adapted to in each block of each sub-band. The number of

frequency peaks that are detected depends purely on the music and the number of peaks

present in the DTFT. But the number of peaks that are actually retained after applying the

psychoacoustic model is less than that number. In some cases even this number is too

high. So, a ceiling is set on the maximum number of peaks that can be stored for every block of a given sub-band. Again this is set up depending on the type of audio signal

being analyzed and the amount of compression required. Knowing the expected audio

signal gives an estimate of which band contains the majority of the information and thus

this variable is set up accordingly. As an example, the maximum number of frequency

peaks to be stored for a signal that is expected to be a guitar chord is shown below in

table 5.3.2. In this study these numbers are derived empirically.

Sub-band Maximum frequency peaks

Y1 (0 – 500Hz) 6

Y2 (500 – 1000Hz) 6

Y3 (1 – 2kHz) 8

Y4 (2 – 4kHz) 12

Y5 (4 – 8kHz) 12

Y6 (8 – 16kHz) 0

Table 5.3.2
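In an implementation, these per-band settings are naturally collected into small lookup tables set once at the start of the algorithm; a sketch with the values of tables 5.3.1 and 5.3.2 (the dictionary layout itself is illustrative):

```python
# Per-sub-band settings for the two example signal types above.
block_ms = {"y1": 80, "y2": 40, "y3": 40, "y4": 40, "y5": 40, "y6": 20}   # single instrument
max_peaks = {"y1": 6, "y2": 6, "y3": 8, "y4": 12, "y5": 12, "y6": 0}      # guitar chord
```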


Variable Frequency Resolution

The next set of variables that need to be set up are the number of iterations that need to be

performed during adaptation. This is directly related to the frequency resolution of the

human ear described in chapter 3.2. It is seen from figure 3.2.1 that the human ear only

has a limited frequency resolution, which is different at different frequencies. As an

example, according to figure 3.2.1, at 500Hz, the frequency resolution of the average

human ear is around 4Hz. In other words, the human ear cannot distinguish between

500Hz and 504Hz. This in turn implies that while the algorithm adapts to some peak

frequency, and zooms in on the exact peak frequency in the DTFT, there is only a certain

level of accuracy required before the human ear fails to distinguish the difference (Jeong & Ih, 1999). This means that only a certain number of iterations are required

while adapting to the peak frequency before the human ear cannot distinguish the

difference. This in turn results in reduced calculations. For a given sub-band of

frequencies, the number of iterations required to get within a given range from the actual

peak frequency can be calculated. Consider the case of the signal being of the single

instrument type. From table 5.3.1, it can be seen that for the sub-band signal y2

consisting of frequencies between 500 and 1000Hz, 20 frequencies are used for

decomposition in each block. This means that the frequency separation between adjacent

frequency bins is roughly:

$$ \Delta f = \frac{1000 - 500}{20} = \frac{500}{20} = 25\ \mathrm{Hz} $$

The frequency separation is very close to but not exactly equal to 25Hz because the bins

are separated by equal distances everywhere except the first and last bins as discussed in

the section “Selection of Frequencies” in chapter 5.2. This means that any located peak is

already within 12.5 Hz of the actual peak in the DTFT. The next iteration results in a

frequency, which is within half that distance from the peak frequency, and every

subsequent iteration during adaptation converges upon the peak frequency in the DTFT.

From figure 3.2.1 it can be observed that the frequency resolution of the ear between


500Hz and 1000Hz is between 4Hz and 5Hz. Even if a conservative estimate of 1Hz is

assumed, to make up for listeners with very good resolution, it is found that only 4

iterations are needed to get within 0.78125Hz of the actual peak frequency.

Using a similar criterion for each band of frequencies, a specific number of iterations

required per sub-band can be set up. This is known as variable frequency resolution

because the frequency resolution depends on the specific sub-band being analyzed. This

saves computation time compared to the method used by Kotvis, where every frequency

was adapted to using 10 iterations. An example is shown below in table 5.3.3 for the case

of the signal being of single instrument type.

Sub-band   Frequency range   Frequencies in transform   Δf between bins   Ear resolution (figure 3.2.1)   Conservative estimate      Iterations required
Y1         0 – 500Hz         40                         12.5 Hz           3 Hz                            0.5 Hz                     5
Y2         500 – 1000Hz      20                         25 Hz             4 Hz                            1 Hz                       4
Y3         1 – 2kHz          40                         25 Hz             5 Hz                            2 Hz                       3
Y4         2 – 4kHz          80                         25 Hz             10 Hz                           4 Hz                       2
Y5         4 – 8kHz          160                        25 Hz             20 Hz                           7 Hz                       2
Y6         8 – 16kHz         160                        50 Hz             –                               10 Hz (by extrapolation)   3

Table 5.3.3

The number of iterations required is thus calculated in the last column of table 5.3.3

depending upon the various factors shown, and set up at the beginning of the algorithm.

While adapting to a frequency peak, this reduces computations by a factor of between 2

and 4.
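As a check on the reasoning behind table 5.3.3, the required iteration count can be computed directly. The following is a minimal Python sketch written for this text (not part of the original implementation): it halves the worst-case error, which starts at half the bin spacing, until it falls below the target resolution, and reproduces the 4 iterations derived above for sub-band y2.

```python
def iterations_needed(bin_spacing_hz, target_resolution_hz):
    """Iterations of the halving search needed before the worst-case
    distance to the DTFT peak drops below the target resolution.
    A detected peak starts within half a bin spacing of the true peak,
    and each adaptation iteration halves that distance."""
    error = bin_spacing_hz / 2.0          # worst-case error before adaptation
    iterations = 0
    while error > target_resolution_hz:
        error /= 2.0
        iterations += 1
    return iterations

# Sub-band y2: 25 Hz bin spacing, 1 Hz conservative resolution
print(iterations_needed(25.0, 1.0))       # prints 4 (error ends at 0.78125 Hz)
```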


Setting up the Transform Matrix

The transform matrix, its creation and how it was used in the previous algorithms was

described in chapter 4. In brief, it involved choosing a set of frequencies, creating the

matrix containing the basis functions and then inverting it to get the kernel. This

procedure is derived from general matrix theory and is the correct way to create the

transform matrix. However there is an easier way to obtain the kernel. Consider the DFT

matrix and its inverse the IDFT matrix. The IDFT matrix contains the basis functions and

when it is inverted the DFT matrix is obtained, which contains the kernel. They are

related as:

$$\text{IDFT} = \text{DFT}^{-1} \tag{5.3.1}$$

If x(n) is an N point sequence of data, then the DFT and the IDFT operations respectively

are as follows:

$$\text{DFT}(m) = \sum_{n=0}^{N-1} x(n)\, W_N^{mn}, \qquad m = 0, 1, \ldots, N-1 \tag{5.3.2}$$

$$\text{IDFT}(m) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, W_N^{-mn}, \qquad m = 0, 1, \ldots, N-1$$

where $W_N = e^{-j2\pi/N}$.

These two operations are the inverse of each other, and yet the only difference between them is the sign in the exponent of $W_N$ and an overall gain constant of 1/N. In fact, if

the DFT and IDFT operations are performed on the same sequence of data, the only

difference between the magnitudes of the outputs of the two would be a scaling factor

1/N. This means that for sampling the DTFT of a given sequence, the IDFT

transformation could easily be used with a scaling factor of 1/N to get the same result as


the DFT operation. This is an important result and is used extensively during this study.

Since only the magnitude characteristics of the DTFT are important, while forming the

transform, the process of inverting the matrix can be skipped as demonstrated in chapter

4. Instead, the transformation matrix is formed by multiplying the basis function matrix

of size NxN with the scaling factor 1/N. Thus the procedure of matrix inversion is

avoided each time the transformation matrix is formulated. This greatly reduces

computational complexity.
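This scaling property is easy to verify numerically. The following numpy sketch (illustrative only, not the code used in this study) builds the basis-function matrix, obtains the kernel once by inversion and once by scaling with 1/N, and confirms that both give the same magnitude spectrum for a real-valued block:

```python
import numpy as np

N = 64
x = np.random.default_rng(0).standard_normal(N)   # a real-valued block

# Basis-function (IDFT-style) matrix: entries exp(+j*2*pi*m*n/N)
m = np.arange(N)
basis = np.exp(2j * np.pi * np.outer(m, m) / N)

# Kernel obtained the "correct" way (matrix inversion) and the cheap way (1/N)
kernel_inverted = np.linalg.inv(basis)
kernel_scaled = basis / N

# For a real block, both kernels yield the same magnitude spectrum
print(np.allclose(np.abs(kernel_inverted @ x), np.abs(kernel_scaled @ x)))  # True
```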

Also, in the algorithm by Dologlou et al. as well as the improved one by Matt Kotvis, the criterion of least energy error is used to find the peak energy. For each step, the entire transform matrix is recalculated by replacing just one basis function and then inverting the matrix. Further, at each step the energy error, defined as the sum of the energy in all bins except the present frequency bin, is calculated by applying this entire new transform matrix. These calculations require a large amount of processing power, especially since they are done at every step and iteration during adaptation.

The same results are obtained by noting that only the peaks in the DTFT need to be

found. This means that there is no necessity of calculating the entire transform at each

step and no need for calculating energy error during adaptation. The energy in the peak

frequency bin, which was found when the transform was used, is first stored. Then by

shifting the frequency in the direction where the energy increases, the peak frequency in

the DTFT can be converged upon. At each step it is only necessary to calculate the basis

function corresponding to the desired frequency (1 row of the transformation matrix).

The energy in that one frequency bin is calculated and then compared to the previous

case. In fact this procedure is very similar to the one outlined in the first section,

“Adaptation” in chapter 5.2.


The following then, is a summary of the initialization procedure including user input for

the algorithm.

1) Enter the type of signal being processed (musical piece, single instrument, voice?)

2) Depending on the type of signal, certain parameters that were found to be optimum are already set up for each type of signal, but can be changed if required (a sketch of such a parameter set-up follows this list). The following are the parameters:

a) Block length for each sub-band: blocklength_1 – blocklength_6

b) Number of iterations during adaptation for each sub-band: iteration_1 –

iteration_6

c) Maximum number of frequencies to adapt to for each sub band: peaks_1 –

peaks_6

d) Type of window to be used (if other than a rectangular window).

3) Check if the total number of samples in the input signal is a multiple of 1280. This is

required for decimation to be done properly. If it is not a multiple, pad zeros equally

on the left as well as right, so that it is a multiple.

4) Divide the signal into 6 sub-bands using the filter bank and store these sequences (y1

– y6)

5) Consider the block length for each sub band and call a function that creates the proper

transform matrix based on the block size required for each band. The 6 transform

matrices are stored.

6) Process each block of each sub band using the proper transform matrix and then apply

the adaptation procedure specified in the next section.
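As an illustration of step 2, the parameter set-up can be pictured as a table of per-band values. The structure below is hypothetical (names invented for this text); the numbers are borrowed from tables 5.3.3 and 7.2.1 for the single-instrument case.

```python
# Hypothetical per-signal-type defaults, one value per sub-band y1..y6.
parameters = {
    "single_instrument": {
        "blocklength_ms": [40, 40, 40, 40, 40, 20],  # blocklength_1..6
        "iterations":     [5, 4, 3, 2, 2, 3],        # iteration_1..6
        "max_peaks":      [5, 6, 8, 12, 14, None],   # peaks_1..6 (None: no adaptation)
        "window":         "rectangular",             # or "hamming", "hann", "blackman"
    },
}
```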

Adaptation Procedure

This section describes the adaptation procedure for a single block of a given sub-band.

After setting up all the variables as well as the transform matrix as outlined in the

previous section, adaptation can begin. Figure 5.3.3a outlines the initialization procedure

for a single block of data and figure 5.3.3b demonstrates the adaptation procedure for a

single frequency peak.


[Figure 5.3.3a – Initialization procedure for the new adaptation algorithm. The block y(n) enters with its preset block length bl, iteration count i and maximum peak count p; a window is applied to the block if required; the transformation matrix already calculated is applied to obtain the transformed sequence (spectrum) Y(m); peaks are detected in Y(m), a peak being defined as a point with greater magnitude than both its adjacent points; if the total number of peaks detected exceeds p, the peaks are sorted in order of magnitude and only the highest magnitude peaks are kept; the psychoacoustic model then discards redundant peaks and adaptation is performed for the remaining peaks.]


[Figure 5.3.3b – Actual Adaptation. Initialization: diff = 0.001 Hz, fd = (frequency bin spacing)/2, newfreq = freq, newenergy = oldenergy = energy, mult = 1. The following is done i times, or until mult = 0: plusenergy and minusenergy, the energies in the DTFT at the frequencies newfreq + diff and newfreq − diff, are computed and K = plusenergy − minusenergy is formed. If K is positive, fd = fd/2 and newfreq = newfreq + fd; if K is negative, fd = fd/2 and newfreq = newfreq − fd; if K is zero, both energies are the same and mult is set to 0. At the end, newfreq and newenergy are returned.]


In figure 5.3.3a, the initialization procedure for the adaptation algorithm is shown. In the

previous section, the calculation of the parameters for block sizes, maximum numbers of

frequencies and number of iterations for each block of each sub-band are shown. These

are the starting inputs for the initialization section. The signal (block) is fed into the first

section. If required, the block of data is windowed using a special window such as a

Hamming, Hann or Blackman window. The transformation matrix (which has already

been calculated) is then applied on this (windowed) block of data to obtain the

transformed sequence.

In the next step, a peak detector is applied on the magnitude characteristics of this

transformed sequence. The peak detector detects and stores all the frequency peaks and

their energies as found in the transformed sequence. These are the frequencies that are

actually present in the block of data but are still inaccurate. Adaptation is performed for

each peak to find a more accurate estimate of the actual peak in the DTFT. If the number

of peaks exceeds the maximum number of peaks set up at the beginning, only the peaks

that are highest in magnitude are kept to remain within this constraint. A psychoacoustic

model is then applied to further discard redundant frequencies. In general, any

psychoacoustic model may be applied to discard peaks in an intelligent manner. At

present, only one psychoacoustic criterion is being used. From figure 3.2.2, it is seen that

a 50dB tone of frequency 415Hz at the center of an octave successfully masks

frequencies in that octave whose magnitudes are more than 20dB below it. Each sub-band

consists of an octave of information. All frequencies in that block in that sub-band that

are more than 20dB in magnitude below the highest peak frequency in that block are

discarded. Thus the number of peaks is further reduced. The next step is to adapt to each

one of these peaks.
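A sketch of this single masking criterion, assuming peak magnitudes are available on a linear scale (the helper below is written for this text, not taken from the actual implementation):

```python
import numpy as np

def discard_masked_peaks(freqs_hz, mags, threshold_db=20.0):
    """Keep only the peaks of a sub-band block that lie within threshold_db
    of the strongest peak; everything further down is assumed masked."""
    mags = np.asarray(mags, dtype=float)
    keep = 20.0 * np.log10(mags / mags.max()) >= -threshold_db
    return [f for f, k in zip(freqs_hz, keep) if k], mags[keep].tolist()

# A 0.05 peak is about 26 dB below 1.0, so it is discarded; 0.5 (-6 dB) survives
print(discard_masked_peaks([500, 505, 620], [1.0, 0.5, 0.05]))
```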

The adaptation process for a single peak is shown in figure 5.3.3b. To better understand

this procedure, the following is an outline of what each variable stands for:


Variable | Explanation
Energy | Energy in the frequency peak before adaptation
Freq | Frequency to be adapted (frequency of the peak bin)
Newfreq | Frequency variable that is moved continuously in a binary search towards the actual peak
Fd | Incremental amount used to change newfreq at each iteration; it reduces by half after each iteration
Oldenergy | Energy in the DTFT for the frequency corresponding to the previous stage of adaptation
Newenergy | Energy in the DTFT for the frequency corresponding to the present stage of adaptation
Diff | A minute incremental value used to increase and decrease newfreq at each iteration to find out in which direction the peak is

At the end of the adaptation, newfreq is returned as the adapted peak frequency and

newenergy is returned as the adapted peak energy. The algorithm starts with more

initialization. The next step is the iteration loop which is set to perform i times as set up

earlier. This is followed by a check for a variable named mult. As long as mult is not

zero, the iteration continues till the end, but if mult is zero, it exits and returns the

frequency and energy found during the present iteration of adaptation. The reason for this

is that mult is set to zero later if it is found that newfreq is exactly on the DTFT peak. The

next step is to increment the newfreq by a small amount and find the energy in the DTFT

at that frequency. This is done not by recalculating the whole transform (including the

inverting procedure) and applying it on the signal, but by only changing the relevant rows

in the present transform according to this new frequency. Then, instead of applying the

whole transform on the signal as was done in previous papers, only the relevant rows are

applied and the new energy is calculated. The same is done for a similar decrement in

frequency. The two energies thus obtained are compared. If the first energy is greater

than the second, then newfreq is incremented by fd/2. If the second energy is greater than

the first, then newfreq is decremented by fd/2. Thus by reducing fd by a factor of two


during each iteration and continuously moving towards the direction of greater energy, it

is possible to be within the required range of the actual peak in the DTFT. This is how

adaptation is carried out for a single frequency. This procedure is repeated for every peak

in every block of every sub-band and these peaks and their energies are stored. The next

step is to arrange them in a time-frequency matrix.
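The per-peak adaptation of figure 5.3.3b can be condensed into a few lines. The sketch below is a minimal illustration written for this text: frequencies are expressed in radians per sample, the probe offset diff is a small constant standing in for the flowchart's 0.001 Hz, and sampling the DTFT at one frequency plays the role of applying a single row of the transformation matrix.

```python
import numpy as np

def dtft_energy(x, freq_rad):
    """Sample the DTFT of block x at one digital frequency (radians/sample)
    and return the squared magnitude, scaled by 1/N as in the transform."""
    n = np.arange(len(x))
    return np.abs(np.dot(x, np.exp(-1j * freq_rad * n)) / len(x)) ** 2

def adapt_peak(x, freq, bin_spacing, iterations, diff=1e-4):
    """Binary-search refinement of a detected peak, following figure 5.3.3b:
    probe the DTFT slightly above and below the current estimate, then step
    half the remaining interval toward the side with more energy."""
    fd = bin_spacing / 2.0
    newfreq = freq
    for _ in range(iterations):
        k = dtft_energy(x, newfreq + diff) - dtft_energy(x, newfreq - diff)
        if k == 0:                # exactly on the peak: stop early (mult = 0)
            break
        fd /= 2.0
        newfreq += fd if k > 0 else -fd
    return newfreq, dtft_energy(x, newfreq)
```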

5.4 - TIME-FREQUENCY REPRESENTATION

As described above, all the frequency peaks for each block of each sub-band and their

energies are stored. The reader will recall that due to decimation, the spectral

characteristics of each band were mirrored around their new Nyquist frequency after

decimation. This means that a conversion is required to convert the detected frequencies

into the actual frequencies present in the original signal sampled at 32kHz. The formula

for conversion is as follows:

$$f_{act} = F \cdot \frac{2^{octaves}}{2^{6}}\left(1 - \frac{f_{dig}}{2\pi}\right) \tag{5.4.1}$$

$f_{dig}$ is the digital frequency that was located and adapted to in the decimated signal, $f_{act}$ is the actual value of this frequency as present in the original signal, and $F$ is the original sampling rate of 32 kHz. Using this equation every adapted peak frequency is converted back to its actual frequency. Now the

adapted frequency peaks of each sub-band as well as their magnitudes are arranged in

two separate matrices. They are both arranged so that every column corresponds to a

block or frame in time. If there are N blocks that were analyzed, then there are N columns

in the matrix. In the first matrix, each column contains the actual adapted frequency

peaks that were found in that block. In the second matrix, each column contains the

corresponding magnitudes of the adapted frequencies. Together, these two matrices form

a time-frequency representation for the music. The music is later re-synthesized using the

data in these time-frequency matrices. This is covered in the next chapter.
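This arrangement can be sketched as follows (a hypothetical helper written for this text; unused entries are simply left at zero):

```python
import numpy as np

def build_tf_matrices(blocks, max_peaks):
    """blocks: list of (freqs_hz, mags) pairs, one pair per analyzed block.
    Column j of each matrix holds the adapted peak frequencies (and their
    magnitudes) found in block j of one sub-band."""
    freq_matrix = np.zeros((max_peaks, len(blocks)))
    mag_matrix = np.zeros((max_peaks, len(blocks)))
    for j, (freqs, mags) in enumerate(blocks):
        freq_matrix[:len(freqs), j] = freqs
        mag_matrix[:len(mags), j] = mags
    return freq_matrix, mag_matrix
```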


Chapter 6 – Synthesis Using Spectral Interpolation

Having created the time frequency matrix as described in the previous chapter, the next

step is to test the quality of this alternative method of representation. It might achieve

good compression, but the question is how good it sounds when it is re-synthesized.

There are a variety of ways to re-synthesize the music using the time frequency matrix.

However not all of them sound equally good. The choice of the re-synthesis method is

therefore not straightforward and a method must be found that is best matched to the

unique sinusoidal model used for analysis. A method that uses spectral interpolation

between frames is chosen because it is found to give the best results.

6.1 – WHY SPECTRAL INTERPOLATION?

There are a variety of methods that can be used for re-synthesizing the music based on

the time-frequency matrix and some of these are outlined in chapter 2.5. On a basic level

the model of additive synthesis shown in figure 2.5.3 is used. Each sub-band has its own

time-frequency matrix with possibly differing block lengths for different sub-bands. Each

time-frequency matrix has consecutive columns representing consecutive frames. Each

frame contains certain frequencies in the frequency matrix and their corresponding

magnitudes are in the magnitude matrix. Since the frequency and magnitude parameters

change frame by frame, the additive synthesis model shown in figure 2.5.3 could be used

by simply changing the parameters at each frame. But the problem with such a simple

method is that it doesn’t take into account the ending phase of the synthesized signal

from one frame and the starting phase of the synthesized signal in the next frame. When

these two frames are synthesized separately and simply pasted together, the result is

almost certainly a discontinuity at every frame boundary. This is heard as a pop at every

frame and when played back at the correct rate is heard as a continuous scratching sound.

As an example, consider a signal with 32kHz sampling rate consisting of three


frequencies. The three frequencies are 1000Hz, 1250Hz and 1500Hz and they are present

in equal proportion. Assume that this signal is analyzed in blocks of 25 ms, which

correspond to 800 samples per block. Focus on the first two blocks consisting of a total of

1600 samples. This is shown in figure 6.1.1.

Figure 6.1.1 Original signal containing three frequencies segmented into two blocks

When these blocks are analyzed, to form a time frequency matrix containing two columns

(for two blocks), it is found that both the blocks have the three frequencies at equal

magnitudes. Assume for the sake of simplicity that the analysis was perfect and that the

frequencies found were exactly 1000Hz, 1250Hz and 1500Hz at a magnitude of 1 each.

This can be represented in two time-frequency matrices as follows:

            | Block 1 | Block 2
Frequency 1 | 1000 Hz | 1000 Hz
Frequency 2 | 1250 Hz | 1250 Hz
Frequency 3 | 1500 Hz | 1500 Hz

Table 6.1.1a Frequency matrix for the first two blocks


            | Block 1 | Block 2
Magnitude 1 | 1 | 1
Magnitude 2 | 1 | 1
Magnitude 3 | 1 | 1

Table 6.1.1b Magnitude matrix for the first two blocks

If the signal for block 1 is simply synthesized with frequencies and magnitudes as shown

using additive synthesis and then similarly, a signal is synthesized for block 2, two

synthesized signals of 800 samples each are obtained for the two blocks. To obtain the

full re-synthesized signal, these blocks are simply pasted next to each other. This re-

synthesized version is very similar in characteristics to the original except that between

the 800th and the 801st sample, there is a change in phase, creating a discontinuity which

results in a pop when it is played back. The last 100 samples out of the 800 synthesized

samples of block 1 are shown in figure 6.1.2a.

Figure 6.1.2a The last 100 samples of the synthesized block 1


Figure 6.1.2b The first 100 samples of the synthesized block 2

The first 100 samples of the 800 synthesized samples of block 2 are shown in figure

6.1.2b. The samples in the vicinity of the 800th sample of the concatenated signal are

shown in figure 6.1.2c.

Figure 6.1.2c The samples around the 800th and 801st sample of the concatenated

synthesized blocks


As can be seen from figure 6.1.2c, there is a discontinuity at the point where the two

separately synthesized blocks are concatenated even for this simple case. In an actual

music signal, it is found that there is usually a slight change in a given frequency between

blocks. As an example, three consecutive blocks may contain the frequencies 500Hz, 505

Hz and 495Hz. This is to be expected as discussed in chapter 2.6. Also, the number of

frequencies in adjacent blocks may not be the same due to detection of new frequencies

as the characteristics of the music change. Under these circumstances there will be severe

discontinuities leading to severe distortion if this simple method is used.
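The discontinuity is easy to reproduce numerically. The sketch below (illustrative only) synthesizes the two 800-sample blocks of the earlier example independently, pastes them together, and compares the step across the block boundary with the largest sample-to-sample step inside a block:

```python
import numpy as np

fs, S = 32000, 800
t = np.arange(S) / fs

# Each block is synthesized independently, so every block restarts at phase 0
block = sum(np.sin(2 * np.pi * f * t) for f in (1000, 1250, 1500))
pasted = np.concatenate([block, block])

boundary_step = abs(pasted[S] - pasted[S - 1])   # jump between samples 800/801
within_step = np.max(np.abs(np.diff(block)))     # largest step inside a block
print(boundary_step, within_step)                # the boundary step is larger
```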

Another way of dealing with this situation is by cross fading between adjacent blocks and

this was mentioned in chapter 2.5.5. This is not a very elegant technique because it only

eliminates the discontinuities but does not join the two blocks in a natural way. Typically

two adjacent blocks will contain very similar, but slightly different frequencies (example

500Hz and 505Hz). If the cross fading technique is used to concatenate the two

synthesized blocks, there will be a period where both the 500Hz and the 505Hz signal are

present in almost equal proportion. This will result in a frequency beating at 5Hz, which

is highly undesirable. If this phenomenon takes place for almost every frequency at each

frame, the result will be a distorted output.

From the study quoted in chapter 2.5 (Jean Claude Risset, 1966), it is known that what is

more likely happening is that the frequency of 500Hz originating from some instrument

is changing to 505Hz over the course of the block due to vibrato and other such effects. It

is therefore preferable to view two such similar frequencies in adjacent frames as being

related to the same source. A more desirable way of synthesizing the output is to

synthesize a frequency of 500Hz, which slowly increases over the course of the frame

and becomes 505Hz by the end of the frame. This may not necessarily be the way in

which it happened in the original signal. For example, the 500Hz frequency could have

been stable during the first half of the frame and then rapidly risen to 505Hz in the

second half of the frame in the actual signal. There is no way of knowing this though,

because the frame is analyzed as a whole and it is impossible to know about such changes


which take place within the frame itself. The best that can be done is to estimate what

happened within the frame and generate a tone whose frequency smoothly changes from

500Hz to 505Hz over the course of the frame. According to the study by Risset quoted in

chapter 2.5, it is also necessary to simulate these frequency variations in the synthesized

tone for overall quality. One way this could be done is by linearly increasing the

frequency from 500Hz to 505Hz over the course of the frame. This is known as linear

spectral interpolation. The problem with this is that over the course of many frames the

slope of the linear interpolation keeps changing giving rise to discontinuities in the curve

of frequency versus time. Since frequency is a derivative of phase, this could lead to

discontinuities in the curve of phase versus time, if it is not handled carefully. This

implies that there are discontinuities in the synthesized signal. Consider table 6.1.2 as an

example. This table shows the variations in frequency over four frames for a single tone

that needs to be generated. The variations considered are quite extreme, but this is only to

demonstrate the method. This table is an example of a simple time-frequency matrix.

          | Block 1 | Block 2 | Block 3 | Block 4
Frequency | 500 Hz  | 600 Hz  | 400 Hz  | 500 Hz

Table 6.1.2 A simple time-frequency matrix (magnitudes not shown)

If the synthesized tone is generated using linear frequency interpolation, the curve of

frequency versus time shown in figure 6.1.3 is used.

$$\text{phase} = 2\pi f t$$
$$s = \sin(\text{phase}) \tag{6.1.1}$$

These equations hold good for a given point in time $t$ where the frequency is found to be $f$. Using these two equations, the output signal $s$ can be synthesized. The output signal,

which is synthesized from a discontinuous frequency curve, is also discontinuous at the

same points that the frequency curve is discontinuous. The second discontinuity is more

obvious and is indicated in figure 6.1.4.


Figure 6.1.3 Frequency Vs Time curve using linear frequency interpolation

Though the frequency variations shown are quite extreme, the principle remains the same

and linear frequency interpolation can cause discontinuities in the synthesized signal if

phase issues are ignored. This problem could be solved by considering a method in which

the starting and ending phases are taken into account for each frame. A cubic phase

interpolation algorithm is used to achieve just this. The basics of this algorithm are

discussed in the paper by McAulay & Quatieri, 1986. In order to use this algorithm

though, the time-frequency matrix must be analyzed and grouped into sets of frequencies

from frame to frame that are related to each other to form a frequency track. This is

explained further in the next section.


Figure 6.1.4 Synthesized signal with discontinuities

6.2 – FRAME TO FRAME PEAK MATCHING

If the number of peaks detected were constant from frame to frame and all the peaks in

one frame were related to the peaks in the next frame (e.g., 500Hz in the first frame and

505Hz in the next frame, 810Hz in the first frame and 805 Hz in the next frame), there

would be no problem of matching parameters from one frame to the next. In reality, there

is side-lobe interaction, which causes spurious peaks, and also there is vibrato effect as

well as the dynamic nature of the music itself which all result in a time varying spectrum.

Typically adjacent frames neither have the same number of peaks nor do they have all the

frequencies related to each other. At this point the matrix can be viewed as being one in

which columns of elements are grouped together since these are basically the frequencies

found in each frame. The aim is to find frequencies between frames that are related and to

arrange all these sets of related frequencies in rows, so that each of these rows is a

frequency track. As an example consider the time-frequency matrix in table 6.2.1. Each

column contains the frequencies identified in that frame in ascending order of frequency.


            | Frame 1 | Frame 2 | Frame 3 | Frame 4
Frequency 1 | 500  | 505  | 405  | 405
Frequency 2 | 805  | 1040 | 504  | 502
Frequency 3 | 1050 | 1502 | 1040 | 1045
Frequency 4 | 1500 | ---- | 1505 | 1500

Table 6.2.1

An algorithm must be developed, which sorts this table in an optimum way so that a more

useful table is obtained that has the frequencies arranged in tracks. In fact table 6.2.1

must be sorted so that it resembles table 6.2.2.

        | Frame 1 | Frame 2 | Frame 3 | Frame 4
Track 1 | 0    | 0    | 405  | 405
Track 2 | 500  | 505  | 504  | 502
Track 3 | 805  | 0    | 0    | 0
Track 4 | 1050 | 1040 | 1040 | 1045
Track 5 | 1500 | 1502 | 1505 | 1500

Table 6.2.2

This table should be viewed not as a collection of columns that contain the frequencies

found in each frame but rather as a collection of rows each containing a frequency track.

Notice that there are now more rows than before, but each row or track can be

synthesized with a tone of frequency varying according to the frequencies indicated in

that row (and also the magnitudes indicated in the magnitude matrix not shown here).

The final synthesized output is the sum of the individual synthesized outputs for each

track. In the case shown in table 6.2.2, there are 5 synthesized tones for 5 tracks, which

are added up in the end to give the total synthesized output. This new matrix can be


referred to as the “matched matrix” because the frequencies have been matched to create

frequency tracks.

The reason why a frequency in a given track is selected to be in that track is because it

satisfies some criterion for proximity with some other frequency in the previous frame of

that track. For example, in the example shown in table 6.2.2, the frequency of 505Hz is

chosen to be in the second frame of the second track because it satisfies a criterion of

proximity (e.g., the two frequencies are within 10 Hz of each other) with the frequency of

500 Hz in the previous frame. Obviously, it is a better match for the second frame in that

track as compared to the frequency of 1040 Hz, which is matched to track 4 in the same

frame. As an example, one condition that could be used is that for a certain frequency to

be chosen to be in a certain track, it must be within ± 10 Hz of the previous frequency in

that track. If this condition alone is used, and if there is a frame where there is more than

one frequency which satisfies this condition in the second track, then this condition alone

is insufficient to perform matching. Also, if there is a frequency in the second track,

which satisfies the condition for more than one frequency in the first track, then this

condition alone, is again insufficient. A set of conditions must be used to find an

optimum way for matching frequency peaks from frame to frame. A three-step procedure

for doing this is outlined below.

It is assumed that for a certain frequency track, matching has been performed up to the kth frame and a match has to be found in the (k+1)th frame. Frequencies are denoted in the form $w_y^x$, where $x$ is the frame number and $y$ is the frequency number in that frame. The frequency being considered is $w_n^k$, which is the nth frequency in the kth frame. A match needs to be found for this frequency in the (k+1)th frame. The following is the three-step procedure:

Step 1

First of all, it is important to remember that when discussing the creation of the new

matched matrix containing frequency tracks, there are actually two matched matrices –

one containing frequencies and the other containing magnitudes. Assume that there are


a total of N frequency peaks in the kth frame and M frequency peaks in the (k+1)th frame. The best frequency peak match in the (k+1)th frame must be found for the nth frequency in the kth frame, which is $w_n^k$. To begin with, some "matching interval" $\Delta$ is set up, which is equivalent to some frequency spacing in Hz within which a potential match must lie with respect to $w_n^k$ to possibly be matched to it. All potential matches in the (k+1)th frame are found by using the following criterion:

$$\left| w_n^k - w_m^{k+1} \right| \le \Delta, \qquad 1 \le m \le M \tag{6.2.1}$$

This condition is checked for all the M frequencies in the (k+1)th frame (except those that

are already matched to some other frequency in the kth frame) and all the frequencies that

satisfy this criterion are stored. If no frequency satisfies this condition, then that

frequency track is considered dead and it ends at that frame. If this is the case, the

frequency is matched to itself in the next frame of the matched matrix containing

frequencies but with zero magnitude in the matched matrix containing magnitudes. This

is required because during synthesis, when a track dies in the kth frame with a frequency

of f Hz, it is synthesized with a linear “fade-out” effect starting from frame k to the

(k+1)th frame. This is easily done by inserting exactly the same frequency in frame k+1

with a magnitude zero, so that the algorithm automatically performs the linear fade-out

effect. If instead of inserting the frequency of f Hz in the frame k+1, it is left with 0 Hz,

the algorithm not only fades out this last frequency but also changes it from f Hz to 0Hz,

which is undesirable. The remaining frames in the matched matrix containing magnitudes

are filled with zeros and step 2 can be skipped. If one or more frequencies are found that are within this matching interval, and are therefore potential matches, one strategy would be to simply match the frequency that is closest to $w_n^k$ and choose it as the best matched frequency. Although this frequency $w_m^{k+1}$ is the closest match for $w_n^k$ in the (k+1)th frame, there could be another frequency in the kth frame other than $w_n^k$, which is a better match for $w_m^{k+1}$. So for optimum frequency peak matching, frequencies are checked for best matching in both the next and the previous frame. This is described in step 2.


Step 2

By the time this step is reached, one or more potential matches in the (k+1)th frame have already been found for the frequency $w_n^k$ in the kth frame. As discussed in step 1, some of these potential matches in the (k+1)th frame may have better matches in the kth frame than $w_n^k$. All potential matches $w_m^{k+1}$ are discarded if they cannot satisfy the criterion in equation 6.2.2.

$$\left| w_m^{k+1} - w_n^k \right| < \left| w_m^{k+1} - w_i^k \right| \qquad \text{for } n < i < N \tag{6.2.2}$$

Here $w_m^{k+1}$ is the potential match in the (k+1)th frame. $i$ must be greater than $n$ because all previous frequencies in the kth frame are presumed to be already matched and there is no need to check them. Only potential matches that satisfy equation 6.2.2 are stored and the rest are discarded. If there is only one frequency left, it is matched to $w_n^k$. If there is more than one frequency, the one that is closest to $w_n^k$ is matched to it. If there are no frequencies left that can be matched to $w_n^k$, that track is considered dead. This procedure is repeated for every frequency in the kth frame so that each frequency is either matched with some frequency in the (k+1)th frame, in which case the track continues, or, if no match is found, the track dies. It should be noted that many other situations are possible and could be considered for better optimization, but to keep the tracker simple, only the situations described above are considered.

Step 3

When all the frequencies in frame k have been tested for matches in the next frame and

have contributed to either tracks which continue or are dead, there may be some left over

frequencies in the (k+1)th frame that have not been matched to any frequency in frame k.

In this case, these frequencies are considered as the first frequencies of a track which

begins in the (k+1)th frame. In other words a track is born in the (k+1)th frame. In this

case a new frequency is created in frame k with zero magnitude. This is done for the same


reason described in step 1 with regard to a track that dies. In this case, when a track is

born, a similar “fade-in” effect is used during synthesis. An illustration of the birth/death

procedure is shown in figure 6.2.1. The result of applying the tracker to a segment of real

speech is shown in figure 6.2.2. This figure illustrates the ability of the tracker to adapt

quickly to voiced and unvoiced regions.

Figure 6.2.1 Illustration of the birth/death procedure

(Taken from McAulay & Quatieri, 1986)

Figure 6.2.2 Typical frequency tracks for real speech

(Taken from McAulay & Quatieri, 1986)

After this procedure is performed for all the elements in the time-frequency matrix, a

matched matrix is obtained. All that remains to be done is to synthesize each row (track)

of this matched matrix and add up all these synthesized tracks to produce the final re-

synthesized output.
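This matching logic can be sketched compactly. The helper below was written for this text rather than taken from the thesis implementation; it applies steps 1 and 2 greedily to one pair of frames, and running it on frames 1 and 2 of table 6.2.1 reproduces the assignments of table 6.2.2, with the 805 Hz track dying.

```python
def match_frame(prev_freqs, next_freqs, delta):
    """Frame-to-frame peak matching in the spirit of steps 1-3: each
    frequency in the previous frame takes the nearest unmatched candidate
    within +/- delta Hz, unless a later previous-frame peak is closer to
    that candidate. Returns {prev_index: next_index or None (track dies)}."""
    matches, taken = {}, set()
    for n, wn in enumerate(prev_freqs):
        # Step 1: candidates inside the matching interval, not yet taken
        cands = [m for m, wm in enumerate(next_freqs)
                 if m not in taken and abs(wn - wm) <= delta]
        # Step 2: drop candidates that match a later prev-frame peak better
        cands = [m for m in cands
                 if all(abs(next_freqs[m] - wn) < abs(next_freqs[m] - wi)
                        for wi in prev_freqs[n + 1:])]
        if cands:
            best = min(cands, key=lambda m: abs(next_freqs[m] - wn))
            matches[n] = best
            taken.add(best)
        else:
            matches[n] = None
    return matches

# Frames 1 and 2 of table 6.2.1:
print(match_frame([500, 805, 1050, 1500], [505, 1040, 1502], delta=10))
# {0: 0, 1: None, 2: 1, 3: 2}: the 805 Hz track dies, the others continue
```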


6.3 SYNTHESIS USING CUBIC PHASE INTERPOLATION

In this section, the actual synthesis method, which uses the matched frequency matrix

obtained from chapter 6.2 is described. Cubic phase interpolation is used for obtaining

the smooth spectral interpolation described in previous chapters. The reasons for using

this spectral interpolation algorithm were discussed in chapter 6.1. The method used for

synthesis is now derived.

In the previous section, only the frequencies and the magnitudes were stored in the form

of matrices. The phase associated with each frequency was discarded. In this section, it is

assumed for the sake of generality that the phase associated with each frequency is also

known and stored. It is easy to modify the algorithm for the case where phase is not taken

into account. The parameters associated with each frequency are the frequency in radians,

the magnitude and the phase, denoted by $w$, $A$, and $\theta$ respectively. Assuming that interpolation is to be performed for frequency track $l$ between the kth and the (k+1)th frames, the parameters associated with these two frames are $(A_l^k, w_l^k, \theta_l^k)$ and $(A_l^{k+1}, w_l^{k+1}, \theta_l^{k+1})$ respectively. A linearly interpolated envelope is used to solve the magnitude interpolation

problem. In other words, the magnitude is interpolated from frame to frame using

equation 6.3.1

$$A(n) = A^k + \frac{(A^{k+1} - A^k)}{S}\, n, \qquad n = 0, 1, \ldots, S-1 \tag{6.3.1}$$

Here $n$ is the time sample into the kth frame and $S$ is the total number of samples per frame. (The track subscript $l$ has been omitted for convenience.) Unfortunately, this simple approach cannot be used to interpolate between frequencies due to the discontinuities which occur as described in chapter 6.1. Also, the measured phase is modulo $2\pi$ and hence phase unwrapping must be performed to ensure that frequency


tracks are “maximally smooth.” Since a cubic equation is used for interpolating phase,

the first step is to propose a function for phase that is a cubic polynomial.

$$\theta(t) = \zeta + \gamma t + \alpha t^2 + \beta t^3 \tag{6.3.2}$$

It is convenient to treat the phase as a function of the continuous time variable t with

respect to some given frame. This phase function must vary smoothly with time so that

the output for that track which is given by equation 6.3.3 also varies with time as

smoothly as possible.

$$s_{track}(t) = \cos[\theta(t)] \tag{6.3.3}$$

It is important to note at this point that phase is directly related to instantaneous

frequency by the relation in equation 6.3.4.

$$f(t) = \frac{d[\theta(t)]}{dt} \tag{6.3.4}$$

where $f(t)$ denotes instantaneous frequency.

In other words, instantaneous frequency is a derivative of phase. Using this relation and

applying it to equation 6.3.2, the following is obtained:

$$w(t) = \gamma + 2\alpha t + 3\beta t^2 \tag{6.3.5}$$

At the starting point of the frame, where t = 0,

$$\theta(0) = \zeta = \theta^k \tag{6.3.6}$$
$$\dot{\theta}(0) = w(0) = \gamma = w^k$$


Here $\theta$ denotes phase and $\dot{\theta}$ denotes the first derivative of phase. In other words, the starting phase of a given frame is the phase that is associated with that frequency for that frame, and it is equivalent to the variable $\zeta$. The starting frequency of that frame, which is actually the frequency detected in that frame, is equivalent to the parameter $\gamma$. These two variables can therefore be set directly with the known information. The variables $\alpha$ and $\beta$ now need to be solved for. Consider the terminal point of the frame where t = T:

$$\theta(T) = \theta^k + w^k T + \alpha T^2 + \beta T^3 = \theta^{k+1} + 2\pi M \tag{6.3.7}$$

$$\dot{\theta}(T) = w(T) = w^k + 2\alpha T + 3\beta T^2 = w^{k+1} \tag{6.3.8}$$

Equation 6.3.7 is based on the fact that at the end of the frame the phase should be equal to the total unwrapped phase found for the next frame. The unwrapped phase is actually equivalent to the phase found, which is $\theta^{k+1}$, added to some integer multiple of $2\pi$, which is directly related to how many cycles were completed in that frame for that track. Equation 6.3.8 is based on the fact that at the end of the frame, the instantaneous frequency should be equal to the beginning frequency of the next frame. There are now two equations in $\alpha$ and $\beta$, and the only variable which is still unknown is M, which basically stands for the number of cycles completed by the frequency track in that frame. At this point, M is unknown, but the equations can be solved for any given value of M using:

$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} \dfrac{3}{T^2} & \dfrac{-1}{T} \\ \dfrac{-2}{T^3} & \dfrac{1}{T^2} \end{bmatrix} \begin{bmatrix} \theta^{k+1} - \theta^k - w^k T + 2\pi M \\ w^{k+1} - w^k \end{bmatrix} \tag{6.3.9}$$

Any M that is suitable can be used to solve for all the required parameters. Some

constraint must be applied to solve for a value of M that will suit the algorithm. At

present there is a family of curves that produce the required result in terms of starting and ending frequencies and phases. Since interpolation between frames needs to be as smooth as possible, the criterion of "maximal smoothness" is used to solve for M. It


seems obvious that the best phase function to choose would be the one that resulted in

least variation in frequency. Therefore a reasonable criterion for “smoothness” of phase is

that equation 6.3.10 is minimized.

$$f(M) = \int_0^T \left[ \ddot{\theta}(t; M) \right]^2 dt \tag{6.3.10}$$

Here $\ddot{\theta}(t;M)$ denotes the second derivative of the phase $\theta(t;M)$ with respect to the time variable $t$ and is an indicator of the rate of change of frequency. The right hand side of the equation basically finds the area under the curve of the square of the rate of change of frequency for that frame for a given value of M. When this area is minimized, the variation in frequency over the whole frame is minimized. The value of M that minimizes this equation is chosen. Although M is integer valued, since $f(M)$ is quadratic in M, the problem can easily be solved by minimizing $f(x)$ with respect to the continuous variable $x$ and then rounding it off to the closest integer to get M. It can be shown that equation 6.3.10 can be minimized using the following:

$$x = \frac{1}{2\pi}\left[ (\theta^k + w^k T - \theta^{k+1}) + (w^{k+1} - w^k)\frac{T}{2} \right] \tag{6.3.11}$$

M is determined by rounding off $x$ to the closest integer and is substituted in equation 6.3.9 to solve for $\alpha(M)$ and $\beta(M)$, which are basically the $\alpha$ and $\beta$ values for that particular value of M. Having solved for $\alpha$ and $\beta$, and knowing $\zeta$ and $\gamma$ as described above, these coefficients are now used to construct the phase function for that frame using:

$$\theta(t) = \theta^k + w^k t + \alpha(M)\, t^2 + \beta(M)\, t^3 \tag{6.3.12}$$

This phase function not only satisfies all the measured phase and frequency end point

constraints but also creates a phase function that is maximally smooth. In chapter 6.2, the

topic of inserting a frequency $w^k$ in the previous frame when a frequency $w^{k+1}$ is born in


some frame, and inserting it in the next frame when it dies, was briefly discussed. This is done by setting the frequency $w^k = w^{k+1}$ and setting the amplitude in that frame to zero ($A^k = 0$). The initial phase is calculated for this frame to ensure that the phase constraints are satisfied at the end of this frame or the beginning of the next frame. Since only the magnitude has to rise but the frequency has to stay constant in this initial frame, the value of starting phase is calculated which forces that to happen. Though the starting and ending frequencies of this frame are the same, the algorithm could introduce slight changes in frequency over the course of the frame to satisfy some phase constraints. The value of phase which will introduce no change in frequency over the course of the frame is calculated by using:

$$\theta^k = \theta^{k+1} - w^{k+1} S \tag{6.3.13}$$

where S is the number of samples per frame. Every frequency track (described in chapter 6.2) in the given matched matrix can be synthesized and then added up to give the final

output using:

$$s(n) = \sum_{l=1}^{L} A_l(n)\, \cos[\theta_l(n)] \tag{6.3.14}$$

where $A_l(n)$ is estimated from equation 6.3.1, $\theta_l(n)$ is calculated using equation 6.3.12, and L is the number of frequency tracks in that matrix.
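The per-frame synthesis of equations 6.3.1 and 6.3.9 through 6.3.12 can be condensed into a short routine. The following is an illustrative sketch for a single track (frequencies in radians per sample, with T taken equal to the frame length S), written for this text rather than taken from the study's implementation:

```python
import numpy as np

def synthesize_frame(theta_k, w_k, A_k, theta_k1, w_k1, A_k1, S):
    """One frame of a track: cubic phase interpolation (eqs. 6.3.9-6.3.12)
    with a linearly interpolated amplitude envelope (eq. 6.3.1)."""
    T = float(S)
    # Eq. 6.3.11: continuous minimizer of the smoothness criterion
    x = ((theta_k + w_k * T - theta_k1) + (w_k1 - w_k) * T / 2.0) / (2 * np.pi)
    M = round(x)                     # integer M honors the stored phases
    # Eq. 6.3.9: solve for alpha(M) and beta(M)
    rhs = np.array([theta_k1 - theta_k - w_k * T + 2 * np.pi * M,
                    w_k1 - w_k])
    alpha, beta = np.array([[3 / T**2, -1 / T],
                            [-2 / T**3, 1 / T**2]]) @ rhs
    n = np.arange(S)
    theta = theta_k + w_k * n + alpha * n**2 + beta * n**3   # eq. 6.3.12
    A = A_k + (A_k1 - A_k) * n / T                           # eq. 6.3.1
    return A * np.cos(theta)                                 # one term of eq. 6.3.14

# A track gliding from 500 Hz to 505 Hz over one 800-sample frame at 32 kHz
w0, w1 = 2 * np.pi * 500 / 32000, 2 * np.pi * 505 / 32000
frame = synthesize_frame(0.0, w0, 1.0, w0 * 800, w1, 1.0, 800)
```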

In this study, the phase information is not important and certain alterations to this algorithm can be made. Since phase information is not used, the value of the starting phase $\zeta$ used for the synthesis of each frame is not important. $\zeta$ could possibly be set to 0 for each frame, thereby forcing the phase to be zero at the beginning of each frame, but this does not offer any advantages as such. To better understand how the value of $\zeta$ can be chosen so that it is useful in this study, consider the effect of using different values of M. Figure 6.3.1 shows the variation of phase over the course of a frame for different values of M.


It should be noted that all the curves shown in the figure satisfy the required conditions of

starting and ending phases as well as starting and ending frequencies. They all also result

in smooth curves without discontinuities at frame boundaries. The only difference

between these curves is the variation of frequency during the course of the frame.

Depending on the value of M, some of the curves have a large variation in frequency in

the course of the frame (e.g., the case where M = 0 in figure 6.3.1) and some have very

little variation (e.g. the case of M = 2 in figure 6.3.1). The figure shows the variation in

phase, but since instantaneous frequency is just a derivative of phase, an almost linear

phase curve indicates very little change in frequency over the course of the frame.

Figure 6.3.1 Variation of phase over one frame for different values of M using cubic

phase interpolation

(Taken from McAulay & Quatieri, 1986)

When x is calculated as shown in equation 6.3.11, a value is obtained, which if

substituted for M in equation 6.3.9, produces values of α and β which give the

smoothest or most linear curve in phase which in turn translates to the smoothest curve in


frequency. This value could not be used previously because M was required to be an

integer so that the starting and ending phases required by the stored phase information

could be obtained. So, x was rounded off to the closest integer to get M. Now that there

are no phase concerns, the value x is used without round-off to get the smoothest

possible curve. The starting and ending phases need careful handling though. In the very

first frame, a starting phase of zero is assigned. Since a non-integer value x is used,

instead of M, a non-zero ending phase is obtained for the first frame. The signal is synthesized so that the ending phase of the first frame becomes the starting phase of the next frame. This is done in every subsequent frame. In this manner, there are no

discontinuities and the smoothest possible phase curve as well as frequency curve over all

the frames is obtained. The final synthesis is done in the same manner as before, using

equations 6.3.12 and 6.3.14. A model for additive synthesis was shown in figure 2.5.3.

With this method, a slightly changed model is proposed in which the frequencies and

amplitudes for each block are not fixed. The amplitude envelope varies in a linear fashion

for each frame according to equation 6.3.1 and the frequencies vary smoothly using the

cubic phase interpolation algorithm. The output is the sum of all the tracks generated.

Also to be noted is the fact that each sub-band is synthesized separately using the model

shown above. The final output is the sum of all the re-synthesized sub-bands. If the

synthesized sub-bands are y1, y2,…, y6, the total output is given by:

$$y(n) = y1(n) + y2(n) + y3(n) + y4(n) + y5(n) + y6(n) \tag{6.3.15}$$

$y(n)$ is the re-synthesized output signal.


Chapter 7 – Results and Conclusions

The methods described in previous chapters were used to analyze and represent various

kinds of audio signals. To test the effectiveness of this representation, re-synthesis was

performed so that the original version could be compared to the synthesized version. The

algorithm allowed flexibility for setting up some of the parameters. The number of sub-

bands into which the signal was divided was fixed at six. The window lengths within a

sub-band were not variable, but each sub-band could be set up to use any desired window length (as long as it was of the form $10 \times 2^n$ ms). Also the maximum number of peaks

within a sub-band could be fixed according to the type of audio signal being analyzed

(found empirically) and the amount of compression required. Another parameter that

could be set was the number of iterations performed during adaptation and this could be

set to a different value for different sub-bands (depending on the frequency resolution of

the human ear). Different settings revealed different results for different types of audio

signals. Four audio signals were analyzed and re-synthesized. They were as follows:

a) A classical music piece

b) A guitar chord

c) A clarinet patch from a synthesizer

d) Speech

A significant assumption in the analysis/synthesis procedure was the fact that the phase

information found during analysis was not required during synthesis and was therefore

discarded. In order to validate this, listening tests were performed on seven subjects to

determine the effects of including and omitting phase. The results of these listening tests

are shown in chapter 7.5. Also, as mentioned in previous chapters, one of the important

improvements in this algorithm was that of using frequency resolution of the human ear

to reduce the number of iterations performed and thereby reducing computational

complexity. The frequency resolution of the human ear though, was calculated using

graphs that showed frequency resolution of the ear for single tones. Practically these


values could be different when the ear is presented with complex tones. Listening tests

were performed with varying numbers of iterations to verify that the values used were

good estimates. The results are shown in chapter 7.5.

The remaining parameters of ‘block size’ and ‘maximum number of frequency peaks’

were determined empirically to produce the best results for the given data. The

parameters used for each audio signal as well as the results of analysis/re-synthesis are

now provided. Also, the details of the computational power that was saved as compared

to the Matt Kotvis algorithm are provided. The main reasons why computational

complexity was reduced were that matrix inversion was avoided as well as the fact that

the number of iterations performed while adapting to a peak was reduced. It should be

noted that each iteration involves sampling the DTFT of the signal three times. The

amount of computational power required for matrix inversion depends on how well the

process is optimized. The exact numbers of cycles saved are therefore not calculated but

the details of the processes saved are listed below.

7.1 – CLASSICAL MUSIC

The piece of music chosen for this category was a violin concerto by Bruch (track 5). The

set of parameters that gave the best results is shown in table 7.1.1.

Block | Block Size (ms) | Block Size (samples) | Iterations | Max. Frequency Peaks
0 – 500 Hz | 80 | 80 | 5 | 6
500 – 1000 Hz | 80 | 80 | 4 | 6
1 – 2 kHz | 40 | 80 | 3 | 8
2 – 4 kHz | 40 | 160 | 2 | 12
4 – 8 kHz | 40 | 320 | 2 | 14
8 – 16 kHz | 20 | 320 | 3 | -

Table 7.1.1 – Parameters used for classical music


These were the values selected at the beginning of the algorithm. The compression ratio

can be calculated by comparing the number of samples stored in the time-frequency

matrix for one second with the number of samples in time stored for one second in the

original signal. Since the sampling rate is 32 kHz, the number of samples per second in

the original signal is 32000. In a given sub-band of the time-frequency matrix:

Number of frequencies per second < (max frequency peaks) × (blocks per second) (7.1.1)

The ‘<’ operator is used because the psychoacoustic model usually discards some data,

but it is not possible to know exactly how much beforehand. The total number of

frequencies stored per second is calculated as the sum of the maximum number of

frequencies stored per second for each block.

Total frequencies/sec = (6 × 12.5) + (6 × 12.5) + (8 × 25) + (12 × 25) + (14 × 50) = 1350

Figure 7.1.1 (a) The original music signal and (b) the synthesized signal

There is also a separate magnitude matrix containing the magnitudes of these frequencies,

which contains as many magnitudes as frequencies. So the total number of samples

required to be stored per second is double the above number, which is 2700.


Compression ratio = 32000/2700

Compression Ratio > 11:1

A compression ratio greater than 11:1 can be achieved with real recorded music. The

comparison between the original signal and the synthesized signal is shown in figures

7.1.1a and b.
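This estimate can be reproduced with a small hypothetical helper (written for this text; the blocks-per-second values follow the worked example above):

```python
def compression_ratio(max_peaks, blocks_per_sec, fs=32000):
    """Upper bound on stored samples/second: one frequency plus one
    magnitude per retained peak per block (the bound behind eq. 7.1.1)."""
    stored = 2 * sum(p * b for p, b in zip(max_peaks, blocks_per_sec))
    return fs / stored

# Classical-music settings; prints 11.85..., i.e. a ratio better than 11:1
print(compression_ratio([6, 6, 8, 12, 14], [12.5, 12.5, 25, 25, 50]))
```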

Computational Power Saved

Matrix inversions saved per second: 50 80x80 matrices, 25 160x160 matrices, and 25

320x320 matrices

Iterations saved per second: 6575

7.2 – GUITAR CHORD

A clean electric guitar playing a chord was recorded via microphone. This piece was

analyzed and re-synthesized. The parameters used are shown in table 7.2.1.:

Block | Block Size (ms) | Block Size (samples) | Iterations | Max. Frequency Peaks
0 – 500 Hz | 40 | 40 | 5 | 5
500 – 1000 Hz | 40 | 40 | 4 | 6
1 – 2 kHz | 40 | 80 | 3 | 8
2 – 4 kHz | 40 | 160 | 2 | 12
4 – 8 kHz | 40 | 320 | 2 | 14
8 – 16 kHz | 20 | 320 | 3 | -

Table 7.2.1 – Parameters used for the guitar chord

Compression ratio > 32000/(5+6+8+12+14)(2)(25)

Compression ratio > 14:1

The results for analysis and re-synthesis of the guitar chord are shown in figures 7.2.1a

and b.


Figure 7.2.1 (a) Original signal and (b) Synthesized signal

Computational Power Saved

Matrix inversions saved per second: 50 40x40 matrices, 25 80x80 matrices, 25 160x160

matrices, and 25 320x320 matrices

Iterations saved per second: 8125

7.3 CLARINET PATCH FROM SYNTHESIZER

A clarinet patch was selected from a Korg X1D synthesizer. It was a single note played

for approximately 1 second and the parameters used are shown in table 7.3.1:

Block | Block Size (ms) | Block Size (samples) | Iterations | Max. Frequency Peaks
0 – 500 Hz | 40 | 40 | 5 | 3
500 – 1000 Hz | 40 | 40 | 4 | 4
1 – 2 kHz | 40 | 80 | 3 | 7
2 – 4 kHz | 40 | 160 | 2 | 9
4 – 8 kHz | 40 | 320 | 2 | 9
8 – 16 kHz | 20 | 320 | 3 | -

Table 7.3.1 – Parameters used for the clarinet patch


Compression ratio > 32000/[(3 + 4 + 7 + 9 + 9) x 2 x 25] = 32000/1600 = 20

Compression ratio > 20:1

The results for analysis and re-synthesis of the clarinet patch are shown in figures 7.3.1a

and b.

Figure 7.3.1 (a) Clarinet original signal and (b) the synthesized signal

Computational Power Saved

Matrix inversions saved per second: 50 40x40 matrices, 25 80x80 matrices, 25 160x160

matrices, and 25 320x320 matrices

Iterations saved per second: 5600

7.4 – SPEECH

For the case of speech, three different compression levels were achieved by varying the maximum number of frequency peaks allowed in each case. The speech sample used was the sentence “Check this out” spoken by the author. Table 7.4.1 shows the parameters used for speech.


Block           Block Size (ms)   Block Size (samples)   Iterations   Max. Frequency Peaks
                                                                      Set 1   Set 2   Set 3
0 – 500 Hz      20                20                     5            4       4       3
500 – 1000 Hz   20                20                     4            8       8       5
1 – 2 kHz       40                80                     3            12      12      6
2 – 4 kHz       40                160                    2            10      10      6
4 – 8 kHz       40                320                    2            14      -       -
8 – 16 kHz      20                320                    3            -       -       -

Table 7.4.1 – Parameters used for speech

Since the two lowest sub-bands use 20 ms blocks (50 blocks per second) while the remaining sub-bands use 40 ms blocks (25 blocks per second), each sub-band's own block rate must be used in the calculation.

Set 1

Compression ratio > 32000/[2 x ((4 x 50) + (8 x 50) + (12 x 25) + (10 x 25) + (14 x 25))] = 32000/3000 ≈ 10.7

Compression ratio > 10:1

Set 2

Compression ratio > 32000/[2 x ((4 x 50) + (8 x 50) + (12 x 25) + (10 x 25))] = 32000/2300 ≈ 13.9

Compression ratio ≈ 14:1

Set 3

Compression ratio > 32000/[2 x ((3 x 50) + (5 x 50) + (6 x 25) + (6 x 25))] = 32000/1400 ≈ 22.9

Compression ratio ≈ 23:1

On listening to both the original and the synthesized versions, it was found that the quality of the synthesized version was quite good, except for a hollow, reverberant quality. The more compression was applied, the more prominent this reverberant quality became, up to the point where the speech sounded very unnatural yet remained quite intelligible. The results for the first set of parameters (using the highest number of frequency peaks) are shown in figures 7.4.1a and b.


Figure 7.4.1 (a) Original speech signal and (b) the synthesized version (set 1)

Computational Power Saved (Set 1)

Matrix inversions saved per second: 100 20x20 matrices, 25 80x80 matrices, 25

160x160 matrices, and 25 320x320 matrices

Iterations saved per second: 10300

7.5 – LISTENING TESTS

Listening tests were performed to test the validity of two assumptions made in this study. The first assumption was that the frequency resolution curve used to calculate the number of iterations held for complex audio signals. The second assumption was that phase information was not perceptually important for the sinusoidal model used, and could therefore be discarded.

Frequency Resolution:

To test the validity of the first assumption, each signal was synthesized five times, using five different numbers of iterations. The number of iterations performed originally


for each signal is shown in tables 7.1.1, 7.2.1, 7.3.1, and 7.4.1. Each of the four audio signals was re-synthesized with an additional +2, +1, 0, -1, and -2 iterations for each band shown in the tables. The objective was to test whether subjects could identify a difference in quality when more or fewer iterations were performed than the number calculated from the frequency resolution curve (chapter 5.3). ABX tests were conducted in which one of the signals was always the version synthesized with the highest number of iterations (+2) and the other was the same signal synthesized with a lower number of iterations (given in the column marked Iterations). The results are shown in the following tables; an ‘X’ marks a correct answer by that subject.

CHORD

Iterations   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7
+1           -           -           -           -           -           -           -
+0           -           -           -           -           -           -           -
-1           X           -           X           -           -           -           X
-2           -           -           X           X           -           -           X

CLARINET

Iterations   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7
+1           -           -           X           -           X           -           -
+0           -           -           -           -           -           -           -
-1           -           -           X           -           X           X           -
-2           -           -           -           -           X           -           -

CLASSICAL

Iterations   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7
+1           -           -           -           -           -           -           -
+0           -           -           -           -           -           -           -
-1           -           -           -           -           -           -           -
-2           -           -           -           -           X           -           -


SPEECH

Iterations   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7
+1           -           -           -           -           -           -           -
+0           -           -           -           -           -           -           -
-1           -           -           -           -           -           -           -
-2           -           -           -           -           -           -           -

Table 7.5.1 – Results of listening tests for frequency resolution

From table 7.5.1, the total number of correct responses over all four audio signals is shown in table 7.5.2 for each number of iterations, compared in each case against the +2-iteration version. Responses are shown as a ratio of the number of correct responses to the highest possible number of correct responses.

Number of iterations   Number of correct responses
+1                     2/35
+0                     0/35
-1                     6/35
-2                     5/35

Table 7.5.2 – Summed-up results of listening tests for frequency resolution

It can be seen from the table that subjects tended to answer correctly more often for lower numbers of iterations. For the case where the number of iterations was as calculated from the graph (+0), there were no correct answers. This suggests that, for the control group used, the number of iterations used originally was sufficient.

Phase:

To test the effect of including phase and using it during re-synthesis, listening tests were performed. The same seven subjects were presented with the original signal, a re-synthesized version that discarded phase, and a re-synthesized version that used phase. This was done for all four signals, and subjects were asked to compare the two


re-synthesized versions and state which one they thought was better (or closer in quality to the original). There was also an option to mark neither if they could not tell the difference. The results are presented in table 7.5.3; in each case, the total number of subjects who marked that choice is shown.

Signal      Version without phase preferred   Version with phase preferred   Neither preferred
Chord       4/7                               2/7                            1/7
Clarinet    7/7                               0/7                            0/7
Classical   3/7                               1/7                            3/7
Speech      6/7                               0/7                            1/7
Total       20/28                             3/28                           5/28

Table 7.5.3 – Results of listening tests for phase

From the table, it is seen that a majority of the control group preferred the re-synthesized versions without phase. The reason is that re-synthesized versions that used phase tended to be slightly off-pitch, for the following reason. All sub-bands were processed using short time frames. For low frequencies (e.g. 100 Hz), the number of cycles completed in one short time frame (e.g. 40 ms) is very small (e.g. 4 cycles). If phase information is used during re-synthesis (e.g. 0 radians at the start of one frame and π/2 radians at the start of the next), then the number of cycles completed in that frame must differ slightly (e.g. 4.25 cycles instead of 4) in order to maintain the phase relationship. For low frequencies and short windows, this produces a slight pitch shift, because it effectively changes the number of cycles per second (the frequency) being generated. Since the higher frequencies are hardly affected by this problem, they stay intact, and the result is an overall off-tune effect. In general, the results of this listening test suggest that it may be best to ignore phase in the particular method of analysis/re-synthesis used in this study.
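The size of this pitch shift is easy to quantify. The following sketch (illustrative; the variable names are not from the thesis code) computes the effective frequency for the example above, a 100 Hz component in a 40 ms frame forced to advance from 0 to π/2 radians:

import math

f_nominal = 100.0     # Hz, the example component
frame = 0.040         # s, frame length
dphi = math.pi / 2    # required phase advance across the frame

# The sinusoid must complete a whole number of cycles plus dphi/(2*pi)
# within the frame; pick the cycle count closest to the nominal one.
frac = dphi / (2 * math.pi)           # 0.25 cycle
k = round(f_nominal * frame - frac)   # 4 whole cycles
f_effective = (k + frac) / frame      # 106.25 Hz
cents = 1200 * math.log2(f_effective / f_nominal)
print(f_effective, round(cents))      # 106.25 Hz, about +105 cents

A shift of over a semitone on a 100 Hz component, while the high harmonics stay in place, readily explains the off-tune quality reported by the subjects.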


7.6 – CONCLUSIONS

The sinusoidal model for representing audio signals used in this study is found to be useful for certain types of signals. From the results, it performs quite well for signals that are not very complex, such as the guitar chord, the synthesizer patch, and to a certain extent the speech sample; the results for the synthesizer patch in particular were excellent. On the other hand, for audio signals containing a large number of frequencies, the algorithm did not perform as well as expected. The reason is that nearby frequencies interfere with each other, shifting the frequency peaks and creating spurious peaks. This in turn happens because the signal is analyzed in short segments, or windows, so the actual spectrum is convolved with the window spectrum. Here the window is rectangular, which seems to give the best results: although its side-lobes are high, its main-lobe width is small. The more frequencies a signal contains, and particularly the more closely they are spaced, the worse the algorithm performs, for the reasons stated above.

Though the procedure of dividing the signal into sub-bands has its merits, it causes one problem. The first step in dividing the signal into sub-bands is the filter bank, which consists of complementary high-pass and low-pass filters combined with down-sampling operators, as shown in figure 5.3.1. Effectively, the signal is divided into bands by high-pass and low-pass filtering at 8 kHz, 4 kHz, 2 kHz, 1 kHz and 500 Hz. The filter used for these operations is a high-order elliptic filter, designed with very high attenuation at these frequencies in order to avoid aliasing. Consequently, any components of the audio signal at these frequencies are themselves strongly attenuated. As an example, when a guitar chord whose fundamental is at 500 Hz is analyzed and re-synthesized, the results are very poor. The problem is especially severe because the band-edge frequencies are musically correlated: since they are each an octave apart, they


correspond to different octaves of the same note. As a result, if this note is present in the music, it is reproduced very poorly.

7.7 – IMPROVEMENTS

The method can be used with more success if careful research is done on the window (or frame) lengths required for different types of signals. In general, a window length that is as small as possible without damaging the spectral characteristics of the windowed signal is desirable. If an optimization procedure is used to adapt to the signal and determine an optimum window length for each of its sub-bands, the algorithm should perform better.

Variable window length within a sub-band, depending on the characteristics of the signal during that period of time, would also improve performance. In this method, long windows would be used while the signal is fairly stationary, and the algorithm would shift to shorter windows while the signal is undergoing rapid changes, as sketched below.
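A minimal sketch of such a scheme follows. It is a hypothetical heuristic (the function name, sub-frame size and threshold are illustrative, not taken from this study): a short analysis block is chosen when the short-term energy changes rapidly, and a long one otherwise.

import numpy as np

def choose_block_size(x, fs, short_ms=20, long_ms=40, flux_thresh=2.0):
    """Illustrative heuristic: return a short analysis block length (ms)
    when the short-term energy of x changes rapidly, else a long one.
    x: 1-D array of samples, fs: sampling rate in Hz."""
    x = np.asarray(x, dtype=float)
    hop = max(1, int(0.010 * fs))                  # 10 ms sub-frames
    n = (len(x) // hop) * hop
    e = np.sum(x[:n].reshape(-1, hop) ** 2, axis=1) + 1e-12
    flux = np.max(e[1:] / e[:-1]) if len(e) > 1 else 1.0
    return short_ms if flux > flux_thresh else long_ms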

Another factor worth researching is the type of window applied to each frame. In this study the rectangular window was used because it gave better results than conventional windows such as the Hamming and Hann windows. It is possible, though, that different types of windows suit different kinds of signals. A method could be developed that adapts to the nature of the signal and selects the most suitable window; the main-lobe/side-lobe trade-off involved is illustrated below.
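The trade-off described above can be checked numerically. The following sketch is illustrative (the 80-sample block corresponds to the 40 ms blocks used in the 1 – 2 kHz band) and compares main-lobe width and peak side-lobe level for three candidate windows:

import numpy as np

N, pad = 80, 8192  # 80-sample block, heavily zero-padded for a smooth spectrum
for name, w in [("rectangular", np.ones(N)),
                ("hamming", np.hamming(N)),
                ("hann", np.hanning(N))]:
    W = np.abs(np.fft.rfft(w, pad))
    W /= W.max()
    # first local minimum of the magnitude spectrum marks the main-lobe edge
    null = next(i for i in range(1, len(W) - 1)
                if W[i] <= W[i - 1] and W[i] <= W[i + 1])
    side_db = 20 * np.log10(W[null:].max() + 1e-12)
    print(f"{name:11s} main-lobe half-width ~{null} bins, peak side lobe {side_db:5.1f} dB")

The rectangular window shows roughly half the main-lobe width of the other two, at the cost of much higher side lobes, which is consistent with the behaviour reported in this study.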

One problem related to both compression and processing time is that in the higher octaves, more frequencies are adapted to. This is because the actual range of frequencies in higher octaves is broader, and more frequencies are required to make the synthesized version sound natural. This contradicts the notion that most of the information is contained in the low and mid octaves up to around 4 kHz, and that these should therefore be analyzed with the highest number of frequency peaks. In the higher octaves, at


least for music, the information behaves more like shaped white noise (e.g. cymbals). If the higher octaves were modeled with this shaped-noise model instead of the pure sinusoidal model used in the lower octaves, better results and more compression could be obtained, since frequency peaks would then be needed only to shape the noise.
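A minimal sketch of such a shaped-noise model follows (illustrative only; the function and its envelope parameterization are assumptions, not part of this study). White noise is generated and its spectrum is weighted by a coarse magnitude envelope, so only a few envelope points need to be stored instead of many sinusoidal peaks.

import numpy as np

def shaped_noise(env_db, fs, duration, seed=0):
    """Synthesize a noise band whose magnitude spectrum follows a coarse
    envelope instead of a sum of individual sinusoids.
    env_db: target magnitudes in dB at points spread evenly over 0..fs/2."""
    rng = np.random.default_rng(seed)
    n = int(fs * duration)
    spectrum = np.fft.rfft(rng.standard_normal(n))      # flat (white) spectrum
    pts = np.linspace(0, len(env_db) - 1, len(spectrum))
    gain = 10.0 ** (np.interp(pts, np.arange(len(env_db)), env_db) / 20.0)
    return np.fft.irfft(spectrum * gain, n)

# Example: a band whose energy rolls off by 24 dB across its width
y = shaped_noise(env_db=[0.0, -6.0, -12.0, -24.0], fs=32000, duration=1.0)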

7.8 – SUMMARY

The aim of the study was to develop a method to decompose audio signals into their individual frequency components and show their variation in time. The application

explored was audio compression. The main problem in the developed method was the

fact that a finite length window in time always creates spectral spreading. Among all the

parameters mentioned in the various tables shown in this chapter, block length (window

length) was probably the most crucial factor. If it was too large, the reconstructed signal

tended to sound smeared in time, with no transients. If it was too small, the frequencies

used to reconstruct the piece were off because of increased spectral spreading.

The main advantage offered by this algorithm (in addition to the greatly reduced

computations) is the fact that window lengths are adjustable, which was not the case in

earlier work. Other parameters such as maximum number of frequency peaks, number of

iterations performed, type of window used and the actual psychoacoustic model used can

also be changed as required. The algorithm is set up perfectly for further research into

finding optimum values for these parameters. Once these parameter values are optimized

and a balance between the spectral spreading problem and the time resolution problem is

achieved, the actual time-frequency representation will approach the ideal case. It will

then be possible to alter the characteristics of the signal in a great variety of ways by modifying any component directly. Applications such as pitch shifting, time compression and expansion, and even pitch correction could then be implemented easily.

In conclusion, the next step towards improving this method is to automate the process of finding optimum values for the parameters described above. Once the time-frequency representation is as good as possible, any such application becomes straightforward.


APPENDIX A

The transfer functions used for the low-pass and high-pass filters are given below. The high-pass filter is obtained from the low-pass filter by the substitution z → −z.

Low-Pass Filter

\[
H_{LP}(z) = \frac{0.0077 + 0.0408z^{-1} + 0.1030z^{-2} + 0.1581z^{-3} + 0.1581z^{-4} + 0.1030z^{-5} + 0.0408z^{-6} + 0.0077z^{-7}}{1 - 1.8692z^{-1} + 3.1311z^{-2} - 3.2124z^{-3} + 2.5859z^{-4} - 1.4616z^{-5} + 0.5652z^{-6} - 0.1200z^{-7}}
\]

High-Pass Filter

\[
H_{HP}(z) = \frac{0.0077 - 0.0408z^{-1} + 0.1030z^{-2} - 0.1581z^{-3} + 0.1581z^{-4} - 0.1030z^{-5} + 0.0408z^{-6} - 0.0077z^{-7}}{1 + 1.8692z^{-1} + 3.1311z^{-2} + 3.2124z^{-3} + 2.5859z^{-4} + 1.4616z^{-5} + 0.5652z^{-6} + 0.1200z^{-7}}
\]
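As a quick sanity check on these transfer functions (a sketch added here, not part of the original study), the following Python code evaluates both filters with scipy near DC, at the half-band crossover, and near Nyquist. The low-pass gain should be close to 0 dB at DC with a deep null at Nyquist, and the high-pass (the same filter with z replaced by −z) should mirror it:

import numpy as np
from scipy.signal import freqz

# Coefficients transcribed from the transfer functions above
b_lp = [0.0077, 0.0408, 0.1030, 0.1581, 0.1581, 0.1030, 0.0408, 0.0077]
a_lp = [1.0, -1.8692, 3.1311, -3.2124, 2.5859, -1.4616, 0.5652, -0.1200]
# High-pass: substitute z -> -z, i.e. alternate the coefficient signs
b_hp = [c * (-1) ** k for k, c in enumerate(b_lp)]
a_hp = [c * (-1) ** k for k, c in enumerate(a_lp)]

w = np.array([1e-9, np.pi / 2, np.pi - 1e-9])   # ~DC, crossover, ~Nyquist
for name, b, a in [("low-pass ", b_lp, a_lp), ("high-pass", b_hp, a_hp)]:
    _, h = freqz(b, a, worN=w)
    print(name, np.round(20 * np.log10(np.abs(h) + 1e-12), 1))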


References:

1) Cohen, L. (1995), Time-Frequency Analysis, Prentice Hall PTR, New Jersey
2) Lindquist, C.S. (1989), Adaptive and Digital Signal Processing, Steward & Sons, Miami
3) Proakis, J.G. & Manolakis, D. (1988), Introduction to Digital Signal Processing, Macmillan Publishing Company, New York
4) Rabiner, L.R. & Schafer, R.W. (1978), Digital Processing of Speech Signals, Prentice Hall Inc., Englewood Cliffs, New Jersey
5) Vaidyanathan, P.P. (1993), Multirate Systems and Filter Banks, Prentice Hall Inc., Englewood Cliffs, New Jersey
6) Pohlmann, K.C. (1995), Principles of Digital Audio, McGraw-Hill Inc., New York
7) Dodge, C. & Jerse, T.A. (1997), Computer Music, Simon & Schuster Macmillan, New York
8) Roederer, J.G. (1995), The Physics and Psychophysics of Music, Springer-Verlag, New York
9) Tobias, J.V. (1970), Foundations of Modern Auditory Theory, Academic Press Inc., New York
10) Kotvis, M. (1997), An Adaptive Time-Frequency Distribution with Applications for Audio Signal Separation, Masters Thesis, Dept. of Music Engineering, University of Miami
11) Dologlou, I., Bakamidis, S. & Carayannis, G. (1996), Signal Decomposition in Terms of Non-Orthogonal Sinusoidal Bases, Signal Processing, 51, p. 79–91
12) Jeong, H. & Ih, J.-G. (April 1999), Implementation of a New Algorithm Using the STFT with Variable Frequency Resolution for the Time-Frequency Model, J. Audio Eng. Soc., 47, No. 4, p. 240–250
13) Portnoff, M.R. (Feb 1980), Time-Frequency Representation of Digital Signals and Systems Based on Short-Time Fourier Analysis, IEEE Transactions on Acoustics, Speech and Signal Processing, 28, No. 1, p. 55–69
14) Vaidyanathan, P.P. (Jan 1990), Multirate Digital Filters, Filter Banks, Polyphase Networks, and Applications: A Tutorial, Proceedings of the IEEE, 78, No. 1, p. 56–93
15) McAulay, R.J. & Quatieri, T.F. (1984), Magnitude-Only Reconstruction Using a Sinusoidal Speech Model, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 34, p. 27.6.1–27.6.2
16) McAulay, R.J. & Quatieri, T.F. (1986), Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, 34, p. 744–754
17) Serra, M.H., Rubine, D. & Dannenberg, R. (March 1990), Analysis and Synthesis of Tones by Spectral Interpolation, J. Audio Eng. Soc., 38, No. 3, p. 111–128