HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS
Master’s Thesis submitted to the faculty of University of Miami in partial
fulfillment of the requirements of the degree of Master of Science
by
Arvind Venkatasubramanian
Music Engineering Technology, Frost School of Music, University of Miami, P.O. Box 248165, Coral Gables, FL 33124-7610
May 2005

Research Advisor:
Mr. Colby N. Leider, Assistant Professor, Music Engineering Technology, Frost School of Music, University of Miami, Coral Gables

Thesis Panel:
Mr. Kenneth C. Pohlmann, Director of Music Engineering, Frost School of Music, University of Miami, Coral Gables
Dr. Edward P. Asmus, Associate Dean, Graduate Music Studies, Frost School of Music, University of Miami, Coral Gables
University of Miami
Master’s Thesis submitted to the faculty of University of Miami in partial fulfillment of the requirements of the degree of Master of
Science
HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS
Approved:

_____________________________
Ken C. Pohlmann
Director of Music Engineering

_____________________________
Colby N. Leider
Assistant Professor of Music Engineering

_____________________________
Dr. Edward P. Asmus
Associate Dean of Graduate Studies
VENKATASUBRAMANIAN, ARVIND (M.S., Music Engineering)
High-Fidelity, Analysis-Synthesis Data Rate Reduction for Audio Signals (May 2005)

Abstract of the master's research project at the University of Miami.
Research project supervised by Assistant Professor Colby Leider.
Number of pages in text: 146

Powerful music-compression algorithms have facilitated greater audio data reduction
today. This thesis describes a basic communication system that synthesizes the part of the
audio information to which the human auditory system is relatively insensitive and adds it
to the audio information to which the ear is sensitive, thus maintaining the originality of
the signal in the sensitive region. One of the biggest advantages available when writing
data-reduction algorithms today is the ability to avoid coding music data that humans do
not hear. The Fletcher-Munson curves depict the loudness-perception behavior of the
human auditory system: the auditory system is much more sensitive to the mid-frequency
region than to the low-frequency and high-frequency regions of the human hearing range.
In the proposed coder, the mid-frequency PCM information from the audio data is
transmitted unaltered through the channel. This involves modulation at the transmitter end
to reduce the sampling rate over the channel and demodulation at the receiver end to
recover the message. A sinusoidal model is used to synthesize the audio data
corresponding to the low-frequency and high-frequency regions. The sinusoidal model
involves a short-time Fourier analysis that extracts meaningful parameters, which are fed
into an oscillator at the synthesis end to reconstruct the sound. Modifications are therefore
possible before resynthesis.
This method is called the two-filter method, and the application is an alternative to existing
audio data-reduction algorithms that rely on psychoacoustic models and perceptual coding.
Because no approximation is made to the signal that lies in the region of greatest
sensitivity, the application tends to be perceptually transparent. Low frequencies have
poorer spectral resolution than high frequencies. The two-filter model was therefore
improved by downsampling the input, causing a pitch shift. This enables the SMS
system [2] to track clearer sinusoids, and because the input is downsampled, the data and
computation costs are halved. Modifications are applied to recover the original time
length.
Another model, called the four-filter method, which outperforms the partial
analysis/synthesis system (PSMS), was simulated based on the duplex theory of pitch
perception by using a variable time-frame length for different spectral bands. The four-
filter method was then improved further: in high-frequency regions the ear follows the
amplitude envelope of the frequency spectrum and disregards its phase content, and this
psychoacoustic evidence was exploited in the sinusoidal model by discarding the
high-frequency phase parameters that would otherwise be fed into the oscillator. The
results showed that the four-filter method is better than the two-filter method both in
quality and in data-rate reduction. High frequencies synthesized without phase information
sounded the same as high frequencies synthesized with phase, showing that the phase of
high-frequency signals has little perceptual importance.
DEDICATION
To human empiricism that protects humanity, accepting the pluralism, as a mark of respecting the uncertainty that rules my humbleness & humility.
ACKNOWLEDGEMENT

I would like to thank my family, relatives, and friends. I would like to extend a word of
thanks to Dr. Srinivasan Narasimhan. I acknowledge the thesis committee members for
their time. I would like to thank my academic mentor, Ken Pohlmann, for being patient,
giving me enough time, encouraging me at the right times, and for his understanding. I
would like to thank and appreciate my thesis supervisor, Colby Leider, for introducing me
to intellectual music and for guiding me through this research by contributing his ideas.
His knowledge of sound synthesis carried me to this final report; without his guidance
this work would not have been possible, and it kept me on the right track until the end.
I extend my thanks to Dr. Asmus for his help during summer 2003. My sincere thanks to
all my high school friends and teachers, undergraduate friends and teachers, Joe, my
Music Engineering friends, and the Indian graduate friends at UM. I appreciate those who
participated in the listening tests. I am thankful to Girish and to the Department of
Psychiatry and the Department of Gerontology at UM, who gave me a job. I thank Ali
Habashi for his help. I would like to thank Dr. Shariar Negadaripour (EE), Dr. Murat
Dogruel (EE), Dr. Micheal Scodrilis (EE), Dr. Don Wilson (Composition), Dr. Modestino
(EE), Dr. Mermelstein (EE), and Dr. Moeiz Tapia (EE), under whom I took courses and
completed class projects at UM. These courses and projects have indirectly helped me in
this thesis.
Table of Contents:
Chapter 1: Introduction ...................................................................................................1
Chapter 2: Literature Review..........................................................................................7
• Pitch Perception .......................................................................................7
Psychoacoustics of music...........................................................................7
Ear mechanisms and human auditory system............................................9
Duplex theory of pitch perception............................................................14
Missing fundamental effect ......................................................................15
Virtual and Spectral pitch ........................................................................17
• Fletcher-Munson Curves.......................................................................23
• Basic Communication System...............................................................27
• Analysis-Resynthesis..............................................................................31
Fourier Philosophy: Fast Fourier Transform Analysis...........................32
Classical theory of timbre ........................................................................36
Overview of sound synthesis techniques ..................................................37
Spectral modeling synthesis .....................................................................43
MQ-Synthesis ...........................................................................................49
Bandwidth Enhanced Sinusoidal Modeling .............................................56
• Phase Synthesis.......................................................................................59
• Perceptual Coding V Partial Synthesis Based Data Reduction.........63
Chapter 3: The Research: A Partial synthesis based audio data reduction..............75
• On the use of pitch perception in data reduction................................75
Fletcher and Munson Curves: Data Sets .................................................78
• The General Procedure: Two-filter method.........................................83
Modulation and Demodulation: Mid-frequency band .............................85
FFT Analysis of low-frequency and high-frequency bands ......................89
Peak detection ...........................................................................................94
Peak Continuation ....................................................................................99
Additive Partial Sinusoidal Synthesis (PSMS)........................................108
Cubic Spline Interpolation ......................................................................109
Fusion of the sensitive and less sensitive data........................................112
• Downsampling method .........................................................................113
• Duplex theory of pitch perception: Applications ...............................115
Improving Partial SMS: Four-filter method ...........................................115
Discarding HF Phase .............................................................................119
Possibilities in modifications ..................................................................122
• Advantages: Experimenting with a sine plus noise model ...............124
Chapter 4: Results: .......................................................................................................131
• Listening tests and results ...................................................................132
• Data reduction and results ..................................................................135
Chapter 5: Conclusion..................................................................................................139
• Future extension of the project...........................................................139
Pros and Cons: Perceptual coding V PSMS ........................................140
BIBLIOGRAPHY..........................................................................................................142
APPENDIX.....................................................................................................................144
List of Figures:
Figure 1.1: Overview of the proposed coder
Figure 2.1: The second dimension of pitch: chroma
Figure 2.2: The missing fundamental effect
Figure 2.3: Fletcher-Munson curves
Figure 2.4: Stages of communication
Figure 2.5: Basic communication system
Figure 2.6: Types of communication
Figure 2.7: Overview of general analysis and synthesis technique
Figure 2.8: (a-c) Nyquist sampling theorem
Figure 2.9: Additive synthesis
Figure 2.10: The amplitude progression of the partials of a trumpet tone
Figure 2.11: SMS: Block diagram of the analysis process [2]
Figure 2.12: SMS: Block diagram of the synthesis process [2]
Figure 2.13: McAulay-Quatieri sinusoidal analysis-synthesis system [5]
Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]
Figure 2.15: Peak detection in the MQ approach [5]
Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
Figure 2.17: Lemur
Figure 2.18: Lemur graphical tool
Figure 2.19: Bandwidth-enhanced sinusoidal modeling
Figure 2.20: MPEG audio compression and decompression
Figure 3.1: Perceptual coding approach vs. synthesis-based approach
Figure 3.2: Fletcher-Munson original curves (Fig. 3 in [1])
Figure 3.3: Figure 2 in [1]
Figure 3.4: Figure 3 in [1]
Figure 3.5: MATLAB plot of Figure 3.4
Figure 3.6: The schematic block diagram employed in our synthesis-based data reduction (two-filter method)
Figure 3.7: The two-filter method: band-pass and band-elimination filters; violin spectrum, FS = 44100, mono, 16 bit
Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16 bit, mono at 705 kbps)
Figure 3.9: Original and windowed short-time signals, Fourier analysis (Hanning window, 75% overlap)
Figure 3.10: Magnitude and phase spectrum of LF and HF bands
Figure 3.11: Peak detection in LF and HF bands
Figure 3.12: Missed peaks
Figure 3.13: Peaks below threshold
Figure 3.14: Peak continuation process
Figure 3.15: Crack removal: (top) with cracks; (bottom) without cracks
Figure 3.16: Cubic spline interpolation: crack removal
Figure 3.17: Spectral fusion (a more general model)
Figure 3.18: The downsampling method
Figure 3.19: High spectral resolution four-filter method
Figure 3.20: A schematic picture explaining how the analysis frame length is changed over the frequency scale
Figure 3.21: The four-filter method
Figure 3.22: (Top) High spectral resolution; (bottom) low spectral resolution: analysis of spectral resolution of low-frequency sounds (0-700 Hz)
Figure 3.23: Synthesizing the HF band with and without phase parameters
Figure 3.24: Plots explaining the synthesis of the different bands that have various analysis frame lengths and their final fusion (four-filter method)
Figure 3.25: Possibilities of modifications
Figure 3.26: Cross effect using PSMS modifications
Figure 3.27: Demonstration: advantages of the sine-plus-noise model in a two-filter method
Figure 4.1: Qualitative results: music genres: two-filter method
Figure 4.2: Qualitative results: tonal instruments: two-filter method
Figure 4.3: Qualitative results: percussion instruments: two-filter method
Figure 4.4: Qualitative results: music genres: four-filter method
Figure 4.5: Qualitative results: tonal instruments: four-filter method
Figure 4.6: Qualitative results: percussion instruments: four-filter method
Appendix figures:
Rock music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Country music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Speech, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Chinese pipa, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Gottuvadhyam and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Sitar and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
List of Tables:
Table 2.1: Critical bandwidth as a function of center frequency and critical band rate [8]
Table 3.1: Fletcher Munson curve: data sets
Table 4.1: Bit rates and audio compression ratio (Two-filter and four-filter method)
Table 5.1: Pros and cons: Perceptual coding V PSMS
CHAPTER 1
INTRODUCTION
The aim of this research project is to present a communication system that encodes less
information yet decodes all of the necessary information, while at the same time improving
the fidelity of existing synthesis-based data-reduction algorithms. The primary goal of
data reduction is achieved by following a simple tactic.
The human auditory system is most sensitive to the mid-frequency range (1-5 kHz) of the
spectrum. Experiments show that critical bands are much narrower at low frequencies
than at high frequencies; three-fourths of the critical bands lie below 5 kHz, so the ear
receives more information from low and mid frequencies and less from high frequencies.
We transmit through the communication channel the audio PCM data that represent the
mid-frequency spectrum above the human threshold of hearing. A simple modulation
technique is used: the mid-frequency passband is modulated down to baseband and
downsampled in time. At the receiver end the modulated data are upsampled and then
demodulated so that the message is recovered. This facilitates data reduction.
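To make this transmitter/receiver step concrete, the sketch below (written in Python with NumPy/SciPy purely for illustration; the band edges, filter order, and decimation factor are assumed values, not the settings used in this thesis) band-passes the sensitive region, shifts it down to baseband with a single-sideband mix, decimates it for the channel, and reverses the process at the receiver.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

    def tx_midband(x, fs, f_lo=1000.0, f_hi=5000.0, decim=4):
        """Isolate the sensitive mid band, shift it down so it starts at 0 Hz, and decimate."""
        bp = butter(6, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
        mid = sosfiltfilt(bp, x)                        # keep only the 1-5 kHz region
        t = np.arange(len(mid)) / fs
        analytic = hilbert(mid)                         # single-sideband (positive-frequency) copy
        baseband = np.real(analytic * np.exp(-2j * np.pi * f_lo * t))   # band now sits near 0 Hz
        return resample_poly(baseband, 1, decim)        # fewer samples cross the channel

    def rx_midband(y, fs, f_lo=1000.0, decim=4):
        """Upsample the received baseband data and shift it back up to the mid band."""
        up = resample_poly(y, decim, 1)
        t = np.arange(len(up)) / fs
        analytic = hilbert(up)
        return np.real(analytic * np.exp(2j * np.pi * f_lo * t))        # back in the 1-5 kHz band

Because the shifted band occupies only a few kilohertz of baseband, the decimated stream needs only a fraction of the original sample rate, which is where the reduction in transmitted data comes from.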
The sinusoidal model of Xavier Serra [2] is used to synthesize the low-frequency and
high-frequency bands of the spectrum. The audio for the low-frequency and high-
frequency regions of the human auditory range is synthesized at the receiver end of the
communication system. A band-elimination filter removes the mid-frequency band to
which humans are most sensitive. The outputs of the band-elimination filter, comprising
the less-sensitive low and high ends of the
spectrum, become inputs to the sinusoidal model. A band-pass filter is used to isolate the
sensitive frequencies of the mid spectrum. Since two filters are used, this method is
called the two-filter method. Sinusoidal modeling involves a short-time Fourier
analysis of overlapping time frames. Each time frame is windowed before analysis. The
short-time Fourier transform gives the spectral details of the current frame. A simple
peak-detection algorithm detects the vital peaks, i.e., local maxima in the spectrum. The
amplitudes of the peaks and their corresponding frequencies and phases form the
inputs of the oscillator at the synthesis end. Before synthesis, a peak-continuation
algorithm connects the spectral points (peaks) to form the sinusoids, commonly called
tracks. In spectral modeling synthesis and the McAulay-Quatieri algorithm the connected
tracks are smoothly interpolated from frame to frame. In this thesis, however, that
interpolation of frequency-domain parameters is replaced by a smooth cubic-spline
interpolation of the abruptly changing voltage levels between frames in the time domain.
This helps reduce the computational expense.
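As an illustration of this analysis stage, the Python sketch below (NumPy assumed; the window, frame length, and threshold are illustrative choices rather than the settings used in this work) windows one frame, computes its short-time spectrum, and keeps local maxima above a threshold as the peak candidates whose amplitudes, frequencies, and phases would drive the oscillator bank.

    import numpy as np

    def frame_peaks(frame, fs, threshold_db=-20.0):
        """Return (frequency, magnitude in dB, phase) for spectral peaks in one frame."""
        n = len(frame)
        window = np.hanning(n)
        spectrum = np.fft.rfft(frame * window)            # short-time Fourier transform of the frame
        mag = np.abs(spectrum) / (window.sum() / 2.0)     # normalize so a full-scale sine is ~0 dB
        mag_db = 20.0 * np.log10(mag + 1e-12)
        phase = np.angle(spectrum)

        peaks = []
        for k in range(1, len(mag_db) - 1):
            # a vital peak is a local maximum that rises above the detection threshold
            if mag_db[k] > threshold_db and mag_db[k] > mag_db[k - 1] and mag_db[k] >= mag_db[k + 1]:
                peaks.append((k * fs / n, mag_db[k], phase[k]))
        return peaks

    # Example: a 1 kHz tone analyzed in a 1024-sample frame at 44.1 kHz
    fs = 44100
    t = np.arange(1024) / fs
    print(frame_peaks(np.sin(2 * np.pi * 1000.0 * t), fs))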
The stochastic signals in the sensitive mid spectral band are transmitted as PCM
data. Therefore, the sinusoidal-plus-noise model was replaced with a purely sinusoidal
model in which even noise is modeled into tracks, although a typical SMS is a
sine-plus-noise model. In SMS, shorter tracks are usually deleted and analyzed
stochastically by a linear-prediction method. A convincing nth-order polynomial fit of the
stochastic frequency response is possible at the synthesis end if the linear-prediction
coefficients are sent through the channel; white noise can then be used as the excitation
of the linear-prediction filter to obtain a stochastic approximation of the noise.
Stochastic modeling is not
included in this project. It is generally better to build a deterministic-plus-stochastic model
because, in the real world, physical music signals are made up of sinusoids and musical
noise (excitation). Moreover, human beings do not follow the exact phase of noisy
transient sounds; in most cases it does not carry perceptually meaningful information.
Even though the stochastic-modeling approach is not followed in this thesis, the
advantages of including stochastic analysis in the two-filter method are demonstrated
briefly.
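Although stochastic modeling is not implemented in this project, the residual modeling just described can be sketched as follows in Python (NumPy/SciPy assumed; the prediction order and single-frame handling are simplifying assumptions): linear-prediction coefficients are fitted to a noise residual, and white noise is then filtered through the resulting all-pole filter to approximate it.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.linalg import solve_toeplitz

    def lpc(residual, order=12):
        """Estimate linear-prediction coefficients of a noise residual (autocorrelation method)."""
        r = np.array([residual[: len(residual) - k] @ residual[k:] for k in range(order + 1)])
        a = solve_toeplitz((r[:order], r[:order]), r[1 : order + 1])   # normal equations
        return np.concatenate(([1.0], -a))                             # error filter A(z)

    def resynthesize_noise(coeffs, n_samples, gain=1.0):
        """Drive the all-pole filter 1/A(z) with white noise to approximate the residual."""
        excitation = np.random.randn(n_samples)
        return gain * lfilter([1.0], coeffs, excitation)

    # Example: model one short frame of a noise-like residual and regenerate it
    frame = np.random.randn(2048)
    coeffs = lpc(frame)
    approx = resynthesize_noise(coeffs, len(frame))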
Figure 1.1: Overview of the proposed coder
A logical note about this system is that, according to the duplex theory of pitch
perception, low frequencies have poor frequency resolution compared to mid and high
frequencies. We use the SMS technique to synthesize the LF region; false sinusoidal-
trajectory connections and trajectory breaks are therefore expected either not to show up
or not to be audible as artifacts in the low frequencies of the output sound. Though this is
a logical conclusion, engineering is all about improving models. Hence, this research work
also introduces a new four-filter method that models the sound better, offering greater
promise of high fidelity and effective data reduction.
The four-filter method involves two filters in the low-frequency region, one in the mid
(sensitive) region, and one in the high-frequency region. Based on the frequency location
of each spectral band, the analysis frame length is changed in the time domain to yield an
appropriate frequency resolution in the frequency domain. Low frequencies have poor
spectral resolution, and the auditory system follows the time impulses at low frequencies.
Hence the analysis frame length is set to a small value there to capture time resolution,
and the value increases for each filter as we move toward the high-frequency end. This
not only results in smoother sinusoidal connections but also reduces the number of
parameters that must be sent through the channel for proper sound reconstruction.
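A minimal sketch of this per-band analysis follows, in Python with NumPy; the band edges and frame lengths below are illustrative assumptions, not the values used in the four-filter implementation.

    import numpy as np

    # Illustrative band edges (Hz) and analysis frame lengths (samples) at fs = 44.1 kHz.
    # The sensitive 1-5 kHz band is transmitted as PCM and therefore not analyzed here;
    # the frame length grows from the low-frequency filters toward the high-frequency one.
    BANDS = [((20.0, 350.0), 512), ((350.0, 1000.0), 1024), ((5000.0, 20000.0), 4096)]

    def analyze_bands(x, fs, hop_ratio=0.25):
        """Short-time Fourier analysis using a frame length chosen per frequency band."""
        results = []
        for (f_lo, f_hi), n in BANDS:
            hop = int(n * hop_ratio)
            window = np.hanning(n)
            frames = []
            for start in range(0, len(x) - n, hop):
                spectrum = np.fft.rfft(x[start:start + n] * window)
                freqs = np.arange(len(spectrum)) * fs / n
                keep = (freqs >= f_lo) & (freqs < f_hi)   # only this band's bins are retained
                frames.append((start / fs, spectrum[keep]))
            results.append({"band": (f_lo, f_hi), "frame_length": n, "frames": frames})
        return results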
The four-filter method was then improved further. In HF regions the ear follows
the amplitude envelope of the frequency spectrum and disregards its phase content. This
psychoacoustic evidence was exploited in our sinusoidal model by discarding the HF phase
parameters that would otherwise be fed into the oscillator.
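The synthesis side of this idea can be sketched as a small additive oscillator bank in Python (NumPy assumed); when no phase parameters are supplied, each high-frequency partial simply starts at zero phase instead of a transmitted value.

    import numpy as np

    def synthesize_partials(freqs, amps, duration, fs, phases=None):
        """Additive synthesis of one frame as a sum of oscillators.
        When phases is None the transmitted phase parameters are discarded (zero phase)."""
        t = np.arange(int(duration * fs)) / fs
        if phases is None:
            phases = np.zeros(len(freqs))
        out = np.zeros_like(t)
        for f, a, p in zip(freqs, amps, phases):
            out += a * np.cos(2 * np.pi * f * t + p)
        return out

    # Three high-frequency partials rendered without any phase information
    hf_frame = synthesize_partials([6000.0, 7500.0, 9100.0], [0.2, 0.1, 0.05], 0.05, 44100)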
The other way to improve the system is to shift the low-frequency region up the
frequency scale by downsampling in time by a factor of two. The sinusoids then have
better resolution, and the trajectory tracking becomes smooth and reliable. At the
same time the duration of the music is halved. This reduces the number of
parameters by a factor of two. After the parameters are obtained, the frequency parameters
are divided by two and sent through the channel. Modifications are applied to time-
stretch the signal back to its original length. If the downsampling factor becomes larger,
the listener may hear a "warbling" effect, which is an undesirable artifact in this case.
Chapter 3 contains more information on sinusoidal trajectory tracking.
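A rough sketch of this bookkeeping in Python (NumPy/SciPy assumed; analyze is a hypothetical placeholder for any frame-based partial estimator, not a routine from this thesis):

    import numpy as np
    from scipy.signal import decimate

    def lf_parameters_via_downsampling(x_lf, fs, analyze):
        """Analyze the LF band from a 2:1 decimated signal, then restore the frequency scale.

        `analyze` is a placeholder returning, per frame, a list of (freq, amp, phase) tuples."""
        y = decimate(x_lf, 2)     # half the samples and half the analysis frames
        params = analyze(y, fs)   # treated at the old rate, the LF content sits an octave higher
        corrected = [[(f / 2.0, a, p) for (f, a, p) in frame] for frame in params]
        # At synthesis time each frame is rendered over twice its analysis hop,
        # time-stretching the output back to the original duration.
        return corrected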
In this complex world perfection is never achieved, and that applies to this project too. The
success of this algorithm is judged by how close the system's audio output is to the
original audio. The complexity of real physical sounds is very difficult to represent, even
when we break it into the separate dimensions of time, frequency, and amplitude; to
represent it we need both the magnitude response and the phase response of the system.
Therefore the success of this synthesis-based music data reduction rests entirely on how
close the synthesized music is to the original music.
The synthesis method used here is musical sinusoidal-plus-noise modeling based
on the past research of Xavier Serra [2]. Sounds produced by musical instruments
and other physical systems can be modeled as a sum of deterministic and stochastic parts,
or as a sum of sinusoids plus a noise residual. The sinusoids are produced by a harmonic
vibrating system; the residual contains the energy produced by the excitation
mechanisms and other components that are not the result of periodic vibration [3]. This
synthesis method is applicable mainly to musical material. A more general scheme that
fits any sound, sometimes even noise, is the McAulay-Quatieri algorithm [5]. Our system
in this research works for any type of sound, from tonal to non-tonal to noisy
transient sounds. Our system allocates more bits to the signals to which humans are
sensitive. A perceptual coder must analyze a short-time signal to adaptively allocate more
bits to the meaningful music signal, fewer bits to less meaningful non-musical signal, and
no bits to useless information. If low-bit-rate coding could be used to code the
sensitive mid-frequency signals, this scheme could avoid all adaptive bit-allocation
computation, because the switch from sensitive to less-sensitive signal becomes a direct,
one-step external decision by the engineer. Hence, if low-bit-rate coding could be applied
to the sensitive frequencies, the opportunities to avoid computational cost increase. At the
same time, high fidelity could be attained because the algorithm focuses on the sensitive
frequencies, granting them bits liberally from the bit pool. However, this work is reserved
for the future. The third chapter describes the research project in more detail.
CHAPTER 2
LITERATURE REVIEW
PSYCHOACOUSTICS OF MUSIC
Psychoacoustics is the study of human auditory perception, ranging from the biological
design of the ear to the brain’s interpretation of aural information. Sound is only an
academic concept without our perception of it. Psychoacoustics is the branch of study
that explains the subjective response to everything we hear. It is only our response to
sound that fundamentally matters. Psychoacoustics seeks to reconcile acoustical stimuli
and all the scientific, objective, and physical properties that surround them, with the
physiological and psychological responses evoked by them.
Psychoacoustics can be defined simply as the psychological study of hearing. The aim of
psychoacoustic research is to find out how hearing works. In other words, the aim is to
discover how sounds entering the ear are processed by the ear and the brain in order to
give the listener useful information about the world outside.
The ear and its associated nervous system is an enormously complex, interactive system.
The physiology of the human hearing system has evolved incredible powers of
perception. At the same time it has its limitations. The ear is astonishingly acute in its
ability to detect nuance or defect in a signal. It is also tolerant of portions of the signal that do
not have perceptual importance. The accuracy of a coded signal can be very low, but this
accuracy is very frequency-dependent and time-dependent.
The ear is a highly developed physical organ (the eye, for example, can only receive
frequencies over one octave), but the ear is useful only when coupled to the interpretative
powers of the brain. Those mental judgments form the basis for everything we
experience from sound and music. The left and right ears do not differ physiologically in
their capacity for detecting sound, but their respective right and left brain halves do. The
two halves loosely divide the brain’s functions. [8]
PITCH AND PITCH PERCEPTION
Pitch refers to the tonal height of a sound object, e.g. a musical tone or the human voice.
The use of the term pitch is, however, often inconsistent in that the term is used both for a
stimulus parameter (i.e., synonymous with frequency) and for an attribute of auditory
sensation. People concerned with speech processing mostly use the term in the
former sense, meaning the fundamental frequency (oscillation frequency) of the glottal
oscillation (vibration of the vocal folds). In psychoacoustics (and so in the present
discussion) the term is used throughout in the latter sense, i.e., meaning an auditory
(subjective) attribute. The ANSI definition of psychoacoustical terminology says that
“pitch is that auditory attribute of sound according to which sounds can be ordered on a
scale from low to high”. To date this definition still is a useful basis, though it must be
complemented by taking account of certain additional aspects. [12]
EAR MECHANISMS IN PITCH PERCEPTION
The ear performs the transformation from acoustical energy to mechanical energy and
ultimately to the electrical impulses sent to the brain, where information contained in
sound is perceived. The outer ear collects sound and its intricate folds help us to assess
directionality. The ear canal resonates at about 3 kHz, providing extra sensitivity in the
frequency range critical for speech intelligibility. The eardrum transduces acoustical
energy into mechanical energy; it reaches maximum excursion at about 120 dB SPL,
above which it begins to distort the waveform. Three bones in the middle ear,
colloquially known as hammer, anvil and stirrup provide impedance-matching to
efficiently convey sounds in air to the fluid filled inner ear.
The coiled basilar membrane detects the amplitude and frequency of sound; those
vibrations are converted to electrical impulses and sent to the brain as neural information
along a bundle of nerve fibers. The brain decodes the period of stimulus and point of
maximum stimulation along the basilar membrane to determine frequency activity in
local regions surrounding the stimulus.
Examination of the basilar membrane shows that the ear contains roughly 30,000 hair
cells arranged in multiple rows along the basilar membrane, roughly 32 mm long; this is
the organ of Corti. The cells detect local vibrations of the basilar membrane and convey
audio information to the brain via electrical impulses. Frequency discrimination is such
that at low frequencies, tones a few hertz apart can be distinguished; however, at high
frequencies, tones must differ by hundreds of hertz. In any case, hair cells respond to the
strongest stimulation in their local region; this is called a critical band, a concept
introduced by Harvey Fletcher. Experiments show that critical bands are much narrower
at low frequencies than at high frequencies; three-fourths of the critical bands are below 5
kHz; the ear receives more information from low frequencies and less from high
frequencies. Critical bands are approximately 100 Hz wide for frequencies from 20 to
400 Hz and approximately 1/5 octave in width for frequencies from 1 to 7 kHz. Previous
research shows that critical bands can be approximated with the equation:
Critical Bandwidth in Hertz = 24.7(4.37F + 1)
where F = center frequency in kHz. [12]
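For example, at a center frequency of F = 1 kHz the formula gives 24.7(4.37 + 1) ≈ 133 Hz, while at F = 0.1 kHz it gives 24.7(0.437 + 1) ≈ 35 Hz.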
The bark is the unit of perceptual frequency; a critical band has a width of one bark;
1/100 of a bark equals 1 Mel. The bark scale relates absolute frequency (in Hertz) to
perceptually measured frequencies such as pitch or critical bands. Using a bark scale, the
physical spectrum can be converted to a psychological spectrum. In this way, a pure tone
(a single spectrum line) can be represented as a psychological masking curve.
The pitch place theory further explains the action of the basilar membrane. Carried by
the surrounding fluid, a sound wave travels the length of the membrane and peaks at
particular places along the membrane, where the greatest vibration of the membrane
occurs, corresponding to different frequencies. Specifically, high
frequencies are sensed at the membrane near the middle ear while low frequencies are
sensed at the farther end. The wave excited by a high-frequency sound does not reach the
far end of the basilar membrane. However, a low-frequency sound will pass through all
high frequency places to reach the far end. Because hair cells tend to vibrate at the
frequency of strongest stimulation, they will convey that frequency in a critical band,
ignoring lesser stimulation. This excitation curve is described by the cochlear spreading
function, an asymmetrical contour. Critical bands are important in perceptual coding
because they show that the ear discriminates between energy in the band, and energy
outside the band; in particular, this promotes masking. [8]
Perception of pitch is a complicated issue. Definitions of pitch as a sensory
(subjective) attribute tend toward a concept that includes not only the aspect of perceived
height but in addition one or even more aspects of tones that are relevant in music. The
most prominent of these additional aspects is octave equivalence: the notion that tones an
octave apart are somehow similar and so, in certain musical respects, "equivalent". Pitch
must be regarded as a two-dimensional attribute such that height is only one of two
dimensions. The second "dimension" is ordinarily termed chroma. According to this
concept a pitch is said to have both a certain height and a certain musical-categorical
value (chroma), e.g. "c-ness", "d-ness", etc. This is often illustrated by Roger Shepard’s
helical model. In that model, pitches are represented by points on an ascending helix such
that the vertical height of their position reflects pitch height, while the rotational angle of
the position corresponds to chroma. On the helix, pitches with one and the same chroma
(in music denoted by the same letter, c, d, e, etc.) are situated vertically above or below
one another. Practically any sound of real life including the tones of musical instruments
evokes several pitches at a time, though often (in particular for the harmonic complex
tones produced by conventional musical instruments) one of them is most prominent and
then is said to be the pitch. So the weak point of the ANSI definition is that there is no
guarantee that any sound having pitch indeed can unambiguously be positioned on the
low-high dimension. The auditory system also becomes confused when a C and a G
are played simultaneously, one tone perceived by the right ear and one by the left. [12]
When one listens to a pair of successive musical tones, one can ordinarily tell whether or
not the tones are equal in pitch; or if the first is higher in pitch than the second; or vice
versa. However, even for ordinary musical tones there is octave equivalence, which
means that tones may be confused with one another although their oscillation frequencies
differ by a factor of two. This implies that for harmonic complex tones there exists a
certain ambiguity of pitch which naturally emerges from the multiplicity of pitch.
The ambiguity of pitch can be much amplified by suppressing certain harmonics from the
Fourier spectrum of a "natural" harmonic complex tone. Shepard has described
observations on harmonic complex tones whose Fourier spectrum consisted only of
harmonics that were in an octave relationship, i.e., the 1st, 2nd, 4th, 8th, 16th, etc. While
the musical pitch class (the chroma) of such tones is well defined, the absolute height of
pitch is quite ambiguous; that is, octave confusions are very likely to occur.
Figure 2.1: The second dimension of pitch: Chroma [12]
This is particularly true when the frequency of the first harmonic is near the lower limit
of the hearing range while the upper part tones extend up to the high end of the hearing
range. In that case, there is indeed little if any information available to the ear about
what the fundamental frequency (oscillation frequency) actually is.
When, for instance, the oscillation frequency of the above type of tone is 10 Hz and the
number of part tones chosen is 11, the listener is exposed to a spectrum of part tones with
the frequencies 10, 20, 40, 80, ... ,10240 Hz. When that tone is followed by another
having twice the oscillation frequency of the first, the listener is exposed to 20, 40, 80,
up to 20480 Hz, and it is not surprising that one will not perceive much of a difference, if
any. So, under these conditions there is "perfect" octave equivalence.
From these notions it is easy to understand that when the ratio between the oscillation
frequencies of the two tones is 1.414, the listener at first sight cannot be expected to be
able to tell whether the second tone is higher in pitch than the first or vice versa. The
tritone paradox originates from the observation that listeners in fact do make fairly
consistent decisions on which of the two tones is higher in pitch, i.e., whether they heard
an upward or downward step of pitch. However, while the responses of individual
listeners are fairly consistent and reproducible, different listeners may give opposite
responses. Moreover, the responses of individual listeners turn out to be dependent on the
absolute height of the oscillation frequencies. That is, when the listening experiment is
made with a base frequency of, e.g., 12 Hz instead of 10 Hz, the individual responses
may systematically change. This was regarded as a particularly "paradoxical" outcome.
The basic aspects of the tritone paradox can be fairly well explained by the theory of
virtual pitch. However, the theory cannot account for the observed individual
differences, as the factors governing those differences as yet are unknown. [12]
DUPLEX THEORY OF PITCH PERCEPTION
Generally the ear mechanism works differently for low-frequency and high-frequency
pitch perception. At very low frequencies, we may hear successive features of a
waveform so that it is not heard as having just one pitch. The ear follows the energy
envelope in the LF region on the time scale. It takes into account the number of time
bursts per second and hence pitch sensation is based on periodicity. For high-frequency
contents the ear takes the position of vibration along basilar membrane of the cochlea into
account. In HF regions, the ear follows the amplitude envelope of the frequency
spectrum and leaves out its phase content. The two mechanisms appear to be about
equally effective at a frequency around 640 Hz. This is popularly called the Duplex
theory of pitch perception. [7] For frequencies well above 1000 Hz, the pitch frequency is
heard only when the fundamental is actually present.
THE MISSING FUNDAMENTAL EFFECT
When two single-frequency tones are present in the air at the same time, they will
interfere with each other and produce a beat frequency. The beat frequency is equal to the
difference between the frequencies of the two tones and if it is in the mid-frequency
region, the human ear will perceive it as a third tone, called a “subjective tone" or
"difference tone".
When two sound waves of different frequency approach the ear, the alternating
constructive and destructive interference causes the sound to be alternately soft and
loud – a phenomenon that is called "beating". The beat frequency is equal to the absolute
value of the difference in frequency of the two waves.
The subjective tones, which are produced by the beating of the various harmonics of the
sound of a musical instrument, help to reinforce the pitch of the fundamental frequency.
Most musical instruments produce a fundamental frequency plus several higher tones
that are whole-number multiples of the fundamental. The beat frequencies between the
successive harmonics constitute subjective tones that are at the same frequency as the
fundamental and therefore reinforce the sense of pitch of the fundamental note being
played. If the fundamental is 50 Hz and two of its successive harmonics, 150 Hz and 200 Hz,
beat with each other, the resultant 50 Hz equals the fundamental and hence
reinforces the pitch. If the lower harmonics are not reproduced because of poor fidelity
or filtering of the sound reproduction equipment, humans still hear the tone as having the
pitch of the non-existent fundamental because of the presence of these beat frequencies.
This is called the missing fundamental effect. It plays an important role in sound
reproduction by preserving the sense of pitch (including the perception of melody) when
reproduced sound loses some of its lower frequencies. The presence of the beat
frequencies between the harmonics gives a strong sense of pitch.
Figure 2.2: The missing fundamental effect
Fletcher in his first paper proposed that the missing fundamental was re-created by non-
linearities in the mechanism of the ear. He soon abandoned this false conclusion and in
his second paper described experiments saying that “a tone must include three successive
partials in order to be heard as a musical tone, a tone that has the pitch of the
fundamental, whether or not the fundamental is present”. [7]
For fundamental frequencies of up to about 1400 Hz, the pitch of a complex tone is
determined by the second and higher harmonics and not by the fundamental, whereas
beyond this frequency the opposite holds; this is the case both for tones with harmonics
of equal amplitude and for tones with harmonics of which the amplitudes fall by 6
dB/octave. For fundamental frequencies of up to about 700 Hz, the pitch is determined
by the third and higher harmonics; for frequencies up to about 350 Hz, by the fourth and
higher harmonics [18].
SPECTRAL PITCH AND VIRTUAL PITCH
The pitch of sine tones is, with high probability, a "place pitch", i.e., it depends on the place
of maximal excitation of the cochlear partition and is ultimately a result of peripheral
auditory Fourier analysis. On the other hand, it is evident that the pitch of many types
of complex tone cannot be explained by that principle, in particular the pitch of harmonic
complex tones whose fundamental Fourier component is weak or entirely missing.
Attempts have been made to resolve this conflict by searching for a parameter of sound
and a mechanism that accounts both for the pitch of sine tones and for that of complex
tones. That search has not succeeded to this day. Besides the pitch of sine tones there is
another type of pitch, namely virtual pitch. Both spectral pitch and virtual pitch ultimately
depend on aural Fourier analysis; the conceptual distinction is that, while any spectral
pitch is conceived as immediately corresponding to a spectral singularity, virtual pitch is
modeled as being deduced from a set of spectral pitches at a further stage of auditory
processing. The relationship between spectral pitch
and virtual pitch is analogous to that between primary and virtual visual contour in many
respects.
There is hardly any sound that does not elicit any spectral pitch at all. The harmonic
complex tones of real life, i.e., voiced speech and musical tones, are aurally represented
by a number of spectral pitches that correspond to the lower harmonics. The formants of
speech vowels elicit corresponding spectral pitches. Even random sound signals often are
either "colorized" by spectral irregularities, which will give rise to steady spectral pitches,
or there can occur instantaneous irregularities in the short-term Fourier spectrum that
elicit spectral pitches of which both the height and the instant of occurrence are random.
When any real-life sound (e.g., a footstep, a knock at the door, splashing water, the sound of a
car's engine, and fricative phoneme of speech) can be identified by ear, one can be sure
that - besides temporal structure - spectral pitch is involved. Spectral pitch is the most
important carrier of auditory information, as it is an element of higher-order, Gestalt-like
types of auditory percepts such as, e.g., the pitch of musical tones, the strike note of bells,
the root of musical chords, and the quality of a particular vowel.
The telephone channel does not distort the pitch of speech, although transmission is
confined to the frequency range from about 300 to about 3400 Hz. When we suppress
bass reproduction, we will notice that the fundamental becomes inaudible, while the
speaker's pitch continues to be well reproduced. The kind of pitch of the fundamental
that we may hear if the fundamental is strong enough, is the pitch of a sine tone; it is of
the spectral pitch type. The pitch that we ordinarily hear, however, does not depend on
the fundamental being audible; it is extracted by the auditory system from a range of
the Fourier spectrum that extends above the fundamental. The latter type of pitch is
termed virtual pitch.
A procedure was described for the automatic extraction of the various pitch percepts
which may be simultaneously evoked by complex tonal stimuli. The procedure is based
on the theory of virtual pitch, and in particular on the principle that the whole pitch
percept depends both on analytic listening (yielding spectral pitch) and on holistic
perception (yielding virtual pitch). The more or less ambiguous pitch percept governed
by these two pitch modes is described by two pitch patterns: the spectral-pitch pattern,
and the virtual-pitch pattern. Each of these patterns consists of a number of pitch (height)
values and associated weights, which account for the relative prominence of every
individual pitch. The spectral-pitch pattern is constructed by spectral analysis, extraction
of tonal components, evaluation of masking effects (masking and pitch shifts), and
weighting according to the principle of spectral dominance. The virtual-pitch pattern is
obtained from the spectral-pitch pattern by an advanced algorithm of sub-harmonic
coincidence assessment.
It can be concluded that, as an attribute of auditory sensation, virtual pitch is
fundamentally different in type from spectral pitch. This conclusion is strongly suggested
by the fact that one can hear both types of pitch at a time, having the same height.
Evidently, it is possible to communicate one and the same pitch (in terms of pitch height)
through two drastically different perceptual "channels": Spectral pitch is communicated
immediately, i.e., by a Fourier component's frequency, while virtual pitch is
communicated by providing to the auditory system information about the oscillation
frequency of a complex signal that is implied in the Fourier spectrum as a whole.
Formation of virtual pitch can essentially be said to be a process of subharmonic
matching. The tonal aspects of any sound are primarily represented by a set of spectral
pitches, and pertinent virtual pitches are "inferred" on the basis of the presumption that in
any case they must be subharmonic to the spectral pitches. The virtual pitch mechanism
deals with both "harmonic" and "inharmonic" sounds as well, though internally it strictly
sticks to the presumption that each and every virtual-pitch candidate must be a
subharmonic of a spectral pitch.
Where the partials in a sound are harmonically related, but with the first member of the
series missing (for example, a sound with partials at 500 Hz, 750 Hz, 1000 Hz, 1250 Hz,
etc.), a virtual pitch can be heard at 250 Hz - the missing fundamental. Where the partials
are not exactly harmonic, a virtual pitch is still heard at about the same place, but the
exact frequency turns out to be determined in quite a complicated way by the frequencies
of the individual partials. No comprehensive rule for determining virtual pitch is yet
known. Extensive research has been done on virtual pitch and it has been proved that it is
not due to a simple explanation such as difference tones, but rather a side effect of the
human hearing mechanism. There is no doubt that the strike note of a bell is a virtual
pitch, as will be explained below. Virtual pitch effects often dominate spectral pitches,
for example, in bells the strike note is about an octave below the nominal even if the
tierce, only a minor third away, is very strong.
Frequencies of partials present in sounds can be measured with scientific instruments, or
spectrum analyzers. Pitches cannot be measured with instruments. They exist only in our
perception of a sound. Only a human listener can tell us the pitch of a sound - and
different listeners may disagree on the perceived pitch. [12]
Spectral pitch is defined as an elementary auditory object that immediately represents a
spectral singularity. The simplest and most prominent example is the pitch of a sine tone.
A virtual pitch is characterized by the presence of harmonics or near-harmonics. A
spectral pitch corresponds to individual audible pure-tone components. Most pitches
heard in normal sounds are virtual pitches and this is true whether the fundamental is
present in the spectrum or not. The crossover from virtual to spectral pitch is thought to be
at about 800 Hz, but this depends on the selection of clear sinusoidal components in the
spectrum. This follows the duplex theory of pitch perception. [7]
A procedure for the schematic and automatic extraction of "fundamental pitch" from
complex tonal signals, such as voiced speech and music was developed by Ernst
Terhardt. While the aurally relevant "fundamental" of a complex signal cannot be defined
in purely mathematical terms, an existent model of virtual-pitch perception provided a
suitable basis [13]. The procedure comprised the formation of determinant spectral
pitches (which correspond to the frequencies of certain signal components), and the
deduction of virtual pitch (or "fundamental frequency") from those spectral pitches. The
latter deduction was accomplished by a principle of sub-harmonic matching, for whose
realization a simple, universal, and efficient algorithm was found. While the calculation
may be confined to the determination of "nominal" virtual pitch, certain typical auditory
phenomena, such as the influence of SPL, partial masking, and interval stretch, were
accounted for, whereby the 'true' virtual pitch was obtained [14]. An algorithm for
extraction of pitch and pitch salience from complex tonal signals is mentioned in [13].
The core idea behind the project presented here is to make use of this natural
phenomenon of virtual pitch in audio data reduction. Human sound perception is not
sensitive to the detailed spectral shape or phase of non-periodic sounds. The sinusoidal
model used in this research work takes advantage of the human inability to perceive the
exact spectral shape of signals. Even the phase of transient noisy signals has limited
perceptual importance. [3]
In pure tones, because the frequency composition is so simple, no distinction can be made
between three different properties of a tone: its fundamental frequency, its pitch, and its
spectral balance. The case is different with complex tones. Let us take the three
properties in turn. The fundamental frequency of a tone is the frequency of the repetition
of the waveform. If the tone is composed of harmonics of a fundamental, this repetition
rate will be at the frequency of the fundamental regardless of whether the fundamental is
actually present. In harmonic tones, the perceived pitch of the tone is determined by the
fundamental of the series, even when the fundamental itself is not present. This phenomenon is
sometimes called “the perception of the missing fundamental”. Finally, the spectral
balance is the relative intensity of the higher and lower harmonics. This feature controls
our perception of the brightness of the tone [6].
We will make great use of these pitch-related concepts, especially the missing
fundamental effect and the duplex theory of pitch perception, in this research project in an
effective manner, as described in Chapter 3. While synthesizing the low-frequency and
high-frequency spectrum, our system might not reproduce all of the exact harmonics and
their corresponding energy levels. The missing fundamental effect explained above is
therefore usefully applied here to mask the absence of missing partials in music and the
absence of formants in speech applications. Minor errors in tracking the sinusoids
would hence be perceived less; only major tracking errors in SMS will clearly show up.
THE FLETCHER-MUNSON CURVES, 1933
Figure 2.3: The Fletcher-Munson equal-loudness curves (1933)
In 1933, Fletcher and Munson decided to gather some information about how we
perceive different frequencies at different amplitudes. They came up with the Equal
Loudness Contours or the Fletcher and Munson Curves. These curves give information
on the threshold of hearing at different frequencies and the apparent levels of equal
loudness at different frequencies.
The ear is not equally sensitive to all frequencies, particularly in the low and high
frequency ranges. The response to frequencies over the entire audio range has been
charted, originally by Fletcher and Munson in 1933, with later revisions by other authors,
like Robinson and Dadson, as a set of curves showing the Sound Pressure Levels (SPL)
of pure tones that are perceived as being equally loud. The curves are plotted for each 10
dB rise in level with the reference tone being at 1 kHz.
The curves are lowest in the range from 1 to 5 kHz, with a maximum dip around 3300
Hz, indicating that the ear is most sensitive to frequencies in this range. The intensity
level of higher or lower tones must be raised substantially in order to create the same
impression of loudness. The phon scale was devised to express this subjective impression
of loudness, since the decibel scale alone refers to actual sound pressure or sound
intensity levels.
Historically, the A, B, and C weighting networks on a sound level meter were derived as
the inverse of the 40, 70, and 100 dB Fletcher-Munson curves and used to determine
sound level. The lowest curve represents the threshold of hearing, the highest the
threshold of pain. The actual data sets and their requirement in this research will be
explained in detail in Chapter 3.
Dynamic Range
The instruments used to measure the magnitudes of sounds respond to changes in air
pressure. However, sound magnitudes are often specified in terms of intensity, which is
the sound energy transmitted per second (i.e., the power) through a unit area in a sound
field. For a medium such as air, there is a simple relationship between the pressure
variations of a plane sound wave in a free field (i.e., in the absence of reflected sound) and
the acoustic intensity; intensity is proportional to the square of the pressure variation. [17]
The dynamic range of an acoustic or electro-acoustic system is the difference in sound-pressure
level between the saturation or overload level and the background noise level, measured in decibels.
This range may be expressed as a signal-to-noise ratio for maximum output. For a sound
or a signal, its dynamic range is the difference between the loudest and quietest portions.
The human hearing system has a dynamic range of about 120 dB between the threshold
of hearing and the threshold of pain.
The intensity level where a sound becomes just audible is the threshold of hearing. For a
continuous tone of between 2000 and 4000 Hertz, heard by a person with good hearing
acuity under laboratory conditions, this is a sound pressure of 0.0002 dyne/cm2 and is given
the reference level of 0 dB.
While 0 dB is the reference employed, the threshold of hearing varies considerably with
lower and higher frequencies. This curve is also called minimum audible field (MAF).
Alternate units for this reference level are: 2 x 10^-4 microbar (µbar); 2 x 10^-5 newton/m2
(N/m2); 2 x 10^-5 pascal (Pa); 20 micropascal (µPa).
The threshold of pain is the intensity level of a loud sound that causes pain in the ear,
usually between 115 and 140 dB.
The square of the sound pressure is proportional to sound intensity. SPL can be
calculated in the same manner and is measured in decibels:
SPL = 10 log10 (r/rref)^2 = 20 log10 (r/rref)
where r is the given sound pressure and rref is the reference sound pressure.
Decibel is the unit of a logarithmic scale of power or intensity called the power level or
intensity level. The decibel is defined as one tenth of a bel where one bel represents a
difference in level between two intensities I1, I0 where one is ten times greater than the
other.
Intensity level = 10 log10 (I1 /I0) (dB)
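For example, a sound of intensity 10^-6 watt/m2, measured against the 10^-12 watt/m2 reference, has an intensity level of 10 log10(10^6) = 60 dB.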
Because of the very large range of sound intensity which the ear can accommodate, from
the loudest (1 watt/m2) to the quietest (10^-12 watt/m2), it is convenient to express these
values as a function of powers of 10. The result of this logarithmic basis for the scale is
that increasing a sound intensity by a factor of 10 raises its level by 10 dB; increasing it
by a factor of 100 raises its level by 20 dB; by 1,000, 30 dB and so on. When two sound
sources of equal intensity or power are measured together, their combined intensity level
is 3 dB higher than the level of either separately. 0 dB is defined as the threshold of
hearing, and it is with reference to this internationally agreed upon quantity that decibel
measurements are made.
The phon is a unit used to describe the loudness level of a given sound or noise. The system is
based on the equal-loudness contours, where 0 phon at 1,000 Hz is set at 0 decibels, the
threshold of hearing at that frequency. The hearing threshold of 0 phon then lies along
the lowest equal-loudness contour. If the intensity level at 1,000 Hz is raised to 20 dB, the
second curve is followed.
For the purpose of measuring sounds of different loudness, the sone scale of subjective
loudness was invented. One sone is arbitrarily taken to be 40 phons at any frequency, i.e.,
at any point along the 40 phon curve on the graph. Two sones are twice as loud, e.g. 40 +
10 phons = 50 phons. Four sones are twice as loud again, e.g. 50 + 10 phons = 60 phons.
The relationship between phons and sones is shown in the chart, and is expressed by the
equation:
Phon = 40 + 10 log2 (Sone)
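Worked example: substituting into this formula, 1 sone gives 40 + 10 log2(1) = 40 phons, 4 sones give 40 + 10 log2(4) = 60 phons, and 16 sones give 40 + 10 log2(16) = 80 phons; each doubling of loudness in sones adds 10 phons.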
BASIC COMMUNICATION SYSTEM BLOCK
The research project described in Chapter 3 is basically a transmitting and receiving
communication system, so an overview of a very basic functional communication
block is given here. Today communication has entered our lives in so many different
forms that it is very difficult to lead a life without the various appliances and tools born out
of it. Communication is the process of conveying something from one point to another. It
can be classified according to whether the transmitter and receiver are within line of sight
of each other or separated by a greater distance.
If the two points are beyond the line of sight, then the relevant branch of communication
engineering comes into the picture, known as telecommunication engineering.
Figure 2.4: Stages of communication
In communication engineering, physical messages such as sound, words, pictures, etc.,
are converted into equivalent electrical values, called signals. This electrical signal is
conveyed to a distant place through a communication medium, and at the receiving end
the electrical signal is reconverted into the original message.
Figure 2.5: Basic communication system
Source
The message produced by the source is not necessarily electrical in nature; it may be a voice signal, a picture signal, and so on. An input transducer is therefore required to convert the original physical message into a time-varying electrical signal. These signals are called baseband signals, message signals, or modulating signals. At the destination another transducer converts the electrical signal back into the appropriate message.
Transmitter
The transmitter, comprising electrical and electronic components, converts the message signal into a form suitable for propagation over the communication medium. This is often achieved by modulating a carrier signal (i.e., a high-frequency signal that carries the modulating or message signal), which may be an electromagnetic wave. The resulting wave is referred to as the modulated signal.
Modulation and Demodulation
Modulation is the process by which some characteristic of a high-frequency carrier signal is varied in accordance with the instantaneous value of another signal, called the modulating or message signal. The signal containing the information or intelligence to be transmitted is known as the modulating or message signal; it is also known as the baseband signal. The term baseband designates the band of frequencies representing the signal supplied by the source of information. Usually the frequency of the carrier is much greater than that of the modulating signal. The signal resulting from the process of modulation is called the modulated signal. Demodulation is the process of recovering the original message signal from the modulated carrier, and it is performed at the receiving end.
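As an illustration of these ideas (and not of the specific coder described in Chapter 3), the following Python sketch modulates a low-frequency message onto a carrier and recovers it by coherent demodulation followed by a crude moving-average low-pass filter; all signal parameters are arbitrary example values.

import numpy as np

fs = 8000                                   # example sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)
message = np.cos(2 * np.pi * 20 * t)        # 20 Hz baseband message
carrier = np.cos(2 * np.pi * 1000 * t)      # 1 kHz carrier

modulated = message * carrier               # modulation shifts the message up around the carrier

# Coherent demodulation: multiply by the carrier again, then low-pass filter so that
# the baseband term survives and the component at twice the carrier frequency is removed.
mixed = modulated * carrier
kernel = np.ones(40) / 40                   # crude moving-average low-pass (5 ms)
recovered = 2 * np.convolve(mixed, kernel, mode="same")

# Away from the edges the recovered waveform matches the message to within a few percent.
print("max recovery error:", float(np.max(np.abs(recovered[100:-100] - message[100:-100]))))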
Channel
The transmitter and the receiver are usually separated in space. The channel provides the connection between the source and the destination. Regardless of its type, the channel degrades the transmitted signal in a number of ways, producing signal distortion. This occurs because of the imperfect bandwidth response of the channel and the contamination of the signal by channel noise.
Receiver
The main function of the receiver is to extract the message signal from the degraded version of the transmitted signal. The transmitter and receiver are carefully designed to avoid distortion and to minimize the effect of noise, so that faithful reproduction of the message emitted by the source is possible. The receiver operates on the received signal so as to reconstruct a recognizable form of the original message signal and deliver it to the user at the destination.
Communication types
Figure 2.6: Types of communication
This research focuses on digital communication of the mid-frequency-band PCM samples, as indicated in the figure above, plus transmission of the spectral parameters of the low-frequency and high-frequency bands.
ANALYSIS/RESYNTHESIS
Analysis-resynthesis is a technique in which the input signal is analyzed over short time segments and its spectrum is computed. The musician makes the needed modifications, and the desired sound is resynthesized in the final stage. One of the major application tools that uses this technique is the phase vocoder [4].
The analysis of a sound, to identify the harmonics that occur in the sound signal, is performed through the estimation of its power spectrum. Samples of musical tones are analyzed using short-time Fourier analysis to determine the time-varying frequency characteristics. The analysis is carried out on short segments of the input signal through a technique called windowing, with window widths based on the amplitude-envelope parameters obtained by analyzing the amplitude of the waveform with respect to time. The Fast Fourier Transform (FFT) is then applied to these discrete sections to obtain a spectrum of the sound signal. This data is then used for resynthesis of the original sound. The spectral modeling synthesis employed in this research works in the same way.
Figure 2.7 Overview of general analysis and synthesis technique
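A minimal sketch of the short-time analysis step is given below, assuming a Hann window, a 1024-sample frame, and a 512-sample hop; these values are illustrative choices rather than the settings used in this thesis.

import numpy as np

def stft_frames(signal, frame_size=1024, hop=512):
    """Yield (frame_index, magnitude_spectrum, phase_spectrum) for each windowed frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_size] * window
        spectrum = np.fft.rfft(frame)
        yield i, np.abs(spectrum), np.angle(spectrum)

fs = 44100
t = np.arange(int(0.5 * fs)) / fs
tone = np.sin(2 * np.pi * 440 * t)                      # a 440 Hz test tone
i, mag, phase = next(stft_frames(tone))
peak_bin = int(np.argmax(mag))
print(f"frame {i}: spectral peak near {peak_bin * fs / 1024:.1f} Hz")   # close to 440 Hz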
THE FOURIER PHILOSOPHY: DISCRETE FOURIER TRANSFORM
Continuous
For a continuous function of one variable f(t), the Fourier Transform F(f) will be defined
as:
F(f) = ∫_{-∞}^{∞} f(t) e^{-j2πft} dt
And the inverse transform as
f(t) = ∫_{-∞}^{∞} F(f) e^{j2πft} df
where j is the square root of -1 and e denotes the natural exponential,
e^{jφ} = cos(φ) + j sin(φ)
Discrete
Consider a complex series x(k) with N samples of the form
x0, x1, x2, …, xk, …, xN-1
where each xk is a complex number,
xk = xreal + j ximag
Further, assume that the series outside the range 0 to N-1 is extended N-periodically, that is, xk = xk+N for all k. The Fourier transform of this series will be denoted X(k); it also has N samples. The forward transform is defined as
X(n) = (1/N) Σ_{k=0}^{N-1} x(k) e^{-j2πkn/N}, for n = 0 … N-1
and the inverse transform is defined as
x(n) = Σ_{k=0}^{N-1} X(k) e^{j2πkn/N}, for n = 0 … N-1
Of course although the functions here are described as complex series, real-valued series
can be represented by setting the imaginary part to 0. In general, the transform into the
frequency domain will be a complex valued function, that is, with magnitude and phase.
Magnitude = |X(n)| = (Xreal² + Ximag²)^0.5
Phase = tan⁻¹(Ximag / Xreal)
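The transform pair and the magnitude/phase computation above can be implemented directly. The following Python sketch follows the convention used here (the 1/N factor on the forward transform), which differs from the convention in some FFT libraries; it is meant only as a numerical illustration.

import numpy as np

def dft(x):
    """Forward DFT with the 1/N factor, as in the definition above."""
    N = len(x)
    n = np.arange(N)
    return np.array([(x * np.exp(-2j * np.pi * k * n / N)).sum() / N for k in range(N)])

def idft(X):
    """Inverse DFT matching the forward definition above (no 1/N factor here)."""
    N = len(X)
    k = np.arange(N)
    return np.array([(X * np.exp(2j * np.pi * k * n / N)).sum() for n in range(N)])

x = np.random.default_rng(0).standard_normal(16)   # a real-valued test series
X = dft(x)
print("round-trip error:", float(np.max(np.abs(idft(X) - x))))   # numerically tiny
print("magnitude of first bins:", np.abs(X[:3]))
print("phase of first bins:", np.angle(X[:3]))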
The Nyquist Criterion and Sampling Theorem
The sampling theorem (often called "Shannon's sampling theorem") states that a continuous signal must be sampled at a rate of at least twice the highest frequency present in the signal.
More precisely, a continuous function f(t) is completely defined by samples every 1/fs (fs
is the sample frequency) if the frequency spectrum F(f) is zero for f > fs/2. Fs/2 is called
the Nyquist frequency and places the limit on the minimum sampling frequency when
digitizing a continuous signal.
Normally the signal to be digitized would be appropriately filtered before sampling to
remove higher frequency components. If the sampling frequency is not high enough the
high frequency components will wrap around and appear in other locations in the discrete
spectrum, thus corrupting it.
The key features and consequences of sampling a continuous signal can be shown
graphically as follows.
Consider a continuous signal in the time and frequency domain.
Figure 2.8 (a) Fourier transform (continuous)
Sample this signal with a sampling frequency fs; the time between samples is 1/fs. This is equivalent to convolving the spectrum in the frequency domain with a delta-function train with a spacing of fs.
Figure 2.8 (b) Fourier transform (Discrete)
If the sampling frequency is too low, the replicated frequency spectra overlap and the result becomes corrupted. This is called aliasing.
Figure 2.8 (c) Aliasing
Another way to look at this is to consider a sine function sampled twice per period
(Nyquist rate). There are other sinusoid functions of higher frequencies that would give
exactly the same samples and thus can't be distinguished from the frequency of the
original sinusoid.
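The following small numerical example illustrates this point: a 7 kHz sine sampled at 10 kHz (below its Nyquist rate of 14 kHz) produces the same samples, up to a sign, as a 3 kHz sine. The frequencies are arbitrary illustrative values.

import numpy as np

fs = 10_000                                  # sampling rate (Hz)
n = np.arange(32)                            # sample indices
high = np.sin(2 * np.pi * 7_000 * n / fs)    # 7 kHz tone, above fs/2 = 5 kHz
alias = np.sin(2 * np.pi * 3_000 * n / fs)   # its alias at fs - 7 kHz = 3 kHz

# The 7 kHz tone yields exactly the sample values of an inverted 3 kHz tone,
# so the two cannot be distinguished after sampling.
print("max sample difference:", float(np.max(np.abs(high + alias))))   # ~0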
CLASSICAL THEORY OF TIMBRE
An overview of timbre definitions and the theory behind them is provided here because timbre modifications are possible in this research project. Chapter 3 contains details of creating various sound effects through such modifications.
International Standards Organization & American National Standards Institute:
"Timbre is that attribute of auditory sensation in terms of which a listener can judge that
two sounds similarly presented and having the same loudness and pitch are dissimilar."
DIMENSIONS OF TIMBRE
A considerable amount of effort has been devoted to finding the perceptual dimensions of timbre, the 'color' of a sound. Often these studies have involved multidimensional scaling experiments, in which a set of sound stimuli is presented to subjects, who then rate their similarity or dissimilarity. On the basis of these judgments a low-dimensional space that best accommodates the similarity ratings is constructed, and a perceptual or acoustic interpretation is sought for these dimensions.
Two of the main dimensions described in these experiments have usually been spectral
centroid and rise time. The first measures the spectral energy distribution in the steady
state portion of a tone, which corresponds to perceived brightness. The second is the time
between the onset and the instant of maximal amplitude.
The psychophysical meaning of the third dimension has varied, but it has often related to
temporal variations or irregularity in the spectral envelope. These available results
provide a good starting point for the search of features to be used in musical instrument
recognition systems [21]. Since this research project does not require a detailed treatment of timbre theory except where modifications are concerned, timbre is not discussed further here. In a pure sinusoidal model, one has access to each individual frequency component in the form of a "track"; a track can be modified, and new timbres can be created. In this project, there are opportunities for modifying tracks to create partial timbre modifications. Examples of this kind are provided in the third chapter.
AN OVERVIEW OF SOUND-SYNTHESIS TECHNIQUES
When generating musical sound on a digital computer, it is important to have a good
model whose parameters provide a rich source of meaningful sound transformations.
Three basic model types are in prevalent use today for musical sound generation:
instrument models, spectrum models, and abstract models. Instrument models attempt
to parametrize a sound at its source, such as a violin, clarinet, or vocal tract. Spectrum
models attempt to parametrize a sound at the basilar membrane of the ear, discarding
whatever information the ear seems to discard in the spectrum. Abstract models, such as
FM, attempt to provide musically useful parameters in an abstract formula. The following
passages will be an overview of widely used sound synthesis techniques for music
purposes.
Additive Synthesis
The philosophy behind all sound-synthesis methods is Euler's idea that any physical sound we hear can be broken down into a collection of sinusoids, viz. sine waves and cosine waves. These building blocks can be subjected to mathematical operations, and desired sounds can be synthesized from them. This is exactly what the Fourier transform exploits in its time-to-frequency mapping. Additive synthesis, one of the oldest and most heavily researched synthesis techniques, is likewise based on the summation of elementary waveforms to create more complex waveforms. This technique is regarded as the most powerful and flexible spectral modeling technique. It was among the first synthesis techniques in computer music and was described extensively in the very first article of the very first issue of the Computer Music Journal. Additive synthesis assumes that any periodic waveform can be modeled as a sum of sinusoids with time-varying amplitude envelopes and time-varying frequencies. Basically, it puts together a number of different wave components, which can be partials or harmonics, to arrive at a particular sound. The figure below shows additive synthesis summing sinusoids to form specific waveforms. Additive synthesis allows more control than any other kind of synthesis, as it permits fine control over individual frequency components. Moreover, a synthesis may interpolate between the frequency spectra of two or more different sounds. Additive synthesis is more effective at modeling steady-state sounds than portions of sound such as the transients in the attack part of the sound.
Figure 2.9: Additive Synthesis
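A minimal additive-synthesis sketch follows, assuming simple linear amplitude envelopes for each partial; the partial frequencies and envelope values are illustrative only and do not correspond to any analyzed instrument.

import numpy as np

def additive(partials, duration, fs=44100):
    """Sum sinusoidal partials; each partial is (frequency_hz, start_amplitude, end_amplitude)."""
    t = np.arange(int(duration * fs)) / fs
    out = np.zeros_like(t)
    for freq, a0, a1 in partials:
        envelope = np.linspace(a0, a1, len(t))      # simple linear time-varying amplitude
        out += envelope * np.sin(2 * np.pi * freq * t)
    return out / max(len(partials), 1)              # crude normalization

# A tone built from three decaying harmonics of 220 Hz (illustrative values only).
tone = additive([(220, 1.0, 0.6), (440, 0.7, 0.2), (660, 0.4, 0.1)], duration=1.0)
print(tone.shape, float(np.max(np.abs(tone))))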
The phase factor: Phase is a trickster. Depending on the context, it may or may not be a significant factor in additive synthesis. For example, if one changes the starting phases of the frequency components of a fixed waveform and resynthesizes the tone, the change makes no difference to the listener, yet it may have a significant effect on the visual appearance of the waveform. Phase relations become apparent in the perception of the brilliant but short life of attacks, grains, and transients. The ear is also sensitive to phase relationships in complex sounds where the phases of certain components shift over time.
Addition of partials is limited in that it succeeds only in creating a more interesting fixed-waveform sound. Since the spectrum in fixed-waveform synthesis is constant over the course of a note, partial addition can never accurately reproduce the sound of an acoustic instrument; it approximates only the steady-state portion of an instrumental tone. Research has shown that the attack portion of a tone, where the frequency mixture changes on a millisecond-by-millisecond timescale, is far more useful for identifying traditional instrument tones than the steady-state portion. In any case, a time-varying timbre is usually more tantalizing to the ear than a constant spectrum (Grey 1975) [9].
Time varying Additive synthesis
By changing the mixture of sine waves over time, one obtains more interesting synthetic timbres and more realistic instrumental tones. In the trumpet note shown in the figure below, it takes 12 sine waves to reproduce the initial attack portion of the event; after 300 ms, only three or four sine waves are needed. [9]
Figure 2.10: The amplitude progression of the partials of a trumpet tone
Subtractive synthesis
Subtractive synthesis implies the use of filters to shape the spectrum of a sound source by
subtracting unwanted partials of its spectrum, while favoring the resonation of others. As
the source signal passes through a filter, the filter boosts or attenuates selected regions of
the frequency spectrum. If the original source is spectrally rich and the filter is flexible,
subtractive synthesis can shape close approximations of many natural sounds, as well as a
wide variety of new and unclassified timbres. This technique has been used successfully
to model percussion-like instruments and the human voice. [9]
Subtractive synthesis is often referred to as analogue synthesis because most analogue
synthesizers (i.e., non-digital) use this method of generating sounds. In its most basic
form, subtractive synthesis is a very simple process as follows:
OSCILLATOR ---------> FILTER ---------> AMPLIFIER
• An Oscillator is used to generate a suitably bright sound. This is routed through a
Filter.
• A Filter is used to cut-off or cut-down the brightness to something more suitable.
This resultant sound is routed to an Amplifier.
• An Amplifier is used to control the loudness of the sound over a period of time so as to emulate a natural instrument.
A filter can be literally any operation on a signal (Rabiner et al. 1972), but the most common use of the term describes devices that boost or attenuate regions of a sound spectrum, and that is the usage here. Filters can be implemented by one of these methods:
• Delaying a copy of an input signal slightly and combining the delayed input
signal with the new input signal.
• Delaying a copy of the output signal and combining it with the input signal. [9]
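The two delay-based structures just listed correspond, respectively, to a feed-forward (FIR) filter and a feedback (IIR) filter. The sketch below shows a one-sample-delay version of each; the gain value is an arbitrary example.

import numpy as np

def fir_one_zero(x, g=0.5):
    """y[n] = x[n] + g * x[n-1]: a delayed copy of the input combined with the new input."""
    y = x.astype(float).copy()
    y[1:] += g * x[:-1]
    return y

def iir_one_pole(x, g=0.5):
    """y[n] = x[n] + g * y[n-1]: a delayed copy of the output fed back and combined with the input."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - 1] if n > 0 else 0.0)
    return y

impulse = np.zeros(8)
impulse[0] = 1.0
print("FIR impulse response:", fir_one_zero(impulse))   # finite: 1, 0.5, 0, 0, ...
print("IIR impulse response:", iir_one_pole(impulse))   # decaying: 1, 0.5, 0.25, ...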
Chapter 3, which contains the research project, involves a great deal of band-elimination and band-pass filtering. The bandwidth is a measure of the selectivity of a filter and is equal to the difference between the upper and lower cutoff frequencies. The response of a band-pass filter is often described by terms such as sharp (narrow) or broad (wide), depending on the actual width. The passband sharpness is often quantified by means of the quality factor (Q). When the cutoff frequencies are defined at the -3 dB points, Q is given by
Q = f0/BW
where BW is the bandwidth [25]. Therefore a high Q denotes a narrow bandwidth. Bandwidth may also be described as a percentage of the center frequency.
SPECTRUM MODELING SYNTHESIS
The proposed coder in this research synthesizes the signals to which the ear is less sensitive. While many synthesis techniques are available, sinusoidal modeling synthesis is preferred here because sound can be modeled as a set of sinusoids; moreover, the sinusoidal model can sometimes be used in denoising applications.
The main advantage of this group of techniques is the existence of analysis procedures
that extract the synthesis parameters out of real sounds, thus being able to reproduce and
modify actual sounds. Our particular approach is based on modeling sounds as stable
sinusoids (partials) plus noise (residual component), thereby analyzing sounds with this
model and generating new sounds from the analyzed data. The analysis procedure detects
partials by studying the time-varying spectral characteristics of a sound and represents
them with time-varying sinusoids. These partials are then subtracted from the original
sound, and the remaining "residual" is represented as a time-varying filtered white noise
component. The synthesis procedure is a combination of additive synthesis for the
sinusoidal part and subtractive synthesis for the noise part.
This analysis/synthesis strategy can be used either for generating sounds (synthesis) or for transforming pre-existing ones (sound processing). To synthesize sounds we generally want to model an entire timbre family (i.e., an instrument), and that can be done by analyzing single tones and isolated note transitions performed on an instrument and building a database that characterizes the whole instrument or any desired timbre family, from which new sounds are synthesized. In the case of the sound-processing application, the goal is to manipulate any given sound, that is, without being restricted to isolated tones and without requiring a previously built database of analyzed data. [2]
Some of the intermediate results from this analysis/synthesis scheme, and some of the
techniques developed for it, can also be applied to other music-related problems, e.g.,
sound compression, sound-source separation, musical acoustics, music perception, and
performance analysis.
The Deterministic Plus Stochastic Model
A sound model assumes certain characteristics of the sound waveform or the sound-
generation mechanism. In general, every analysis/synthesis system has an underlying
model. The sounds produced by musical instruments, or by any physical system, can be
modeled as the sum of a set of sinusoids plus a noise residual. The sinusoidal, or
deterministic, component normally corresponds to the main modes of vibration of the
system. The residual comprises the energy produced by the excitation mechanism that is
not transformed by the system into stationary vibrations plus any other energy component
that is not sinusoidal in nature. For example, in the sound of wind-driven instruments, the
deterministic signal is the result of the self-sustained oscillations produced inside the
bore, and the residual is a noise signal that is generated by the turbulent streaming that
takes place when the air from the player passes through the narrow slit. In the case of
bowed strings, the stable sinusoids are the result of the main modes of vibration of the
strings, and the noise is generated by the sliding of the bow against the string, plus by
other non-linear behavior of the combined bow-string-resonator system. This type of
separation can also be applied to vocal sounds, percussion instruments and even to non-
musical sounds produced in nature.
A deterministic signal is traditionally defined as anything that is not noise (i.e., an analytic, perfectly predictable part, predictable from measurements over any continuous interval). However, in the present discussion the class of deterministic signals considered is restricted to sums of quasi-sinusoidal components (sinusoids with slowly varying amplitude and frequency). Each sinusoid models a narrowband component of the original sound and is described by an amplitude function and a frequency function.
A stochastic signal is fully described by its power spectral density, which gives the
expected signal power versus frequency. When a signal is assumed stochastic, it is not
necessary to preserve either the instantaneous phase or the exact magnitude details of
individual FFT frames.
Therefore, the input sound is modeled by
s(t) = Σ_{r=1}^{R} A_r(t) cos[θ_r(t)] + e(t)
where A_r(t) and θ_r(t) are the instantaneous amplitude and phase of the r-th sinusoid, respectively, and e(t) is the noise component at time t (in seconds).
The model assumes that the sinusoids are stable partials of the sound and that each one has a slowly changing amplitude and frequency. The instantaneous phase is then taken to be the integral of the instantaneous frequency ω_r(t), and therefore satisfies
θ_r(t) = ∫_0^t ω_r(τ) dτ
where ω_r(t) is the frequency in radians per second and r is the sinusoid number.
By assuming that e(t) is a stochastic signal, it can be described as filtered white noise,
e(t) = ∫_0^t h(t, τ) u(τ) dτ
where u(τ) is white noise and h(t, τ) is the response of a time-varying filter to an impulse at time t. That is, the residual is modeled by the convolution of white noise with a time-varying, frequency-shaping filter. [2]
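As a toy illustration of this deterministic-plus-stochastic model (not the analysis system itself), the sketch below renders a few slowly varying partials, obtaining each phase by accumulating the instantaneous frequency, and adds low-pass-filtered white noise as the residual; every parameter value is an arbitrary example.

import numpy as np

fs = 44100
t = np.arange(int(0.5 * fs)) / fs

# Deterministic part: a few partials with slowly varying amplitude and frequency.
deterministic = np.zeros_like(t)
for base_freq, amp in [(220.0, 0.6), (440.0, 0.3), (661.0, 0.15)]:
    freq = base_freq * (1.0 + 0.002 * np.sin(2 * np.pi * 3 * t))   # slight vibrato
    phase = 2 * np.pi * np.cumsum(freq) / fs                       # running integral of the frequency
    deterministic += amp * np.exp(-2.0 * t) * np.cos(phase)        # slowly decaying amplitude

# Stochastic part: white noise shaped by a simple smoothing (low-pass) filter.
noise = np.random.default_rng(0).standard_normal(len(t))
kernel = np.ones(32) / 32
stochastic = 0.05 * np.convolve(noise, kernel, mode="same")

signal = deterministic + stochastic
print("peak amplitude of the modeled signal:", float(np.max(np.abs(signal))))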
Analysis/Synthesis Process: Sinusoids + Noise
The deterministic plus stochastic model has many possible implementations, and we will
present a general one while giving indications on variations that have been proposed.
Both the analysis and synthesis are frame-based processes with the computation done one
frame at a time. Throughout this description, we will consider that we have already
processed a few frames of the sound and we are ready to compute the next one.
Figure 2.11: Block diagram of the analysis process ([2])
The figure above shows the block diagram for the analysis. First, we prepare the next
section of the sound to be analyzed by multiplying it with an appropriate analysis
window. Its spectrum is obtained by the Fast Fourier Transform (FFT), and the prominent
spectral peaks are detected and incorporated into the existing partial trajectories by means
of a peak-continuation algorithm. The relevance of this algorithm is that it detects the
magnitude, frequency, and phase of the partials present in the original sound (the
deterministic component). When the sound is pseudo-harmonic, a pitch-detection step
can improve the analysis by using the fundamental frequency information in the peak
continuation algorithm and in choosing the size of the analysis window (pitch-
synchronous analysis). [2]
The stochastic component of the current frame is calculated by first generating the
deterministic signal with additive synthesis and then subtracting it from the original
waveform in the time domain. This is possible because the phases of the original sound
are matched, and therefore the shape of the time-domain waveform is preserved. The
stochastic representation is then obtained by performing a spectral fitting of the residual
signal.
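The residual computation can be pictured with the toy example below, in which the partial parameters are simply assumed known rather than estimated by the peak-continuation step; subtracting the resynthesized partial in the time domain leaves a residual whose level is close to the added noise.

import numpy as np

fs = 44100
t = np.arange(int(0.2 * fs)) / fs
rng = np.random.default_rng(1)

# An "original" consisting of one stable partial plus low-level noise.
original = 0.8 * np.cos(2 * np.pi * 440 * t + 0.3) + 0.02 * rng.standard_normal(len(t))

# Pretend the analysis stage estimated this partial's parameters (assumed here, not estimated).
est_amp, est_freq, est_phase = 0.8, 440.0, 0.3
deterministic = est_amp * np.cos(2 * np.pi * est_freq * t + est_phase)

residual = original - deterministic     # time-domain subtraction leaves the stochastic part
print("residual RMS:", float(np.sqrt(np.mean(residual ** 2))))   # close to the 0.02 noise level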
Figure 2.12: Block diagram of the synthesis process [2]
The deterministic signal, i.e., the sinusoidal component, results from the magnitude and frequency trajectories, or their transformation, by generating a sine wave for each trajectory (i.e., additive synthesis). This can be implemented either in the time domain with the traditional oscillator-bank method or in the frequency domain using the inverse-FFT approach.
The synthesized stochastic signal is the result of generating a noise signal with the time-varying spectral shape obtained in the analysis (i.e., subtractive synthesis). As with the deterministic synthesis, it can be implemented in the time domain by a convolution or in the frequency domain by creating a complex spectrum (i.e., magnitude and phase spectra) for every spectral envelope of the residual and performing an inverse FFT. [2]
M-Q Synthesis
The spectrum-modeling synthesis described above is mainly targeted toward musical
signals. A similar application but a slightly different algorithm was proposed by Robert
McAulay and Thomas Quatieri for voice and speech signals [5, 26].
In 1986, Robert McAulay and Thomas Quatieri proposed a new method of
analysis/synthesis for discrete-time speech signals that attempted to develop a
reconstruction process that would result in a best possible approximation of the original
signal. They modeled speech signals as two components. The first was an excitation
signal which consisted of a sum of sinusoids with time-varying amplitudes and
frequencies, as well as an initial phase offset. The second component is the vocal tract, modeled as a time-variant filter with time-varying magnitudes and phases. These
two components are combined and expressed as
s(t) = Σ_{l=1}^{L(t)} A_l(t) e^{jΨ_l(t)}
where A_l(t) combines the time-varying magnitude response of the vocal tract and the amplitude of the excitation signal, and the phase Ψ_l(t) of the exponential includes the time-varying phase of the vocal tract as well as the initial phase offset of the excitation signal.
Figure 2.13: McAulay-Quatieri Sinusoidal Analysis-Synthesis system: [5]
To find expressions for these sinusoids, they derived a new technique to analyze the
signal. Using overlapping-windowing methods similar to standard short-time analysis,
the MQ method computes Fourier transforms of the individual windows. The peak
frequencies of each window (the partials) are found, and their amplitudes and phases are
extracted. The partials for each window are linked to those in the following window to
develop a trend in the progression of frequencies (their amplitude and phases). We call
each progression a track. The birth of a track occurs when there is no partial in the previous window with which to connect a peak in the current window. Conversely, the death of a track occurs when there is no partial in the following window with which to connect one in the current window.
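A highly simplified peak-continuation sketch is given below: each peak in the current frame is linked to the nearest unmatched peak in the previous frame within a frequency tolerance, and unmatched peaks yield births and deaths. The tolerance and matching rule are illustrative simplifications of the published algorithms.

def continue_tracks(prev_peaks, curr_peaks, max_hz_jump=30.0):
    """Link peaks of two adjacent frames; prev_peaks and curr_peaks are lists of frequencies in Hz."""
    links = []                                   # (previous index or None, current index)
    unmatched_prev = set(range(len(prev_peaks)))
    for j, f in enumerate(curr_peaks):
        candidates = [i for i in unmatched_prev if abs(prev_peaks[i] - f) <= max_hz_jump]
        if candidates:
            i = min(candidates, key=lambda i: abs(prev_peaks[i] - f))
            links.append((i, j))                 # continuation of an existing track
            unmatched_prev.discard(i)
        else:
            links.append((None, j))              # birth: no matching partial in the previous frame
    deaths = sorted(unmatched_prev)              # death: no continuation in the current frame
    return links, deaths

# 440 and 1320 Hz continue, 2000 Hz is born, and the 880 Hz track dies.
print(continue_tracks([440.0, 880.0, 1320.0], [443.0, 1325.0, 2000.0]))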
Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]
The MQ model gives outstanding results and reproduces signals that are audibly indistinguishable from the originals when applied to a wide variety of quasi-harmonic sounds. Perhaps its greatest advantage is the small amount of data required. To reproduce a signal using standard Fourier techniques, a great many coefficients must be retained; to reconstruct perfectly, an infinite number would be needed. With the MQ method, only the information about several time-varying sinusoids must be stored, and little else.
One of the flaws in the MQ method is how it represents noise. Noise shows up as tracks
that span only a small number of windows. It is difficult to represent these short tracks
using sinusoids, so other methods must be developed (see section entitled Bandwidth-
Enhancement).
In the analysis stage, the amplitudes, frequencies, and phases of the model are estimated
on a frame-by-frame basis, while in the synthesis stage these parameter estimates are
interpolated to allow for continuous evolution of parameters at all the sample points
between the frame boundaries.
The Sine Wave Speech Model
In the speech production model, the speech waveform s(t) is assumed to be the output of passing a vocal cord (glottal) excitation waveform through a linear system representing the characteristics of the vocal tract. The excitation function is usually represented as a periodic pulse train during voiced speech, where the spacing between consecutive pulses corresponds to the pitch of the speaker. Alternatively, the binary voiced/unvoiced excitation model can be replaced by a sum of sine waves.
Figure 2.15: Peak detection in the MQ approach [5]
The motivation for this sine-wave representation is that voiced excitation, when perfectly
periodic, can be represented by a Fourier series decomposition in which each harmonic
component corresponds to a single wave. Passing this sine wave representation of the
excitation through the time-varying vocal tract results in the sinusoidal representation of
the speech waveform, which, on a given analysis frame is described by
s(n) = Σ_{l=1}^{L} A_l cos(ω_l n + φ_l)
where A_l and φ_l represent the amplitude and phase of each sine wave component associated with the frequency track ω_l, and L is the number of sine waves. [5]
Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
Spectral Models Related to the Sinusoidal Model:
Additive synthesis is a traditional sound synthesis method that is very close to the
sinusoidal model. It has been used in electronic music for several decades [Roads 1995].
Like the sinusoidal model, it represents the original signal as a sum of sinusoids with
time-varying amplitudes, frequencies, and phases [Moorer 1985]. However, it does not differentiate between harmonic and inharmonic components. Representing non-harmonic components requires a very large number of sinusoids, so the best results are obtained for harmonic input signals. Vocoders are another group of spectral models. They represent
the input signal at multiple parallel channels, each of which describes the signal at a
particular frequency band. Vocoders simplify the spectral information and therefore
reduce the amount of data. The Phase vocoder is a special type of vocoder that uses a
complex short-time spectrum, thus preserving the phase information of the signal. The
phase vocoder is implemented with a set of band pass filters or with a short-time Fourier
transform. The phase vocoder allows time and pitch scale modifications, like the
sinusoidal model does [4].
The sinusoidal model was originally proposed by McAulay-Quatieri for speech coding
[5] purposes and by Smith and Serra [2, 11] [McAulay-Quatieri 1986; Smith & Serra
1987] for the representation of musical signals. Even though the systems were developed
independently, they were quite similar. Some parts of the systems such as the peak
detection were slightly different, but both systems had all the basic ideas needed for the
sinusoidal analysis and synthesis: the original signal was windowed into frames, and the
short-time spectrum was examined to obtain the prominent spectral peaks. The
frequencies, amplitudes and phases of the peaks were estimated and the peaks were
tracked into sinusoidal tracks. The tracks were synthesized using linear interpolation for
amplitudes and cubic polynomial interpolation for frequencies and phases.
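The per-sample interpolation between frame boundaries can be sketched as follows. For simplicity this example interpolates amplitude and frequency linearly and accumulates the interpolated frequency to obtain phase; the cited systems instead use cubic phase interpolation so that the measured phases are matched as well.

import numpy as np

def synth_segment(a0, a1, f0, f1, phase0, hop, fs=44100):
    """Render `hop` samples of one partial between two analysis frames."""
    n = np.arange(hop)
    amp = a0 + (a1 - a0) * n / hop                     # linear amplitude interpolation
    freq = f0 + (f1 - f0) * n / hop                    # linear frequency interpolation
    phase = phase0 + 2 * np.pi * np.cumsum(freq) / fs  # accumulate frequency to get phase
    return amp * np.cos(phase), phase[-1]              # return the end phase to start the next hop

samples, end_phase = synth_segment(0.5, 0.4, 440.0, 442.0, phase0=0.0, hop=512)
print(samples.shape, float(end_phase))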
Serra [1989] was the first to decompose the signal into deterministic and stochastic parts, adding a stochastic model to the sinusoidal model. Since then, this decomposition has been used in several systems. The majority of noise-modeling systems use one of two approaches: the spectrum is characterized either by a time-varying filter or by the short-time energies within certain frequency bands [3].
Pitch-Synchronous Analysis
The estimation of the sinusoidal modeling parameters is a difficult task in general. Most
of the problems are related to the analysis window length. If the input signal is
monophonic or consists of harmonic voices that do not overlap in time, it is advantageous
to synchronize the analysis window length to the fundamental frequency of the sound.
Usually, the frequencies of the harmonic components of voiced sounds are integral
multiples of the fundamental frequency. The advantage of the pitch-synchronous analysis
is most easily seen in the frequency domain: the frequencies of the harmonic components
correspond exactly to the frequencies of the DFT coefficients. The estimation of the
parameters is very easy, since no interpolation is needed, and the amplitudes and phases
can be obtained directly from the complex spectrum. Also, pitch-synchronous analysis
allows the use of window lengths as small as one period of the sound, while non-
synchronized windows must be 2-4 times the period depending on the estimation method.
This means that a much better time resolution is gained by using the pitch-synchronous
analysis.
Unfortunately, pitch-synchronous analysis cannot be used when several sounds with different fundamental frequencies occur simultaneously. In general, monophonic recordings represent only a small minority of musical signals, and therefore pitch-synchronous analysis typically cannot be applied. To keep the complexity of the system low, pitch-synchronous analysis was not included in our system.
Adaptive window length has been successfully used in modern audio coding systems, but
in a quite different manner: a long window is used for stationary parts of the signal, and
when rapid changes occur, the window is switched into a shorter one. This enables good
frequency resolution for the stable parts and a good time resolution in rapid changes.
Bandwidth-Enhanced Sinusoidal Modeling
The Reassigned Bandwidth-Enhanced Method [15], developed by Kelly Fitz, resolves the noise-modeling problems associated with the MQ method. Using the MQ method, signals are represented by a collection of sinusoidal components, and the peaks in the spectrum of each window are linked together (short-time analysis). If the signal being represented has clear, prominent peaks or a consistent trend in the frequencies from window to window, this analysis provides an accurate reconstruction. However, if a signal has significant energy outside the peaks, or very high-frequency noise, the MQ method does not represent the signal adequately. Such signals are said to be noisy, and the energy that cannot be represented is called noisy energy because it consists of frequencies with rapidly varying amplitude.
Figure 2.17: Lemur
These types of signals require many sinusoids to be represented adequately. The sinusoids that do represent them become tracks of short-duration partials with rapidly varying amplitudes and frequencies. It is difficult to distinguish noisy tracks caused by unwanted external noise from the short, jittery tracks that belong to the desired sound representation; the sinusoidal model does not provide a way of distinguishing noisy components from deterministic components. In addition, the representation of such a noisy signal is very fragile: time and frequency manipulation changes phase, which destroys the properties of the sound and introduces errors in the reconstructed signal.
Figure 2.18: Lemur Graphical tool
To provide a better way of representing noise, the Reassigned Bandwidth-Enhanced
Method uses Bandwidth-Enhanced Oscillators, which spread spectral energy away from
the partial’s center frequency. The partial’s energy is increased while the bandwidth also
increases relative to its spectral amplitude. The center frequency stays the same so that
frequency is spread evenly on both sides. By removing the noisy tracks and increasing
the bandwidth of neighboring tracks, the energy in the signal is conserved and a closer
representation to the original signal can be constructed.
Figure 2.19: Bandwidth-Enhanced sinusoidal modeling
These Bandwidth-Enhanced Oscillators can now be used to synthesize a sound signal from components that have varying frequencies, amplitudes, and concentrations of noise and sinusoidal energy. A greater variety of sounds can be represented with greater accuracy while still using the sinusoidal model, with longer, better-defined tracks. Bandwidth-Enhanced partials allow us to manipulate noise representations without discarding the desired noise content. The Bandwidth-Enhanced sinusoidal model thus provides an appropriate representation of noise and a way to distinguish the non-sinusoidal noise that must be removed.
PHASE SYNTHESIS
A summary of the importance of phase in audibility is given here; in this research we examine the importance of the phase parameters in the high-frequency region of the spectrum. Models of pitch perception found in the literature often discard the phase of the frequency components. Such models contradict time-domain models, in which the pitch of a complex tone is given as a function of the time interval between peaks in the waveform in "some dominant region of the basilar membrane." To verify whether the relative phase of the harmonics of a complex tone is important to the perception of pitch, Moore conducted a number of experiments. It was concluded that phase did in fact have an effect on perceived pitch in some cases; most often, however, it only affected the strength of the perceived pitch. Cariani and Delgutte later verified this.
Terhardt considered models of the pitch of complex tones in which it is assumed that only the frequency spectrum of the stimulus is important in determining pitch, the relative phase of the frequency components being irrelevant. In the words of Schouten, the pitch of a complex tone is given by the reciprocal of the time interval between corresponding peaks in the fine structure of the waveform evoked at some dominant region of the basilar membrane. The fine structure of a waveform may be influenced by changes in the relative phase of the components, and thus under some circumstances pitch ought to be affected by relative phase. Bennen said that relative phase can affect pitch, but that the effect is not mediated by changes in the temporal structure of the waveform. According to the temporal model, two types of change in pitch perception might occur with changes in the relative phase of the components: a change in pitch value, and a change in the clarity of pitch. [17]
"The frequency-domain representation of periodic sounds was studied by the scientists
Ohm, Helmholtz, and Hermann in particular. Ohm stated that changes in the phase
spectrum, although they altered the wave shape, did not affect its aural effect. Helmholtz
developed a method of harmonic analysis with acoustic resonators. According to these
studies, the ear is phase-deaf, and timbre is determined exclusively by the spectrum. Such
conclusions are still considered essentially valid for periodic sounds only because Fourier
series analysis-synthesis works only for those.
It was Ohm who first postulated in the early nineteenth century that the ear was, in
general, phase deaf. This view was a gross simplification. There are many instances in
which phase plays an extremely important role in the perception of sound stimuli. In fact,
it was Helmholtz who noted that Ohm's law didn't hold for simple combinations of pure
tones. However, for non-simple tones Ohm's law seems to be well supported by psycho-
acoustic literature. The importance of phase in perceiving musical sounds was
demonstrated by Clark, who clearly showed that in the absence of phase information,
acoustic waveforms sounded unrealistic.
Effect of phase on the timbre
One of the dimensions that govern the quality, or timbre, of a musical instrument is the directionality of the sound. Directionality, whether binaural or monaural, is determined by the phase of the sound signal. The literature of the past generally pays little attention to the phase of the signal; considering its directional and spatialization properties, phase plays a major role, but when we are prepared to ignore some dimensions while synthesizing, phase does not appear to be of much importance.
Changes in timbre are not distinct enough to be observed after the few seconds required to alter the phases; in any case these changes are too small to carry over from one vowel to another. Harmonics beyond the sixth to eighth give dissonances and beats, so it is not excluded that, for these higher harmonics, a phase effect exists. The maximum effect of phase on timbre is the difference between a complex tone in which the harmonics are in phase and one in which alternate harmonics differ in phase by 90°. The effect of lowering each successive harmonic by 2 dB is greater than the maximum phase effect described above. The effect of phase on timbre appears to be independent of sound level and of the spectrum.
Phase is perceptually important in many situations. However, it remains an open question to what extent phase matters when modeling naturally occurring sounds with spectral sound models. For non-periodic sound parts such as transients, the phase of the frequency components is of greater importance to the perception of the sound than it is for steady parts. [10]
The sound synthesized using the STFT is
s(t) = Σ_{n=0}^{k} a_n(t) cos[θ_n(t)]
where k is the number of partials, a_n(t) is the time-varying amplitude, and θ_n(t) is the time-varying phase.
For synthesis without phase information, θ_n(t) is simply obtained by integrating the measured frequency values over time [24]:
θ_n(t) = ∫_0^t ω_n(τ) dτ
Phase is an important parameter when performing additive analysis/synthesis of binaural
recordings. If the phase is left out, the ability to perceive spatial qualities of sounds is
substantially degraded. The phase is important in all incident positions, except
front/back, whereas spectral envelope is mainly influential in the lateral positions.
Perceptually important cues for use in localization are expected to be less present in
sounds synthesized without phase than with phase information. [24]
Perceptual Coding vs. the Synthesis-Based Approach
Perceptual coding is a digital audio coding technique that reduces the amount of data
needed to produce high-quality sound. Perceptual digital audio coding takes advantage of
the fact that the human ear screens out a certain amount of sound that is perceived as
noise. Reducing, eliminating, or masking this noise significantly reduces the amount of
data that needs to be provided. With perceptual coding of the record, physical identity is
waived in favor of perceptual identity. Using a psychoacoustical model of the human
auditory system, the codec identifies imperceptible signal content (to remove irrelevancy)
as bits are allocated. The signal is then coded efficiently (to avoid redundancy) in the
final bit stream. These steps reduce the quantity of data needed to represent an audio
signal. The intent is to hide quantization noise below signal-dependent thresholds of
hearing and then code as efficiently as possible. The method asks how much noise can
be introduced to the signal without becoming audible.
In the view of many observers, compared to new perceptual coding methods, pulse-code modulation is a powerful but inefficient dinosaur; because of its appetite for bits, PCM coding is limited in its usefulness. The desire to achieve lower bit rates through perceptual coding is appealing because it opens new applications for digital audio (and video) with acceptable signal degradation. Through psychoacoustics, we can understand how information is perceived by the ear [8].
Masking and Perceptual Coding
The world presents us with a multitude of sounds simultaneously. We automatically
accomplish the task of distinguishing each of the sounds and attending to the ones of
greatest importance. It is often difficult to hear one sound when a much louder sound is
present. This process seems intuitive, but on the psychoacoustic and cognitive levels it
becomes very complex. The term for this process is masking. In order to gain a broad
and thorough understanding of masking phenomenon, we can survey the definition and
its accompanying explanation from several views. Masking as defined by the American
Standards Association (ASA) is the amount (or the process) by which the threshold of
audibility for one sound is raised by the presence of another (masking) sound. For
example, a loud car stereo could mask the car's engine noise. The term was originally
borrowed from studies of vision, meaning the failure to recognize the presence of one
stimulus in the presence of another at a level normally adequate to elicit the first
perception.
The purpose of any data-reduction system is to decrease the data rate, the product of the
sampling frequency and the word length. This can be accomplished by decreasing the
sampling frequency; however, the Nyquist theorem dictates a corresponding decrease in
high-frequency audio bandwidth. Another approach uniformly decreases the word
length; however, this reduces the dynamic range of the audio signal by 6 dB/bit, thus
increasing the quantization noise. A more enlightened approach uses psychoacoustics.
Perceptual coders maintain sampling frequency but selectively decrease word length;
word-length reduction is done dynamically based on signal conditions [8].
Perceptual coders analyze the frequency and amplitude content of the input signal and
compare it to a model of human auditory perception. Using the model, the encoder
removes the irrelevancy and statistical redundancy of the audio signal. In theory,
although the method is lossy, the human perceiver will not hear degradation in the
decoded signal. Considerable data reduction is possible. For example, a perceptual coder
might reduce a channel’s bit rate from 768 kbps to 128 kbps; a word length of 16
bits/sample is reduced to an average of 2.67 bits/sample, and the data quantity is reduced
by about 83%. A well-designed perceptually coded recording, with a conservative level
of reduction, can rival the sound quality of a conventional recording because the data is
coded in a much more intelligent fashion, and quite simply, because we do not hear all of
what is recorded anyway. Perceptual coders are so efficient that they require only a
fraction of the data needed by a conventional system.
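The arithmetic behind these figures is straightforward; the short sketch below reproduces the 768 kbps example quoted above for a 16-bit, 48 kHz mono channel coded at 128 kbps.

def data_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Data rate in kilobits per second: sampling frequency times word length times channels."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

pcm = data_rate_kbps(48_000, 16)     # 768 kbps for one 16-bit, 48 kHz channel
coded = 128.0                        # coded rate in kbps, as in the example above
print("reduction ratio:", pcm / coded)                    # 6.0, i.e. a 6:1 coder
print("average bits per sample:", 16 * coded / pcm)       # about 2.67
print("data saved: %.0f%%" % (100 * (1 - coded / pcm)))   # about 83%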
Part of this efficiency stems from the adaptive quantization used by most perceptual
coders. With PCM, all signals are given equal word lengths. Perceptual coders assign
bits according to audibility. A prominent tone is given a large number of bits to ensure
audible integrity. Conversely, fewer bits can be used to code soft tones. Inaudible tones
are not coded at all. Together, bit rate reduction is achieved. A coder’s reduction ratio is
the ratio of input bit rate to output bit rate. Reduction ratios of 4:1, 6:1, or 12:1 are
common. Perceptual coders have achieved remarkable transparency, so that in many
applications reduced data is audibly indistinguishable from linearly represented data.
Tests show that ratios of 4:1 or 6:1 can be transparent. [8]
Critical Bands
To determine this threshold of audibility, an experiment must be performed. A typical
masking experiment might proceed as follows. A short (about 400 ms) pulse of a 1,000 Hz sine wave acts as the target, the sound the listener is trying to hear. Another
sound, the masker, is a band of noise centered on the frequency of the target (the masker
could also be another pure tone). The intensity of the masker is increased until the target
cannot be heard. This point is then recorded as the masked threshold. Another way of
proceeding is to slowly widen the bandwidth of the noise without adding energy to the
original band. The increased bandwidth gradually causes more masking until a certain
point is reached, at which no more masking occurs. This bandwidth is called the critical
band [6]. We can keep extending the masker until it is full bandwidth white noise, and it
will have no more effect than at the critical band.
Bark Band | Center Frequency (Hz) | Critical Bandwidth (Hz) | Low Frequency Cutoff (Hz) | High Frequency Cutoff (Hz)
1 | 50 | -- | -- | 100
2 | 150 | 100 | 100 | 200
3 | 250 | 100 | 200 | 300
4 | 350 | 100 | 300 | 400
5 | 450 | 110 | 400 | 510
6 | 570 | 120 | 510 | 630
7 | 700 | 140 | 630 | 770
8 | 840 | 150 | 770 | 920
9 | 1000 | 160 | 920 | 1080
10 | 1170 | 190 | 1080 | 1270
11 | 1370 | 210 | 1270 | 1480
12 | 1600 | 240 | 1480 | 1720
13 | 1850 | 280 | 1720 | 2000
14 | 2150 | 320 | 2000 | 2320
15 | 2500 | 380 | 2320 | 2700
16 | 2900 | 450 | 2700 | 3150
17 | 3400 | 550 | 3150 | 3700
18 | 4000 | 700 | 3700 | 4400
19 | 4800 | 900 | 4400 | 5300
20 | 5800 | 1100 | 5300 | 6400
21 | 7000 | 1300 | 6400 | 7700
22 | 8500 | 1800 | 7700 | 9500
23 | 10500 | 2500 | 9500 | 12000
24 | 13500 | 3500 | 12000 | 15500
25 | 18775 | 6550 | 15500 | 22050
Table 2.1: Critical Bandwidth as a function of center frequency and critical band rate [8]
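For reference, the band edges in Table 2.1 can be used directly as a lookup, as in the sketch below; the mapping simply returns the critical-band number whose range contains a given frequency.

# Upper band-edge frequencies (Hz) taken from Table 2.1, bands 1 through 25.
BARK_EDGES_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
                 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500,
                 12000, 15500, 22050]

def bark_band(freq_hz):
    """Return the 1-based critical-band (Bark) number whose range contains freq_hz."""
    for band, upper in enumerate(BARK_EDGES_HZ, start=1):
        if freq_hz <= upper:
            return band
    return len(BARK_EDGES_HZ)          # clamp anything above 22.05 kHz to the top band

for f in (80, 440, 1000, 4000, 16000):
    print(f"{f} Hz -> Bark band {bark_band(f)}")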
Critical bands grow larger as we ascend the frequency spectrum. Conversely, we have
many more bands in the lower frequency range, because they are smaller there. Critical bands seem to be formed at some level by an auditory filter bank. Critical bands and their center
frequencies are continuous, as opposed to having strict boundaries at specific frequency
locations. Therefore, the filters must be easily variable. Use of the auditory filter bank
may be the unconscious equivalent of our willfully focusing on a specific frequency
range.
Non-Simultaneous Masking
The ASA definition of masking does not address non-simultaneous masking. Sometimes
a signal can be masked by a sound preceding it, called forward masking, or even by a
sound following it, called backward masking. Forward masking results from the
accumulation of neural excitation, which can occur for up to 200 msec. In other words,
neurons store the initial energy and cannot receive another signal until after they have
passed it, which may be up to 200 msec. Forward masking effects are slight because
maskers need to be within the same critical band and even then do not have the broad
masked audiograms of simultaneous masking. Likewise, backward masking only occurs
under tight tolerances.
Central Masking and Other Effects
Another way to approach masking is to question at what level it occurs. Studies in
cognition have shown that masking can occur at or above the point where audio signals
from the two ears combine. The threshold of a signal entering monaurally can be raised
by a masker entering in the other ear monaurally. This phenomenon is referred to as
central masking, because the effect occurs between the ears.
Spatial location can have a profound effect on the effectiveness of a masker. Many
studies have been performed in which unintelligible speech can be understood once the
source is separated in space from the interference. The effect holds whether the sources
are actually physically separated or perceptually separated through the use of interaural
time delay. Asynchrony of the onsets of two sounds has been shown to help prevent masking, as long as the onset difference does not fall within the realm of non-simultaneous masking. Each 10
msec increase in the inter-onset interval was perceived as being equal to a 10 dB increase
in the target's intensity [6].
Fusion
The concept of fusion must be included in any intelligent discussion of masking, because
the two are similar and often confused. In both cases, the distinct qualities of a sound are
lost, and both phenomena respond in the same manner to the same variables. In fusion,
like in masking, the target sound cannot be identified, but in fusion the masker takes on a
different quality. The typical masking experiment does not necessarily provide a
measure of perceptual fusion. In a fusion experiment, on the other hand, listeners are
asked whether they can or cannot hear the target in the mixture or, even better, to rate
how clearly they can hear the target there. What we want to know is whether the target
has retained its individual identity in the mixture. [6]
Fusion takes into consideration interactive global effects of two sound sources on each
other, instead of trying to reduce the situation to two separate and distinct entities.
Masking experiments are concerned with finding the threshold at which the target cannot
be identified, ignoring the effect of the target on the masker. Use of psychoacoustic
principles for the design of audio recording, reproduction, and data reduction devices
makes perfect sense. Audio equipment is intended for interaction with humans, with all
their abilities and limitations of perception. Traditional audio equipment attempts to
produce or reproduce signals with the utmost fidelity to the original. A more
appropriately directed, and often more efficient, goal is to achieve the fidelity perceivable
by humans. This is the goal of perceptual coders.
The core of a perceptual coder is the psychoacoustic model. Generally a psychoacoustic model performs a time-to-frequency mapping, determines maximum SPL levels, determines the threshold in quiet, identifies tonal and non-tonal components, decimates the maskers, calculates masking thresholds, determines global and minimum masking thresholds, and calculates signal-to-mask ratios.
Figure 2.20: MPEG audio compression and decompression
Although one main goal of digital audio perceptual coders is data reduction, this is not a
necessary characteristic. Perceptual coding can be used to improve the representation of
digital audio through advanced bit allocation.
Not all data reduction schemes are perceptual coders. Some systems, the DAT 16/12 scheme for example, achieve data reduction simply by reducing the word length, in this case cutting off four bits from the least-significant side of the data word and achieving a 25% reduction. The data reduction scheme in the present research work, however, uses a different approach based on partial sound synthesis that relies on the auditory phenomenon of varying human sensitivity to different frequencies. The mid-frequencies are modulated to baseband and the signal is downsampled to a lower sampling rate; later, an upsampling process interpolates the in-between samples, and the signal is demodulated back to its original spectral location. This research is not a perceptual-coding scheme, although it takes advantage of perceptual phenomena.
Out of a desire for simplicity, the first digital audio systems were wide-band systems,
tackling the entire audio spectrum at once. Presently, perceptual coders are multiband
systems, dividing up the spectrum in a fashion that mimics the critical bands of
psychoacoustics. By modeling human perception, perceptual coders can process signals
much the way humans do, and take advantage of phenomena such as masking.
When using adaptive differential pulse-code modulation (ADPCM), the frequency spectrum is divided into four bands to remove imperceptible material. Once a determination is made
as to what can be discarded, the remainder is allocated the available number of bits. This
process is called dynamic bit allocation.
History of Synthesis-Based Audio Data Reduction
Synthesis-based music data reduction is a relatively young area of research. Xavier Serra's sinusoidal plus stochastic residual noise model [2, 11] was an effective synthesis-based data reduction system. A similar idea was MQ synthesis by McAulay and Quatieri [5, 20, 26], where the application was data reduction for digital speech processing. This research work is about improving on or replacing these two powerful music and speech data reduction approaches; the third chapter forms the research project. Eric Scheirer developed a synthesis-based data reduction approach, the Structured Audio Orchestra Language (SAOL), as part of MPEG-4. Unlike these historical synthesis-based data reduction schemes, which employ complete synthesis techniques, our data reduction scheme is a partial synthesis scheme. There is no connection to SAOL in this research work; however, we review it briefly as part of the history of synthesis-based approaches to data reduction.
AN OVERVIEW OF SAOL (Structured Audio Orchestra Language)
SAOL is a powerful, flexible language for describing music synthesis, and integrating
synthetic sound with "natural" (recorded) sound in an MPEG-4 bit stream. MPEG-4
integrates the two common methods of describing audio on the internet today: streaming
low-bit rate coding and structured audio descriptions (like MIDI files).
SAOL lives within the MPEG-4 paradigm of streaming data and decoding processes.
Thus, the Structured Audio toolset is not only a method of synthesis, but a streaming
format appropriate for internet-based (or any other channel) transmission of audio data.
The saolc package contains a program for encoding scores and orchestras into the streaming format, and a facility for decoding this format.
MPEG-4 Structured Audio has its roots in another Media Lab project called Netsound,
developed by Michael Casey and other members of the Machine listening group at the
MIT Media Lab in 1995-1996. NetSound has similar concepts to MPEG-4 Structured
Audio but uses Csound developed by Barry Vercoe for synthesis.
There are five major elements to the Structured Audio toolset:
• The Structured Audio Orchestra Language (or SAOL) is a digital-signal processing
language that allows for the description of arbitrary synthesis and control algorithms
as part of the content bit stream. The syntax and semantics of SAOL are standardized
here in a normative fashion.
• The Structured Audio Score Language, (or SASL) is a simple score and control
language which is used in certain profiles to describe the manner in which sound-
generation algorithms described in SAOL are used to produce sound.
• The Structured Audio Sample Bank Format (or SASBF). This format allows for the transmission of banks of audio samples to be used in wavetable synthesis, along with the description of simple processing algorithms to use with them.
• A normative scheduler description. The scheduler is the supervisory run-time element
of the Structured Audio decoding process. It maps structural sound control, specified
in SASL or MIDI, to real-time events dispatched using the normative sound-
generation algorithms.
• Normative reference to the MIDI standards, standardized externally by the MIDI
Manufacturers Association. MIDI is an alternate means of structural control which
can be used in conjunction with or instead of SASL. Although less powerful and
flexible than SASL, MIDI support in this standard provides important backward-
compatibility with existing content and authoring tools. [16]
Our research does not focus on synthesis methods that use an audio/music description language, nor does it include a package for writing music scores; such a project is reserved as a future extension of this work. Our method transmits PCM samples from the band to which we are most sensitive perceptually and synthesizes the spectral regions to which we are less sensitive. Details of the project implementation follow in the next chapter.
CHAPTER-3
THE RESEARCH PROJECT
A High-Fidelity Audio Data Reduction Scheme Using Partial Sinusoidal Modeling Synthesis (PSMS)
While perceptual coding is based on the energy levels of our perception and the masking phenomenon, the synthesis-based data reduction approach presented here is based on exploiting our complex pitch perception. In other words, the former considers the vertical amplitude scale of hearing, coding the bits we hear according to a loudness criterion, whereas the latter considers the horizontal frequency scale, making use of the complex pitch-perception mechanisms that our auditory system somehow manages to perform. This project takes advantage of the auditory system's complexity in order to engineer the music product.
A real-world example
We hear music inside a room. When we come out of the room and close the door, we still hear the music, with an obvious attenuation in the perceived energy level. The door acts as an attenuator; it also filters out some frequency components. Since our auditory system is sensitive to mid frequencies, it still perceives most of the mid-frequency content. We hear the pitch (fundamental) but not necessarily the timbre. The music we perceive after the door is closed is essentially the mid-frequency spectrum. In the proposed scheme we transmit the mid-frequency content, while the low and high frequencies are synthesized. The physical spectrum of the synthesized parts may differ widely or slightly over each short time period. However, this variation in spectral shape does not mean that the sound will be perceived as different from the original: the synthesized content will sound very close to the original because our auditory system does not follow spectral shapes exactly. Hence we make use of our inability to follow the complex pitch, the missing-fundamental concept, and the other concepts discussed in detail in Chapter 2.
Another example is our daily conversation over telephones and mobile phones, where we hear only a limited band of the speech signal, roughly 300 Hz to 3.4 kHz; this limited band is what makes the transmission practical. The aim of this project is not to carry out fundamental research on the complex topic of pitch perception in the auditory system, but to make use of those established ideas to engineer a music compression and synthesis system. Frequency is a literal measurement; pitch is not. Pitch is a subjective, complex characteristic based on frequency as well as other physical quantities such as waveform and intensity. For example, if a 200-Hz sine wave is sounded at a soft and then a louder level, most listeners will agree that the louder sound has a lower pitch. In fact, a 10% increase in frequency might be necessary to maintain a listener's subjective evaluation of constant pitch at low frequencies. On the other hand, in the ear's most sensitive region, 1 to 5 kHz, there is almost no change in pitch with loudness. Also, with musical tones, the effect is much less pronounced [8].
Figure 3.1: Perceptual coding approach vs. synthesis-based approach
Humans can identify pitch defects very easily in the sensitive mid-frequency region, but in this work we make no approximations or alterations to the mid-frequency samples. We transmit them as they are, thereby avoiding the possibility of creating a defective signal. This chapter covers four major topics:
1. the two-filter method;
2. the frequency-resolution downsampling method;
3. the four-filter method;
4. the advantages of using a sine-plus-noise model in the two-filter method.
FLETCHER AND MUNSON CONTOURS: DATA SETS
In order to analyze the success of the four methods listed above, visual Fletcher-Munson plots are useful. The full data set is therefore included in the figures below.
Figure 3.2: Fletcher Munson original curves (Fig 3 [1])
Figure 3.3: Figure 2 mentioned in [1]
From the curves in Figure 3.3, loudness-level contours can be drawn. The first set of loudness-level contours is plotted with levels above the reference threshold as ordinates. For example, the zero-loudness-level contour corresponds to the points where the curves of Figure 3.3 intersect the abscissa. The number of decibels above these points is plotted as the ordinate in the loudness-level contours shown in Figure 3.4 [1].
Figure 3.4: Figure 3 in [1]
In Figure 3.2 similar sets of loudness level contours are shown using intensity levels as
ordinates.
Figure 3.5: Matlab plot of figure 3.4
The formulae for computing the Figure 3 plots of the 1933 Fletcher-Munson paper are given in Table BI of the paper cited in [22]. These are nonlinear regression formulas, fit to the raw data of the 1933 paper rather than to the published curves, which are heavily smoothed.
y = C3*x^3 + C2*x^2 + C1*x + C0,
where C0, C1, C2, and C3 are polynomial coefficients:
y(1)  = 7.46169*(10^-5)*x^3 - 0.00984189*x^2 + 0.74629*x + 0.425879;   (62 Hz)
y(2)  = 5.2594*(10^-5)*x^3 - 0.00654132*x^2 + 0.800557*x + 0.295663;   (125 Hz)
y(3)  = 0.00124457*x^2 + 0.720323*x + 0.780066;   (250 Hz)
y(4)  = 0.00209933*x^2 + 0.761911*x + 0.467849;   (500 Hz)
y(5)  = x;   (1000 Hz)
y(6)  = -0.0011956*x^2 + 1.14141*x - 0.622967;   (2000 Hz)
y(7)  = -0.00240718*x^2 + 1.2314*x - 0.393083;   (4000 Hz)
y(8)  = -0.00272458*x^2 + 1.24014*x + 0.481007;   (5650 Hz)
y(9)  = -0.00232339*x^2 + 1.20659*x + 0.0426691;   (8000 Hz)
y(10) = -0.002439*x^2 + 1.24474*x - 1.51871;   (11300 Hz)
y(11) = -0.000566296*x^2 + 1.03446*x - 1.82771;   (16000 Hz)
Phon | 62 Hz | 125 Hz | 250 Hz | 500 Hz | 1 kHz | 2 kHz | 4 kHz | 5.65 kHz | 8 kHz | 11.3 kHz | 16 kHz
  0  |  48   |  32    |  20    |   5    |   0   |  -24  |  -32  |   -20    |  -12  |  -10     |   8
 10  |  52   |  40    |  28    |  14    |  10   |    7  |    2  |    10    |   20  |   22     |  30
 20  |  60   |  48    |  35    |  25    |  16   |   20  |   12  |    20    |   30  |   32     |  38
 30  |  65   |  52    |  42    |  32    |  28   |   30  |   22  |    30    |   40  |   42     |  50
 40  |  68   |  60    |  48    |  42    |  40   |   40  |   35  |    42    |   52  |   50     |  58
 50  |  71   |  64    |  55    |  49    |  50   |   50  |   46  |    55    |   60  |   65     |  66
 60  |  75   |  68    |  65    |  60    |  60   |   60  |   56  |    62    |   72  |   76     |  79
 70  |  80   |  72    |  70    |  71    |  70   |   70  |   63  |    72    |   82  |   83     |  85
 80  |  85   |  82    |  81    |  80    |  81   |   80  |   75  |    82    |   89  |   90     |  95
 90  |  90   |  91    |  90    |  90    |  88   |   90  |   83  |    90    |   96  |   99     | 108
100  | 100   | 100    | 102    | 104    |  98   |  100  |   90  |    99    |  104  |  105     | 111
110  | 111   | 110    | 111    | 113    | 108   |  110  |   98  |   105    |  110  |  112     | 118
Table 3.1: Fletcher-Munson data set: sound pressure levels (dB) for a given loudness level (phons); columns follow the frequency order of the formulas above
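As an illustration of how these regression formulas can be evaluated, the following MATLAB sketch restates the coefficients above and plots one curve per frequency. The range of the independent variable and the plot labels are assumptions for illustration only; x and y are to be interpreted as in [22].

    % Sketch: evaluate the regression polynomials above (one per frequency).
    freqs = [62 125 250 500 1000 2000 4000 5650 8000 11300 16000];
    C = [ 7.46169e-5  -0.00984189   0.74629    0.425879;    % 62 Hz
          5.2594e-5   -0.00654132   0.800557   0.295663;    % 125 Hz
          0            0.00124457   0.720323   0.780066;    % 250 Hz
          0            0.00209933   0.761911   0.467849;    % 500 Hz
          0            0            1          0;           % 1000 Hz
          0           -0.0011956    1.14141   -0.622967;    % 2000 Hz
          0           -0.00240718   1.2314    -0.393083;    % 4000 Hz
          0           -0.00272458   1.24014    0.481007;    % 5650 Hz
          0           -0.00232339   1.20659    0.0426691;   % 8000 Hz
          0           -0.002439     1.24474   -1.51871;     % 11300 Hz
          0           -0.000566296  1.03446   -1.82771];    % 16000 Hz
    x = 0:1:110;                                 % assumed range of the independent variable
    y = zeros(numel(freqs), numel(x));
    for k = 1:numel(freqs)
        y(k,:) = polyval(C(k,:), x);             % y = C3*x^3 + C2*x^2 + C1*x + C0
    end
    plot(x, y); grid on; xlabel('x'); ylabel('y');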
THE GENERAL PROCEDURE
Two-Filter Method
Figure 3.6: The schematic block diagram employed in our synthesis-based data reduction (two-filter method).
We are most sensitive to mid frequencies around 3300 Hz. The input audio signal is filtered with a band-pass filter with cutoff frequencies of 0.9 kHz and 6.1 kHz, which passes these mid frequencies. A band-elimination filter with cutoff frequencies of 1.1 kHz and 5.9 kHz removes the mid band from the input signal. A 200-Hz overlap is set between the transition bands to avoid phase distortion and spectral leakage. The output of the band-pass filter is the most sensitive data that we hear. The spectral content above the threshold of hearing is modulated to baseband and downsampled in time. The baseband signal is transmitted over the communication channel, and at the receiver it is translated back to the original pass band. The band-pass filter (0.9 kHz and 6.1 kHz) and band-elimination filter (1.1 kHz and 5.9 kHz) can be designed according to one's needs; the cutoff frequencies here were chosen based on the Fletcher-Munson curves. Moreover, the Fletcher-Munson curves provide information about the threshold of hearing that helps the peak-detection algorithm.
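A minimal MATLAB sketch of this two-filter split follows. The Butterworth designs, the filter order, and the 44.1-kHz sampling rate are illustrative assumptions, not the exact filters used in the implementation.

    % Sketch: split the input into the sensitive mid band and the residual LF+HF band.
    fs = 44100;                                   % assumed sampling rate (Hz)
    x  = randn(fs, 1);                            % placeholder input; replace with audioread('input.wav')
    [bBP, aBP] = butter(4, [900 6100]/(fs/2), 'bandpass');  % 0.9-6.1 kHz passed
    [bBS, aBS] = butter(4, [1100 5900]/(fs/2), 'stop');     % 1.1-5.9 kHz eliminated
    midBand  = filter(bBP, aBP, x);               % transmitted as PCM after modulation
    restBand = filter(bBS, aBS, x);               % LF + HF content for the sinusoidal model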
The output of the band-rejection filter forms the less sensitive data, which is the input to a sinusoidal/spectral model [2]. A short-time Fourier transform is applied to windowed time frames with a 75% frame overlap. The window length is chosen so that it captures more than one period of the signal. The 75% overlap is set to avoid signal leakage: normally a minimum of 75% overlap for Hamming windows and 50% overlap for rectangular windows is needed, because the main-lobe width of the Hamming window is four bins and that of the rectangular window is two bins. The hop size is the analysis frame length divided by the main-lobe width in bins. In any analysis-synthesis method the choice of window is critical; in this case the Kaiser and Hanning windowing schemes turned out to be more successful than the others. The different windowing schemes are discussed in detail later in this chapter.
The short-time Fourier spectrum is computed and the prominent peaks in the power spectrum are picked. A peak is defined here as a local maximum. Occasionally two peaks may lie very close to each other; in such cases the larger of the two peaks is kept and the other is discarded. This helps maintain the frequency characteristic of a sinusoid when connecting the peaks into frequency-distinct tracks.
Figure 3.7: The two-filter method: band-pass and band-elimination filters. Violin spectrum, Fs = 44.1 kHz, mono, 16-bit.
Modulation and Demodulation: MF Band
The output of the band-pass filter carries the most sensitive data, which is transferred through the channel as PCM audio samples. The sensitive data is modulated to baseband with a sampling frequency of twice the highest frequency in the baseband, and the baseband signal is downsampled in time by a factor of four. This reduces the data sent through the channel. At the receiver, the data is upsampled by a factor of four to recover the original length, translated back to the original pass band at the sampling rate of the input mid band, and fused with the less sensitive data that forms the output of the oscillator at the synthesis end. Extreme care must be taken with the sensitive data, because any artifact in it will be clearly audible.
Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16-bit, mono, 705 kbps)
In Figure 3.8, a downsampling factor of two is used; with a factor of four, greater data reduction can be obtained.
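The sketch below illustrates the frequency-translation idea in MATLAB for a factor of four, reusing midBand from the earlier sketch. It uses a complex (analytic) baseband for clarity, whereas the thesis transmits real PCM samples, and the shift frequency and interpolation filter are assumptions.

    % Sketch: shift the sensitive mid band to baseband, downsample it for
    % transmission, then undo both steps at the receiver.
    fs = 44100; M = 4; f0 = 900;                 % assumed rate, factor, and shift frequency
    n  = (0:length(midBand)-1).';
    z  = hilbert(midBand) .* exp(-1j*2*pi*f0*n/fs);   % analytic signal moved down to ~0-5.2 kHz
    tx = z(1:M:end);                             % downsampled baseband samples sent over the channel

    rx = zeros(size(z)); rx(1:M:end) = tx;       % receiver: upsample by zero insertion
    [bLP, aLP] = butter(6, 5500/(fs/2));         % interpolation low-pass (assumed cutoff)
    rx = M * filter(bLP, aLP, rx);
    midRecovered = real(rx .* exp(1j*2*pi*f0*n/fs));  % back to the 0.9-6.1 kHz pass band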
Channel
The modulated mid-frequency content is transmitted over the communication channel. During transmission, channel noise, which consists of transmission noise and reception noise, will corrupt the original data.
Partial Sinusoidal Modeling Synthesis (PSMS): LF and HF band
The main advantage of spectrum-modeling techniques is the existence of analysis procedures that extract the synthesis parameters from real sounds, making it possible to reproduce and modify actual sounds. SMS is based on modeling sounds as stable sinusoids (partials) plus noise (a residual component), analyzing sounds with this model and generating new sounds from the analyzed data. Before the analysis begins, the band-elimination filter removes the mid-frequency spectral data to which we are most sensitive. The analysis procedure detects partials by studying the time-varying spectral characteristics of a sound and represents them with time-varying sinusoids. The synthesis procedure is an additive synthesis method in which the instantaneous amplitudes, frequencies and phases are fed into separate oscillators and all sinusoids are summed frame by frame. In audio-signal spectrum modeling, the aim is to transform a signal into a more easily applicable form, removing information that is irrelevant to perception; sufficient time and frequency resolution are difficult to achieve simultaneously. The standard pulse-code-modulated (PCM) signal, which basically describes the sound-pressure levels reaching the ear, is not a good representation for the analysis of sounds. A general approach is to use spectrum modeling, a suitable middle-level representation, to transform the signal into a form that can be generated easily from the PCM signal but from which higher-level information can also be obtained more easily. The sinusoids-plus-noise model is one such representation. The sinusoidal part exploits the physical properties of general resonating systems by representing the resonating components with sinusoids; the noise model exploits the inability of humans to perceive the exact spectral shape or phase of stochastic signals. The sinusoids-plus-noise model is able to remove irrelevant data and encode signals at a lower bit rate, and it has been used successfully in audio and speech coding [3]. This project focuses on a purely sinusoidal model; PSMS using a sine-plus-noise model for the less sensitive data will be a future extension of the project.
A short theoretical overview of the process and the results obtained during the step-by-step implementation follow. First, the input signal is analyzed to obtain the time-varying amplitudes, frequencies and phases of the sinusoids; then the sinusoids are synthesized. In the parametric domain, modifications can be made to produce effects such as pitch shifting or time stretching. The analysis of the sinusoids is the most complex part of the system. First, the input signal is divided into partly overlapping, windowed frames. Second, the short-time spectrum of each frame is obtained by taking a discrete Fourier transform (DFT/FFT). The spectrum is analyzed, prominent spectral peaks are detected, and their parameters (amplitudes, frequencies, and phases) are estimated. Once the parameters of the detected sinusoidal peaks are estimated, the peaks are connected to form interframe trajectories. A peak-continuation algorithm tries to find the appropriate continuations for existing trajectories among the peaks of the next frame. The resulting sinusoidal trajectories contain all the information required for resynthesis: the sinusoids can be synthesized by interpolating the parameters of the trajectories and summing the resulting waveforms in the time domain.
The next phase of partial SMS includes magnitude- and phase-spectra computation, peak detection, peak continuation, cubic-spline interpolation between time frames, modification of the analysis data, synthesis, and fusion of the sensitive and less sensitive data. The first step is the computation of the magnitude and phase spectra.
FFT Analysis of Low and High Frequency Bands
The computation of the magnitude and phase spectra of the current frame is the first step in the analysis. The control parameters for the STFT (window size, window type, FFT size, and frame rate) have to be set in accordance with the sound to be processed. First of all, good spectral resolution is needed, since the process that tracks the partials has to be able to identify the peaks.
If a PSMS involves a sine-plus-noise model, the phase information is particularly important for subtracting the deterministic component to find the residual; in that case we should use an odd-length analysis window, and the windowed data should be centered at the origin of the FFT buffer to obtain a phase spectrum free of the linear phase trend induced by the window ("zero-phase" windowing).
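A minimal MATLAB sketch of the per-frame spectrum computation with zero-phase windowing follows, using restBand from the earlier sketch; the odd window length, the FFT size, and the Hanning window are illustrative assumptions.

    % Sketch: magnitude and phase spectrum of one frame with zero-phase windowing.
    M = 1023;  N = 2048;                       % odd window length, zero-padded FFT size
    w = hanning(M);
    frame = restBand(1:M) .* w;                % one windowed frame of the LF+HF signal
    half = (M-1)/2;
    buf = zeros(N, 1);
    buf(1:half+1)   = frame(half+1:end);       % window centre placed at the FFT origin
    buf(N-half+1:N) = frame(1:half);           % first half of the frame wrapped to the end
    X = fft(buf);
    magSpec   = 20*log10(abs(X(1:N/2)) + eps); % magnitude spectrum in dB
    phaseSpec = angle(X(1:N/2));               % phase, free of the window's linear trend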
Figure 3.9: Original and Windowed short time signals, Fourier analysis (Hanning
window)
The time-frequency compromise of the STFT must be well understood. For the deterministic analysis, it is important to have enough frequency resolution to resolve the partials of the sound. For the stochastic analysis, the frequency resolution is not as important, since we are not interested in particular frequency components and are more concerned with good time resolution. This can be accomplished by using different parameters for the deterministic and stochastic analyses. In this project we use a purely sinusoidal model rather than a sine-plus-noise model; the reason is discussed in a later section.
Figure 3.10: Magnitude and Phase spectrum of LF and HF bands
The computation of the spectra is carried out by the short-time Fourier transform
technique.
Window choice
The window choice plays an important role in any analysis-synthesis system. The choice depends on the problem at hand, the overlap rate, the precision required, and the computational cost. Popular windows such as the Hamming require a 75% overlap, i.e., the hop size is 25% of the analysis frame length; the rectangular window needs at least 50% overlap. For steady-state sounds we should use long windows (several periods) with good side-lobe rejection (for example, Blackman-Harris, 92 dB) for the deterministic analysis. This gives good frequency resolution and therefore a good measure of the frequencies of the partials. For harmonic sounds, the actual size of the window changes as the pitch changes, in order to assure a constant time-frequency trade-off for the whole sound.
The choice of analysis window determines the trade-off of time versus frequency resolution, which affects the smoothness of the spectrum and the detectability of the different sinusoidal components. The most commonly used windows are the rectangular, Hamming, Hanning, Kaiser, Blackman and Blackman-Harris windows.
All the standard windows are real and symmetric and have a frequency spectrum with a sinc-like shape. For the purposes of SMS, and in general for any sound analysis/synthesis application, the choice of window is mainly determined by two of its spectral characteristics: the width of the main lobe, defined for present purposes as the number of bins between the zero crossings on either side of the main lobe when the DFT length equals the window length, and the highest side-lobe level, which measures the gain of the highest side lobe relative to the main lobe. Ideally, we want a narrow main lobe (i.e., good frequency resolution) and a very low side-lobe level; the choice of window determines this trade-off. The rectangular window has the narrowest main lobe (two bins), but its first side lobe is very high, only 13 dB below the main-lobe peak. The Hamming window has a wider main lobe, four bins, and its highest side lobe is 43 dB down. A very different window, the Kaiser, allows control of the trade-off between main-lobe width and highest side-lobe level: if a narrower main lobe is desired, the side-lobe level will be higher, and vice versa. Since control of this trade-off is valuable, the Kaiser window is a good general-purpose choice. The window length must be sufficient to resolve the most closely spaced sinusoidal frequencies; a nominal choice for periodic signals is about four periods [11].
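The Kaiser trade-off can be explored directly, as in the short MATLAB sketch below; the window length and beta values are illustrative.

    % Sketch: Kaiser window trade-off between main-lobe width and side-lobe level.
    M = 511;  N = 4096;
    for beta = [4 6 9]                            % larger beta: lower side lobes, wider main lobe
        w = kaiser(M, beta);
        W = 20*log10(abs(fft(w, N)) / sum(w));    % normalised window spectrum in dB
        plot(0:N/2-1, W(1:N/2)); hold on;
    end
    xlabel('FFT bin'); ylabel('Magnitude (dB)');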
DFT Computation
Once a section of the waveform has been windowed, the next step is to compute its spectrum using the DFT. For practical purposes the FFT should be used whenever possible, but this requires the length of the analyzed signal to be a power of two. This can be accomplished by taking any desired window length and "zero padding," i.e., filling with zeros out to the length required by the FFT. This not only allows use of the FFT algorithm, but also computes a smoother spectrum: zero padding in the time domain corresponds to interpolation in the frequency domain.
The size of the FFT, N, is normally chosen to be the first power of two that is at least twice the window length M, with the difference N - M filled with zeros. If B is the number of samples in the main lobe when the zero-padding factor is 1 (N = M), then a zero-padding factor of N/M gives B*N/M samples across the same main lobe (and the same main-lobe bandwidth). The zero-padding (interpolation) factor N/M should be large enough to enable an accurate estimation of the true maximum of the main lobe. That is, since the window length is not an exact number of periods for every sinusoidal frequency, the spectral peaks do not, in general, occur at FFT bin frequencies (multiples of Fs/N); the bins must therefore be interpolated to estimate the peak frequencies. Zero padding is one type of spectral interpolation [11].
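Under this convention the FFT size for a given window length can be chosen as in the brief sketch below, reusing the windowed frame and sampling rate from the earlier sketch.

    % Sketch: pick the FFT size and zero-pad the windowed frame.
    M = 1023;                              % analysis window length (assumed)
    N = 2^nextpow2(2*M);                   % first power of two >= 2*M (2048 here)
    X = fft(frame, N);                     % fft zero-pads the frame out to length N
    binHz = fs/N;                          % spacing of the FFT bin frequencies (Fs/N)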
Choice of Hop Size
Once the spectrum has been computed at a particular frame of the waveform, the STFT hops along the waveform and computes the spectrum of the next section of the sound. This hop size H is an important parameter, and its choice depends very much on the purpose of the analysis. In general, more overlap gives more analysis points and therefore smoother results across time, but the computational expense is proportionally greater. A general and valid criterion is that successive frames should overlap in time in such a way that all the data are weighted equally. A good choice is the window length divided by the main-lobe width in bins; for example, a practical value for the Hamming window is a hop size equal to one fourth of the window size [11].
Peak Detection in PSMS
The input sound has already been filtered around the region of greatest sensitivity of the human ear, and the remaining low- and high-frequency bands are analyzed here. Once the spectrum of the current frame is computed, the next step is to detect its prominent magnitude peaks. Theoretically, a sinusoid that is stable both in amplitude and in frequency (a partial) has a well-defined frequency representation: the transform of the analysis window used to compute the Fourier transform.
It should be possible to take advantage of this characteristic to distinguish partials from
other frequency components. However, in practice, this is rarely the case, since most
natural sounds are not perfectly periodic and do not have nicely spaced and clearly
defined peaks in the frequency domain.
Figure 3.11: Peak detection in LF and HF bands
There are interactions between the different components, and the shapes of the spectral
peaks cannot be detected without tolerating some mismatch. Only some instrumental
sounds (e.g., the steady-state part of an oboe sound) are periodic enough and sufficiently
free from prominent noise components that the frequency representation of a stable
sinusoid can be recognized easily in a single spectrum. A practical solution is to detect as many peaks as possible and delay the decision of what is a deterministic, or "well-behaved," partial to the next step in the analysis: the peak-continuation algorithm [2]. In this project, however, we track all the available trajectories, as the McAulay-Quatieri algorithm does.
A "peak" is defined as a local maximum in the magnitude spectrum, and the only
practical constraints to be made in the peak search are to have a frequency range and a
magnitude threshold. In fact, we should detect more than what we hear and get as many
sample bits as possible from the original sound, ideally more than 16. The measurement
of very soft partials, sometimes more than 80dB below maximum amplitude, will be
difficult and they will have little resolution. These peak measurements are very sensitive
to transformations, because as soon as modifications are applied to the analysis data,
parts of the sound that could not be heard in the original can become audible. The
original sound should be as clean as possible and have the maximum dynamic range, and
then the magnitude threshold can be set to the amplitude of the background noise floor.
Due to the sampled nature of the spectra returned by the FFT, each peak is accurate only to within half a sample. A spectral sample represents a frequency interval of Fs/N Hz, where Fs is the sampling rate and N is the FFT size. Zero padding in the time domain increases the number of spectral samples per Hz and thus increases the accuracy of simple peak detection. However, to obtain frequency accuracy on the level of 0.1% of the distance from the top of an ideal peak to its first zero crossing (in the case of a rectangular window), a zero-padding factor of about 1000 would be required. A more efficient scheme is to zero-pad only enough that quadratic (or other simple) spectral interpolation, using only the samples immediately surrounding the maximum-magnitude sample, suffices to refine the estimate to 0.1% accuracy.
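A minimal MATLAB sketch of this quadratic refinement, applied to a local maximum of the dB magnitude spectrum computed earlier, is given below; variable names follow the earlier sketches.

    % Sketch: parabolic interpolation of a spectral peak (magnitudes in dB).
    [~, k] = max(magSpec);                 % bin index of a detected local maximum (1 < k < N/2)
    a = magSpec(k-1); b = magSpec(k); c = magSpec(k+1);
    p = 0.5*(a - c)/(a - 2*b + c);         % fractional bin offset, -0.5 <= p <= 0.5
    peakFreq = (k - 1 + p)*fs/N;           % refined peak frequency in Hz
    peakAmp  = b - 0.25*(a - c)*p;         % refined peak magnitude in dB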
Figure 3.12: Missed Peaks
In real cases, some peaks might lie below the threshold of hearing; partials that were not audible before modifications may become clearly perceivable after modifications. SMS recommends going 80 dB below the threshold of hearing. However, the sinusoidal model used in this project is not a sine-plus-noise model, and the peak-detection algorithm only finds the prominent peak around a local maximum separated from other peaks by a specified distance. We therefore went 10 or 20 dB below the threshold of hearing to pick the peaks that were missed. In this project we do not deal much with modifications, because this is a data reduction scheme.
Figure 3.13: Peaks below threshold. Peaks picked and their corresponding phase matches; some peaks below the threshold were missed, a problem corrected by setting the threshold 10 dB below the actual threshold.
Peak Continuation
Once the spectral peaks corresponding to the low frequency and high frequency bands of
the current frame have been detected, the peak continuation algorithm adds them to the
incoming peak trajectories. The basic idea of the algorithm is that a set of "guides"
advances in time through the spectral peaks, looking for the appropriate ones (according
to the specified constraints) and forming trajectories out of them. Thus, a guide is an
abstract entity which is used by the algorithm to create the trajectories and the trajectories
are the actual result of the peak continuation process. The instantaneous state of the
guides, their frequency, phase and magnitude, are continuously updated as the guides are
turned on, advanced, and finally turned off. The schemes used in the sinusoidal model
(McAulay and Quatieri, 1984; 1986) [5] find peak trajectories both in the noise and
deterministic parts of a waveform, thus obtaining a sinusoidal representation for the
whole sound. These schemes are unsuitable when we want the trajectories to follow just
the partials. For example, when the partials change in frequency substantially from one
frame to the next, these algorithms easily switch from the partial that they were tracking
to another one which at that point is closer. In this project, the user supplies some parameters as input to the algorithm, so the process of tracking the musical and noise components is not fully automatic. The specifications of these parameters are given in the following sections.
Initial Guides
With this parameter, the user specifies the approximate frequency of the partials that are
known to be present in the sound, thus reserving guides for them. The algorithm adds
new guides to this initial set as it finds them. When no initial guides are specified, the
algorithm creates all of them. Another method is to create initial guides at equal intervals and allow the algorithm to update the guides using the data set being tracked.
Maximum Peak Deviation
Guides advance through the sound, selecting peaks. This parameter controls the maximum allowable frequency distance from a peak to the guide that selects it. It is useful to make this parameter a function of frequency in such a way that the allowable distance is larger for higher frequencies than for lower ones. The deviation can thus follow a logarithmic scale, which is perceptually more meaningful than a linear frequency scale. However, since the model used in this project tracks both noisy components and meaningful musical components, we do not use a logarithmic scale.
Peak Contribution to Guide
The frequency of each guide does not have to correspond to the frequency of the actual trajectory; it is updated every time it incorporates a new peak. This parameter is a number from 0 to 1 that controls how much the guide frequency changes when a new peak is incorporated. That is, given that the current guide has a frequency f̃, it determines the guide's value when it incorporates a peak with frequency g. If the value of the parameter is 1, the value of the guide f̃ is updated to g, so the peak makes the maximum contribution. If the value of the parameter is smaller, the contribution of the peak is correspondingly smaller: the new value falls between the current value f̃ and g. This parameter is useful, for example, to confine a guide to a narrow frequency band.
Maximum Number of Guides
This is the maximum number of guides used by the peak-continuation process at each particular moment in time. The total number of guides over the whole sound may be larger, because when a guide is turned off a new one can take its place.
Minimum Starting Guide Separation
A new guide can be created at any frame from a peak that has not yet been incorporated
into any existing guide. This parameter specifies the minimum required frequency
separation from a peak to the existing guides in order to create a new guide at that peak.
Consequently, through this parameter peaks which are very close to existing guides can
be rejected as candidates for starting guides.
Maximum Sleeping Time
When a guide has not found a continuation peak for a certain number of frames, the guide
is killed. This parameter specifies the maximum “non active” time, that is, the maximum
number of frames that the guide can be alive while not finding continuation peaks.
Maximum Length of Filled Gaps
Given that a certain sleeping time is allowed, we may wish to fill the resulting gaps. This parameter specifies the length of the largest gap to be filled (a number smaller than or equal to the maximum sleeping time). The gaps are filled by interpolating between the end points of the trajectory.
Minimum Trajectory Length
Once all the trajectories are created, this parameter controls the minimum trajectory
length. All trajectories shorter than this length are deleted.
The last two specifications are optional, and they were not used in this research work. The short trajectories that a minimum-trajectory-length constraint would delete constitute most of the noise-like part of the signal, so these specifications become useful only when a sine-plus-noise model is employed.
To describe the peak-continuation algorithm, assume that the frequency guides have been initialized with the initial guides and that they are currently at frame n. Suppose that the guide frequencies at the current frame are f̃1, f̃2, f̃3, ..., f̃p, where p is the number of existing guides. We want to continue the p guides through the peaks of frame n, whose frequencies are g1, g2, g3, ..., gm, thus continuing the corresponding trajectories. There are three steps in the algorithm: (1) guide advancement, (2) update of guide values, and (3) start of new guides. These steps are described next.
Guide Advancement
Each guide is advanced through frame n by finding the peak closest to its current value: the rth guide claims the frequency gi for which |f̃r - gi| is a minimum. The change in frequency must be less than the maximum peak deviation. The possible situations are as follows:
1. If a match is found within the maximum deviation, the guide is continued (unless there is a conflict to resolve), and the selected peak is incorporated into the corresponding trajectory.
2. If no match is found, it is assumed that the corresponding trajectory must "turn off" entering frame n, and its current frequency is matched to itself with zero magnitude. Since the trajectory amplitudes are linearly ramped from one frame to the next, the terminating trajectory ramps to zero over the duration of one hop size. Whether the guide itself is "killed" or not depends on the maximum sleeping time.
3. If a guide finds a match which has already been claimed by another guide, we
give the peak to the guide that is closest in frequency, and the “loser” looks for
another match. If the current guide loses the conflict, it simply picks the best
available non-conflicting peak which is within the maximum peak deviation. If
the current guide wins the conflict, it calls the assignment procedure recursively
on behalf of the dislodged guide. When the dislodged guide finds the same peak
and wants to claim it, it sees there is a conflict which it loses and moves on. This
process is repeated for each guide, solving conflicts recursively, until all possible
matches are made.
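A simplified MATLAB sketch of the guide-advancement step is given below; it assigns each guide, in turn, to the nearest unclaimed peak within the maximum peak deviation and omits the recursive conflict resolution of step 3, so it is only an approximation of the procedure, with illustrative variable names.

    % Sketch: advance each guide to the nearest unclaimed peak of frame n.
    % guides : current guide frequencies (Hz);  peaks : peak frequencies of frame n (Hz)
    % maxDev : maximum peak deviation (Hz)
    match   = zeros(size(guides));               % 0 means no continuation found
    claimed = false(size(peaks));
    for r = 1:numel(guides)
        [dev, i] = min(abs(peaks - guides(r)) + 1e12*claimed);   % skip claimed peaks
        if dev <= maxDev
            match(r) = i;                        % peak i continues trajectory r
            claimed(i) = true;
        end                                      % otherwise the trajectory ramps to zero
    end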
Update of guide values
Once all the existing guides and their trajectories have been continued through frame n, the guide frequencies are updated. There are two possible situations:
1. If a guide finds a continuation peak, its frequency is updated from f̃r to h̃r according to

       h̃r = α(gi - f̃r) + f̃r,   α ∈ [0, 1],

   where gi is the frequency of the peak that the guide has found at frame n and α is the peak contribution to the guide. When α is 1, the frequency of the peak trajectory is the same as the frequency of the guide, and the difference between guide and trajectory is lost.
2. If a guide does not find a continuation peak for the maximum sleeping time, the guide is killed at frame n; if it is still within the sleeping time, it keeps the same value (the value can be negated in order to remember that it has not found a peak). When the maximum sleeping time is 0, any guide that does not find a continuation peak at frame n is killed. To distinguish the guides that find a continuation peak from those that do not but are still alive, we refer to the former as active guides and the latter as sleeping guides.
Start of New Guides
New guides, and therefore new trajectories, are created from the peaks of frame n that are
not incorporated into trajectories by the existing guides. If the number of current guides is smaller than the maximum number of guides, a new guide can be started.
A guide is created at frame n by searching through the "unclaimed" peaks of the frame for the one with the highest magnitude that is separated from every existing guide by at least the minimum starting guide separation. The frequency of the selected peak becomes the frequency of the new guide. The actual trajectory is started in the previous frame, n-1, where its amplitude is set to 0 and its frequency to the current frequency, thus ramping in amplitude up to the current frame. This process is repeated until there are no more unclaimed peaks in the current frame or the number of guides has reached the maximum number of guides.
In order to minimize the creation of guides with little chance of surviving, a temporary
buffer is used for the starting guides. The peaks selected to start a trajectory are stored
into this buffer and continued by only using peaks that have not been taken by the
“consolidated” guides. Once these temporary guides have reached a certain length they
become “normal guides”.
Figure 3.14: Peak-continuation process. Here, g represents the guides and p the spectral peaks. The magnitude, frequency, and phase information at p form the input to the oscillator.
For harmonic sounds these guides can be created at the beginning of the analysis, setting their frequencies according to the harmonic series of the first fundamental found; for inharmonic sounds each guide is created when it finds the first available peak. When a fundamental has been found in the current frame, the guides can use this information to update their values. The guides can also be modified depending on the last peak incorporated. Therefore, by using the current fundamental and the previous peak we control the adaptation of the guides to the instantaneous changes in the sound. For a very harmonic sound, since all the harmonics evolve together, the fundamental should be the main control; but when the sound is not very harmonic, or the harmonics are not locked to each other and we cannot rely on the fundamental as a strong reference for all the harmonics, the information from the previous peak should carry a greater weight.
However, we do not focus only on harmonic partials; this project uses an algorithm that tracks any sound, harmonic or inharmonic. Each peak is assigned to the guide that is closest to it and that is within a given frequency deviation. If a guide does not find a match, it is assumed that the corresponding trajectory must "turn off." In inharmonic sounds, if a guide has not found a continuation peak for a given amount of time, the guide is killed. New guides, and therefore new trajectories, are created from the peaks of the current frame that are not incorporated into trajectories by the existing guides. If there are killed or unused guides, a new guide can be started; it is created by searching through the "unclaimed" peaks of the frame for the one with the highest magnitude.
The peak continuation algorithm presented is only one approach to the peak continuation
problem. The creation of trajectories from the spectral peaks is compatible with very
different strategies and algorithms; for example, hidden Markov models have been applied. An Nth-order Markov model provides a probability distribution for a parameter in the current frame as a function of its values over the past N frames.
With a hidden Markov model we are able to optimize groups of trajectories according to
defined criteria, such as frequency continuity. This type of approach might be very
valuable for tracking partials in polyphonic sounds and complex inharmonic tones. In
particular, the notion of "momentum" is introduced, helping to properly resolve crossing
fundamental frequencies [3].
Additive Synthesis: PSMS
A short tutorial on additive synthesis was given in Chapter 2. The peak-continuation algorithm returns the values of the prominent peaks organized into frequency trajectories. Each peak is a triad (Ar^l, ωr^l, φr^l), where l is the frame number and r is the number of the track to which it belongs. The synthesis process takes these trajectories, or their modification, and computes one frame of synthesized sound s^l(t) as

    s^l(t) = Σ_{r=1}^{R_l} Ar^l cos(ωr^l t + φr^l),

where R_l is the number of trajectories present in frame l and S is the length of the synthesis frame. A synthesis frame is S samples long and does not correspond to an analysis frame: without time scaling, synthesis frame l goes from the middle of analysis frame l-1 to the middle of analysis frame l, i.e., it corresponds to the analysis hop size. The final sound s(t) results from the juxtaposition of all the synthesis frames [2].
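A minimal MATLAB sketch of this oscillator-bank synthesis for one frame is shown below, with linear interpolation of amplitude and frequency between the previous and current frame values. The variable names and the linear ramps are illustrative assumptions; in this project the frame-edge smoothing is instead handled by the cubic-spline stage described next.

    % Sketch: additive synthesis of one S-sample frame from trajectory parameters.
    % A0, F0 : amplitudes and frequencies (Hz) at the previous frame
    % A1, F1 : amplitudes and frequencies (Hz) at the current frame
    % phi    : running phase (radians) of each trajectory
    S = 512;  fs = 44100;
    t = (0:S-1)/S;                               % 0..1 ramp across the frame
    frameOut = zeros(1, S);
    for r = 1:numel(A1)
        A = A0(r) + (A1(r) - A0(r))*t;           % linear amplitude ramp
        F = F0(r) + (F1(r) - F0(r))*t;           % linear frequency ramp
        phase = phi(r) + 2*pi*cumsum(F)/fs;      % integrate frequency to obtain phase
        frameOut = frameOut + A .* cos(phase);
        phi(r) = phase(end);                     % carry the phase into the next frame
    end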
In an SMS scheme, a peak-interpolation strategy is usually followed to avoid clicks between frames. In this project we replace that peak interpolation with a smooth cubic-spline interpolation stage.
Cubic Spline Interpolation
In any SMS or MQ-synthesis scheme, the peak-continuation stage is followed by a peak-interpolation stage. Peak interpolation smoothly interpolates the amplitude, frequency and phase parameters of each sinusoidal track between frames, and it helps avoid sharp clicks at frame boundaries. A click or crack is an undesirable audio artifact and is normally removed; in digital audio restoration, undesirable noise includes clicks/cracks and steady-state hiss. Peak interpolation is thus an interpolation in the parameter domain. In this research project, however, we replace peak interpolation with a smooth cubic-spline interpolation method that smoothly interpolates the abrupt voltage levels across the frame edges. The usual approach in vinyl-restoration code is to first detect the click and then fix it; our implementation is simpler because the crack is known to exist only at the frame edges.
Figure 3.15: (Top) with cracks; (Bottom) without cracks
At the heart of most click-removal methods is an interpolation scheme that replaces missing or corrupted samples with estimates of their true value. It is usually appropriate to assume that clicks have in no way interfered with the timing of the material, so the task is to fill in the "gap" with appropriate material of identical duration to the click. This amounts to an interpolation problem that makes use of the good data values surrounding the corruption and possibly takes account of signal information buried in the corrupted sections of data. An effective technique will be able to interpolate gap lengths from one sample up to at least 100 samples at a sampling rate of 44.1 kHz [28].
Figure 3.16: Cubic spline interpolation
The process employed in this work is not fully automatic; the user has to specify or change a parameter depending on the sound. A specific number of points is selected around the crack center at the frame edge, i.e., a few points from the previous frame and a few points from the current frame, and these are smoothly interpolated using the cubic-spline technique. The number of points fed to the interpolation algorithm depends on the nature of the sound, in particular its wavelength and zero crossings. For a low-frequency sound a larger number may be specified, and for a high-frequency sound a smaller number, to arrive at a desirable output.
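The frame-edge repair can be sketched as below, where nSide is the user-set, sound-dependent number of samples taken on each side of the frame boundary; the names, values, and choice of anchor points are illustrative.

    % Sketch: smooth a crack at a known frame boundary with a cubic spline.
    % y : synthesized signal with a crack centred at sample index edgeIdx
    nSide = 8;                                   % user-chosen, depends on the sound
    idx   = (edgeIdx - nSide):(edgeIdx + nSide); % samples around the crack centre
    keep  = [idx(1:3), idx(end-2:end)];          % clean anchor points away from the crack
    y(idx) = spline(keep, y(keep), idx);         % re-interpolate across the boundary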
FUSION OF SENSITIVE AND LESS SENSITIVE SPECTRA
Once the PCM samples of the MF region have been demodulated back to the original frequency band, they are fused with the LF and HF bands synthesized from the parameters received over the channel. A model of the fusion is shown below.
Figure 3.17: Spectral Fusion (A more general model)
DISTORTION AT TRANSITION BAND EDGES
Distortion can sometimes be expected at the transition bands. These may be phase distortions or other types of distortion at the transition edges between frequency bands. They can be suppressed by applying the same filters used before the analysis, this time at the synthesis end before the spectral bands are fused to form the final output; this proves successful.
The Downsampling method
This is the second method, discussed after the two-filter method. The downsampling method is applicable only to very low-frequency sounds. According to the duplex theory of pitch perception, low frequencies have poor frequency resolution but better time resolution, while high frequencies have better spectral resolution and poor time resolution. Around 640 Hz, both temporal and pattern-recognition effects are sensed [7].
Figure 3.18: The downsampling method.
This is because the human auditory system follows time impulses in the LF region. The downsampling method targets two things. First, since the LF sinusoids have poor spectral resolution, the low-frequency sound is downsampled in time by a factor of two. This shifts all the tracks up by a factor of two on the frequency scale; the band-rejection filter must therefore be designed with cutoff frequencies twice those that rejected the sensitive mid band in the two-filter method. The pitch increases by a factor of two and the duration decreases by a factor of two, so the spectral resolution of the tracks is increased, which facilitates the peak-tracking process. The second target is data reduction, which is also achieved because the time is compressed by a factor of two when we downsample. Finally, before resynthesis, a modification factor of two is set and the frequency parameters are divided by two to recover the original time base. However, this method is applicable only to very low-frequency sounds, because the sampling rate of the signal must be taken into consideration before the frequency components are shifted. Moreover, the sound is comparatively defective if the modification algorithms are not robust. Although the downsampling method only fits low-frequency instruments such as bass guitar or kick drum, it can be used effectively in the four-filter method that follows.
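A minimal sketch of the idea follows, assuming the low-frequency band (called lowBand here) has already been isolated and is band-limited well below a quarter of the sampling rate.

    % Sketch: downsampling method for the low-frequency band.
    lfFast = lowBand(1:2:end);     % decimate by 2; safe because the band is narrow
    % Analysed at the original sampling rate, lfFast appears pitch-shifted up by a
    % factor of two and lasts half as long, so its partials are easier to track
    % and the analysis data are halved.
    % At the synthesis end, a time-scale factor of two is applied and the frequency
    % parameters are divided by two, recovering the original pitch and duration.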
Duplex theory of pitch perception: application to the four-filter method
Figure 3.19: High Spectral resolution Four-filter method
The four-filter PSMS makes use of the duplex theory of pitch perception discussed in Chapter 2. The dual purpose of the four-filter method is greater spectral resolution, for better tracking and resynthesis, and greater potential for data reduction. The four-filter method improves on the two-filter method in terms of both fidelity and data reduction. It also illustrates a way of analyzing signals more faithfully, in which the trade-off between time and frequency resolution can be compensated to some extent. According to the duplex theory of pitch perception, low frequencies have poor spectral resolution and high time resolution, whereas high frequencies have poor time resolution and high spectral resolution.
Figure 3.20: A schematic picture explaining how the analysis frame length is changed
over the frequency scale.
At a crossover frequency of about 640 Hz, both temporal and pattern-recognition resolution are present. Regardless of which domain requires high resolution, when we synthesize the sound we need spectral frequency parameters that are very close to the original; the spectral resolution (number of bins) therefore cannot be varied. Hence, a variable analysis frame length is employed to achieve a better PSMS system: a narrow analysis frame (35-45 ms) is set for low-frequency sounds and a wider analysis frame (100 ms) is set for high-frequency sounds. The downsampling method is optional because it works well only for low-frequency harmonic signals.
Figure 3.21: The four filter method
The input sound is filtered using four band-pass filters: DC to 0.7 kHz, 0.5 kHz to 1 kHz, 0.9 kHz to 6.1 kHz (mid), and 5.9 kHz to 20 kHz. Each filtered band except the mid band forms an input to the sinusoidal model. Overlapping frames of variable length are set before the pass-band signals are sent into the sinusoidal model: the low-frequency bands use a short analysis frame and the high-frequency band a somewhat longer one. This method is a bit more expensive, but the resolution is comparatively good and it is an effective tactic for enhanced data reduction.
Figure 3.22: (Top) High resolution (Bottom) Low spectral resolution
Analysis of spectral resolution of low frequency sounds (0-700 Hz)
Note: There is no zero padding involved in the above plots
However, the amount of parametric information sent through the channel remains nearly the same for the two-filter and four-filter methods: the two-filter method captures a large number of parameters from two wide spectral bands, whereas the four-filter method captures fewer parameters from each of three narrower spectral bands.
Discarding Phase in HF regions
In HF regions the ear follows the amplitude envelope of the frequency spectrum and disregards its phase content. This psychoacoustic evidence was used in our sinusoidal model by discarding the HF phase parameters before they are fed into the oscillator. In the four-filter method, the fourth filter removes the lower spectral details, leaving only high-frequency spectral energy. The amplitude and frequency parameters corresponding to this HF band are transmitted; the phase parameters are discarded because they have little perceptual importance in the auditory scene. The experiment was conducted with a cymbal-crash sound, and the results are shown in Figure 3.23.
Figure 3.23 (a): Synthesizing HF band with phase parameters
Figure 3.23 (b): Synthesizing HF band without phase parameters
Figure 3.24: Plots explaining the synthesis of different band that have various analysis
frame length and their final fusion (Four filter method)
Hence it was verified experimentally that phase carries little sonic meaning at high frequencies. In this way the four-filter method was improved by discarding data that would otherwise traverse the channel to the synthesis port needlessly. Experiments were conducted both on inharmonic instruments, such as a cymbal crash, and on harmonic instruments, such as a harmonium; the results clearly indicate that phase information is not perceived at higher frequencies. High-frequency phase parameters can therefore be discarded, reducing the high-frequency information through the channel by one third.
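In the sinusoidal model this simply means giving the HF oscillators an arbitrary starting phase instead of a transmitted one, as in the brief sketch below; the zero-or-random initial phase and the variable name FhfCurrent are assumptions, since the thesis only states that the HF phase parameters are not transmitted.

    % Sketch: synthesize the HF band without transmitted phase parameters.
    phiHF = 2*pi*rand(numel(FhfCurrent), 1);   % arbitrary starting phases (assumption)
    % The oscillator bank then integrates the transmitted HF frequency trajectories
    % exactly as before, so only amplitudes and frequencies cross the channel for this band.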
POSSIBILITIES OF MODIFICATION
Figure 3.25: Possibilities of Modifications
The most powerful feature of analysis-synthesis schemes is that the music can be modified after analysis. The modifications can serve different music-production and composition applications, such as traditional time/pitch-scale modification, reverberation, chorusing, and harmonizing. Our synthesis-based method also allows some modifications, although they are somewhat complex and involve more computation than other schemes. The sinusoidal parameters are modified with a modification instruction to obtain the desired effect in the LF and HF regions, while the mid spectrum is modified using a phase vocoder; the two sounds are then added together. The phase vocoder used here is a commercially available tool.
Cross Effects
Figure 3.26: Cross Effect using PSMS modifications
A unique modification possibility in this project is to introduce meaningful cross effects. For example, we may obtain a pitch-shift effect by modifying the sinusoidal parameters while applying a harmonizing effect to the mid spectral samples with a phase vocoder. In the example above, the mid spectrum is given a chorusing effect using a commercial phase vocoder, and the frequency parameters of the LF-HF sinusoids are multiplied by random numbers. The cross effect provides a meaningful "digital effect" as output.
Demonstration: Advantages over sine plus noise model
Fidelity and data reduction both improve when a traditional sinusoidal-plus-noise model is replaced with a purely sinusoidal model, as is done in this thesis (two-filter method). The residual noise generally consists of the musical excitation and other noisy components; in fact, SMS is sometimes used in denoising applications. While a purely deterministic model consists only of clean harmonic tracks, the residual may contain frequencies near the harmonics of the deterministic tracks, and these nearby harmonics contribute a great deal to the music. When these nearby harmonics are modeled with an LPC estimate, the user needs to send a number of LPC coefficients approaching the analysis frame length to obtain the best estimate and minimize the error, which becomes an additional burden on the channel. If fewer LPC coefficients are sent, the nearby harmonics are not modeled properly as stochastic noise when the LPC filter is excited with white noise.
In most real-world cases, the non-musical noise is concentrated in the LF or HF regions, as rumble (LF) or hiss (tape HF). In the two-filter method we send all the sensitive nearby harmonics as PCM samples, and only the less sensitive noise in the LF and HF bands is modeled stochastically. The resulting sound, comprising the mid-band nearby harmonics and musical noise plus the stochastic LF and HF noise, is better than that of a fully stochastic residual model. Moreover, the number of LPC coefficients also decreases by a large factor. This makes it a better model in all respects; a demonstration by plots follows.
Figure 3.27 (a-k): Demonstration of the disadvantages of the sine-plus-noise model in a two-filter method
CHAPTER 4
TESTS AND RESULTS
Tests were conducted to analyze the success of this research project, and the results were plotted. The two-filter and four-filter methods were analyzed qualitatively, and the bit-rate reduction and audio compression ratios were tabulated.
Qualitative analysis
Listening tests were conducted on 12 subjects; five had musically trained ears and the rest did not. The subjects were instructed to rate the quality of the output sound by comparing it to the input sound, and they were specifically asked to judge and take into account any possible distortions in the output. They rated the output on a scale from 1 to 5, 1 being worst and 5 being best; decimal values between these integers were allowed so that subjects could rate the sound as precisely as possible while making critical judgments. The qualitative results for the two-filter method are plotted below:
Figure 4.1: Qualitative Results: Music Genre: Two filter method
Figure 4.2: Qualitative Results: Tonal Instruments: Two filter method
Figure 4.3: Qualitative Results: Percussion Instruments: Two filter method
The qualitative results for the four filter method are plotted below:
Figure 4.4: Qualitative Results: Music Genre: Four filter method
Figure 4.5: Qualitative Results: Tonal Instruments: Four filter method
Figure 4.6: Qualitative Results: Percussion Instruments: Four filter method
Bit rate calculation
The bit rates for all of the above sounds were calculated for both the two-filter and four-filter methods. The mid-frequency sensitive signal that enters the channel as PCM samples was reduced by an audio compression ratio of 4:1; for example, a sound sampled at the CD rate of 44.1 kHz, 705 kbps (mono), is reduced to 176 kbps. The parameters (amplitudes, frequencies, phases) were converted to binary format to calculate the number of bits per second, and the kbps of the LF-HF sinusoidal parameters were added to the kbps of the sensitive PCM samples to find the total bit rate and the audio compression ratio. All audio files were mono.
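As a sketch of how such a bit rate can be tallied, assuming each of the amplitude, frequency, and phase values of a trajectory breakpoint is quantized to 16 bits (the thesis does not state the word lengths, so this and the example counts are assumptions):

    % Sketch: tally the transmitted bit rate for the two-filter method (example values).
    bitsPerParam = 16;                 % assumed word length per parameter
    nBreakpoints = 20000;              % trajectory breakpoints in the clip (example)
    durationSec  = 10;                 % clip length in seconds (example)
    psmsKbps = 3*bitsPerParam*nBreakpoints/durationSec/1000;  % amplitude + frequency + phase
    midKbps  = 44100*16/4/1000;        % mid-band PCM after 4:1 reduction (176.4 kbps)
    totalKbps = psmsKbps + midKbps;
    ratio = (44100*16/1000) / totalKbps;   % compression ratio relative to 705.6 kbps mono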
Rock, 44.1 kHz, 705 kbps
  Two-filter: 343.58 kbps (PSMS) + 176 kbps (mid) = 519.58 kbps (3:2)
  Four-filter: 23.194 kbps (LF1) + 19.405 kbps (LF2) + 176 kbps (mid) + 83.421 kbps (HF) = 302.02 kbps (2:1)
Classical, 44.1 kHz, 705 kbps
  Two-filter: 66.134 kbps (PSMS) + 176 kbps (mid) = 242.134 kbps (3:1)
  Four-filter: 12.635 kbps (LF1) + 13.632 kbps (LF2) + 176 kbps (mid) + 11.824 kbps (HF) = 214.091 kbps (3:1)
Country, 44.1 kHz, 705 kbps
  Two-filter: 101.32 kbps (PSMS) + 176 kbps (mid) = 277.32 kbps (2:1)
  Four-filter: 18.021 kbps (LF1) + 20.453 kbps (LF2) + 176 kbps (mid) + 23.735 kbps (HF) = 238.209 kbps (3:1)
Jazz, 44.1 kHz, 705 kbps
  Two-filter: 141.95 kbps (PSMS) + 176 kbps (mid) = 317.95 kbps (2:1)
  Four-filter: 14.318 kbps (LF1) + 15.04 kbps (LF2) + 176 kbps (mid) + 34.173 kbps (HF) = 239.531 kbps (3:1)
Speech, 38 kHz, 608 kbps
  Two-filter: 84.592 kbps (PSMS) + 152 kbps (mid) = 236.592 kbps (2:1)
  Four-filter: 14.23 kbps (LF1) + 14.21 kbps (LF2) + 152 kbps (mid) + 28.91 kbps (HF) = 209.35 kbps (3:1)
Gottuvadhyam, 44.1 kHz, 705 kbps
  Two-filter: 84.310 kbps (PSMS) + 176 kbps (mid) = 260.310 kbps (3:1)
  Four-filter: 18.515 kbps (LF1) + 20.238 kbps (LF2) + 176 kbps (mid) + 23.447 kbps (HF) = 238.2 kbps (3:1)
Pipa, 48 kHz, 768 kbps
  Two-filter: 61.479 kbps (PSMS) + 192 kbps (mid) = 253.479 kbps (3:1)
  Four-filter: 12.707 kbps (LF1) + 17.179 kbps (LF2) + 192 kbps (mid) + 13.539 kbps (HF) = 235.425 kbps (3:1)
Sitar, 44.1 kHz, 705 kbps
  Two-filter: 187.72 kbps (PSMS) + 176 kbps (mid) = 363.72 kbps (2:1)
  Four-filter: 17.865 kbps (LF1) + 18.578 kbps (LF2) + 176 kbps (mid) + 53.838 kbps (HF) = 266.281 kbps (3:1)
Flute, 44.1 kHz, 705 kbps
  Two-filter: 49.375 kbps (PSMS) + 176 kbps (mid) = 225.375 kbps (3:1)
  Four-filter: 14.146 kbps (LF1) + 13.365 kbps (LF2) + 176 kbps (mid) + 26.649 kbps (HF) = 230.16 kbps (3:1)
Violin, 44.1 kHz, 705 kbps
  Two-filter: 122.15 kbps (PSMS) + 176 kbps (mid) = 298.15 kbps (2:1)
  Four-filter: 15.694 kbps (LF1) + 14.016 kbps (LF2) + 176 kbps (mid) + 30.016 kbps (HF) = 235.726 kbps (3:1)
Piano, 24 kHz, 384 kbps
  Two-filter: 14.835 kbps (PSMS) + 96 kbps (mid) = 110.835 kbps (4:1)
  Four-filter: 5 kbps (LF1) + 3.464 kbps (LF2) + 96 kbps (mid) + 1.815 kbps (HF) = 106.279 kbps (4:1)
Acoustic Guitar, 32 kHz, 512 kbps
  Two-filter: 14 kbps (PSMS) + 128 kbps (mid) = 142 kbps (3:1)
  Four-filter: 5.364 kbps (LF1) + 3.697 kbps (LF2) + 128 kbps (mid) + 0.256 kbps (HF) = 137.317 kbps (4:1)
Electric Bass, 30 kHz, 480 kbps
  Two-filter: 23.276 kbps (PSMS) + 120 kbps (mid) = 143.276 kbps (3:1)
  Four-filter: 13.165 kbps (LF1) + 4.57 kbps (LF2) + 120 kbps (mid) + 0.388 kbps (HF) = 138.123 kbps (3:1)
Low Tom, 44.1 kHz, 705 kbps
  Two-filter: 26.929 kbps (PSMS) + 176 kbps (mid) = 202.929 kbps (3:1)
  Four-filter: 3.554 kbps (LF1) + 5.1 kbps (LF2) + 176 kbps (mid) + 6.155 kbps (HF) = 190.809 kbps (3:1)
High Tom, 44.1 kHz, 705 kbps
  Two-filter: 48.768 kbps (PSMS) + 176 kbps (mid) = 224.768 kbps (3:1)
  Four-filter: 5.572 kbps (LF1) + 3.99 kbps (LF2) + 176 kbps (mid) + 9.861 kbps (HF) = 195.423 kbps (3:1)
Closed Snare, 44.1 kHz, 705 kbps
  Two-filter: 354.21 kbps (PSMS) + 176 kbps (mid) = 530.21 kbps (3:2)
  Four-filter: 18.141 kbps (LF1) + 16.343 kbps (LF2) + 176 kbps (mid) + 76.732 kbps (HF) = 287.216 kbps (2:1)
Open Hi-hat, 44.1 kHz, 705 kbps
  Two-filter: 419.49 kbps (PSMS) + 176 kbps (mid) = 595.49 kbps (3:2)
  Four-filter: 3.843 kbps (LF1) + 1.645 kbps (LF2) + 176 kbps (mid) + 94.578 kbps (HF) = 276.066 kbps (3:1)
Mid-Tom, 44.1 kHz, 705 kbps
  Two-filter: 53.304 kbps (PSMS) + 176 kbps (mid) = 229.304 kbps (3:1)
  Four-filter: 6.532 kbps (LF1) + 3.91 kbps (LF2) + 176 kbps (mid) + 13.507 kbps (HF) = 199.949 kbps (3:1)
Cymbal Crash, 44.1 kHz, 705 kbps
  Two-filter: 431.965 kbps (PSMS) + 176 kbps (mid) = 607.965 kbps (3:2)
  Four-filter: 2.456 kbps (LF1) + 14.2 kbps (LF2) + 176 kbps (mid) + 89.151 kbps (HF) = 281.807 kbps (2:1)
Table 4.1: Bit rates and compression ratios for the two-filter and four-filter methods
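To make the bookkeeping behind Table 4.1 concrete, the following is a minimal sketch of the bit-rate calculation described above; it is illustrative only and is not the code used in this research. The function names are hypothetical, and the sketch assumes 16-bit mono PCM with the 4:1 reduction of the mid-band stream mentioned earlier, so its totals differ slightly from the table, which rounds the mid-band rate to 176 kbps.

```python
# Minimal sketch of the bit-rate bookkeeping described above (illustrative only,
# not the code used in this research). Assumes 16-bit mono PCM and a 4:1
# reduction of the sensitive mid-band PCM stream.

def mid_band_kbps(sample_rate_hz, bits_per_sample=16, pcm_reduction=4):
    """Bit rate (kbps) of the sensitive mid-frequency PCM stream after reduction."""
    return sample_rate_hz * bits_per_sample / pcm_reduction / 1000.0

def total_kbps(sample_rate_hz, sinusoid_param_kbps):
    """Total bit rate: reduced mid-band PCM plus all LF/HF parameter streams."""
    return mid_band_kbps(sample_rate_hz) + sum(sinusoid_param_kbps)

# Example: the "Rock" row of Table 4.1, four-filter method (LF1, LF2, HF rates).
rock_total = total_kbps(44100, [23.194, 19.405, 83.421])
rock_ratio = (44100 * 16 / 1000.0) / rock_total      # source rate is about 705 kbps
print(f"{rock_total:.2f} kbps, roughly {rock_ratio:.1f}:1")
```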
Earlier, in the third chapter, it was mentioned that one needs to search below the threshold of hearing when picking peaks, because peaks that are inaudible in the original signal may become audible after modifications. However, the primary focus of this research is data reduction, and modifications are of lesser significance. Therefore, when good bit-rate reduction is desired, the factor below the threshold of hearing should be set to zero.
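For reference, the following is a minimal sketch of peak picking with a configurable margin below the threshold of hearing; it is not the analysis code used in this research. It assumes a magnitude spectrum calibrated to approximate dB SPL and uses Terhardt's well-known approximation to the threshold in quiet; the function and parameter names (for example margin_db, standing in for the "factor below the threshold of hearing") are assumptions made for illustration.

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation to the absolute threshold of hearing in dB SPL
    (valid for f_hz > 0)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def pick_peaks(mag_db, freqs_hz, margin_db=0.0):
    """Indices of local spectral maxima lying no more than margin_db below the
    threshold in quiet. margin_db = 0 keeps only peaks above the threshold,
    the setting suggested above when data reduction is the primary goal."""
    threshold = threshold_in_quiet_db(freqs_hz) - margin_db
    peaks = []
    for k in range(1, len(mag_db) - 1):
        is_local_max = mag_db[k] > mag_db[k - 1] and mag_db[k] >= mag_db[k + 1]
        if is_local_max and mag_db[k] >= threshold[k]:
            peaks.append(k)
    return peaks
```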
CHAPTER 5
CONCLUSION
Analysis showed that the interpolation of voltage values between time frames was not completely successful, so the output quality was affected to a small extent. However, if a commercial SMS system were used, this partial-synthesis idea could improve on existing schemes in both data reduction and fidelity. On the other hand, the flexibility for modification is reduced by the partial synthesis, since only the less sensitive regions are resynthesized. The bit rates and compression ratios are listed in Table 4.1 of the previous chapter. Finally, the various features of this synthesis scheme are compared with conventional perceptual coding schemes in Table 5.1 below.
Future Extensions
The future extensions of this project include transient modeling of the low- and high-frequency sounds using the sines-plus-transients-plus-noise scheme described in [27]. The entire project could also be implemented for MIDI files. There is a dip at the high-frequency end of the Fletcher-Munson contours; this dip could be exploited in the same way the mid-frequency dip was used in this research. However, that dip is not common to all age groups, since older listeners often cannot hear the highest frequencies. Run-length coding could be applied to the parameters representing frequency tracks by transmitting (frequency parameter, run length) pairs, as mentioned in [23]; a small sketch of this idea follows below. Another approach would be to use loudness-dependent band-pass and band-rejection filters in the two-filter method, although this may not produce a large change in data reduction. Finally, a perceptual coder could be used to code the mid-frequency signal in the future.
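As a rough illustration of the run-length idea (a sketch only; [23] describes the actual scheme), quantized frequency-track values that repeat across consecutive frames can be replaced by (value, run length) pairs. The function names below are hypothetical.

```python
def run_length_encode(track):
    """Encode a frequency track (one quantized value per frame) as
    (value, run_length) pairs, as suggested for stable frequency tracks."""
    pairs = []
    for value in track:
        if pairs and pairs[-1][0] == value:
            pairs[-1][1] += 1          # extend the current run
        else:
            pairs.append([value, 1])   # start a new run
    return [(v, n) for v, n in pairs]

def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Example: a steady 440 Hz partial over four frames followed by a small glide.
track = [440, 440, 440, 440, 441, 442]
print(run_length_encode(track))   # [(440, 4), (441, 1), (442, 1)]
```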
PROS AND CONS: A COMPARATIVE STUDY
Bit allocation
  Perceptual audio coding: Dynamic and adaptive.
  PSMS: No bit pool; bits are assigned according to sensitivity in a single, external decision, with no adaptive techniques involved.
Bit rate
  Perceptual audio coding: Fixed bit rates.
  PSMS: Depends on the sound.
Psychoacoustic base point
  Perceptual audio coding: Loudness perception, with less reliance on pitch perception.
  PSMS: Pitch perception.
Channel noise
  Perceptual audio coding: No.
  PSMS: Possible, since the scheme is a communication system.
Applications
  Perceptual audio coding: Storage-space reduction, television and radio broadcast, satellite transmission, military and musical applications.
  PSMS: Data reduction, television and radio broadcast, satellite transmission, military and musical applications.
Commercial usage
  Perceptual audio coding: Used commercially today.
  PSMS: A possible future commercial product.
Table 5.1: Pros and cons: perceptual coding and PSMS
Instead of the SMS/MQ synthesis technique used here to synthesize the less sensitive LF-HF regions, other synthesis techniques could be tried to determine which one best fits the system.
A low-bit-rate coder could also be used to code the sensitive signal instead of the entire signal. This would allow the sensitive signal to be coded accurately while the less sensitive signals are synthesized using SMS. The synthesized signal and the low-bit-rate coded signal could then be summed, and the resulting quality evaluated.
BIBLIOGRAPHY
1. Fletcher, H., and W. A. Munson, "Loudness, its definition, measurement and calculation," Journal of the Acoustical Society of America 5 (1933): 82-108.
2. Xavier Serra, "A System for Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition" (Ph.D. diss., Stanford University, 1989).
3. Tuomas Virtanen, "Audio Signal Modeling with Sinusoids plus Noise" (M.S. thesis, Tampere University of Technology).
4. Jean Laroche and Mark Dolson, "New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects," Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, New York, 1999): 17-20.
5. R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 4 (1986): 744-754.
6. Albert Bregman, Auditory Scene Analysis (Cambridge: MIT Press, 1994).
7. Perry R. Cook, Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics (Cambridge: MIT Press, 2001).
8. Ken C. Pohlmann, Principles of Digital Audio, 4th ed. (New York: McGraw-Hill, 2000).
9. Curtis Roads, The Computer Music Tutorial (Cambridge: MIT Press, 1996).
10. T. H. Andersen and K. Jensen, "Importance and representation of phase in the sinusoidal model," Journal of the Audio Engineering Society 52, no. 11 (2004): 1157-1169.
11. X. Serra and J. Smith, "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition," Computer Music Journal 14, no. 4 (1990): 12-24.
12. Website: http://www.mmk.ei.tum.de/persons/ter.html
13. E. Terhardt, G. Stoll, and M. Seewann, "Algorithm for extraction of pitch and pitch salience from complex tonal signals," Journal of the Acoustical Society of America 71 (1982): 679-688.
14. Ernst Terhardt, "Calculating virtual pitch," Hearing Research 1 (1979): 155-182.
15. Kelly Fitz, "The Reassigned Bandwidth-Enhanced Method of Additive Synthesis" (Ph.D. diss., University of Illinois at Urbana-Champaign, 1999).
16. Eric D. Scheirer and Barry L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language," Computer Music Journal 23, no. 2 (1999): 31-51.
17. Brian C. J. Moore, An Introduction to the Psychology of Hearing (San Diego: Academic Press, 2003).
18. R. Plomp, "Pitch of complex tones," Journal of the Acoustical Society of America 41, no. 6 (1967): 1526-1533.
19. X. Rodet and P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis," 93rd Convention of the Audio Engineering Society (San Francisco, 1992).
20. R. J. McAulay and T. F. Quatieri, "Speech Transformations Based on a Sinusoidal Representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 6 (1986): 1449-1464.
21. A. Eronen and A. Klapuri, "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000) (Istanbul, 2000): 753-756.
22. J. B. Allen and S. T. Neely, "Modeling the relation between the intensity just noticeable difference and loudness for pure tones and wideband noise," Journal of the Acoustical Society of America 102, no. 6 (1997).
23. Sylvain Marchand, "Compression of sinusoidal modeling parameters," Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, 2000): 273-276.
24. T. H. Andersen and K. Jensen, "Phase models in analysis/synthesis of voiced sounds," in Proceedings of the DSAGM (Copenhagen, 2001).
25. Charles Dodge and Thomas A. Jerse, Computer Music: Synthesis, Composition, and Performance, 2nd ed. (New York: Prentice Hall).
26. Thomas Quatieri, Discrete-Time Speech Signal Processing (New Jersey: Prentice Hall, 2001).
27. T. Verma and T. Meng, "Extended spectral modeling synthesis with transient modeling synthesis," Computer Music Journal 24, no. 2 (2000): 47-59.
28. Simon J. Godsill and Peter J. W. Rayner, Digital Audio Restoration (New York: Springer, 1998).
APPENDIX:
Sample plots obtained from experiments
Rock music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Country music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Speech, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Chinese Pipa, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Gottuvadhyam and Tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Sitar and Tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps