HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS
Master’s Thesis submitted to the faculty of University of Miami in partial
fulfillment of the requirements of the degree of Master of Science
by
Arvind Venkatasubramanian
Music Engineering Technology, Frost School of Music, University of Miami, P.O. Box 248165, Coral Gables, FL 33124-7610
May 2005

Research Advisor:
Mr. Colby N. Leider, Assistant Professor, Music Engineering Technology, Frost School of Music, University of Miami, Coral Gables

Thesis Panel:
Mr. Kenneth C. Pohlmann, Director of Music Engineering, Frost School of Music, University of Miami, Coral Gables
Dr. Edward P. Asmus, Associate Dean, Graduate Music Studies, Frost School of Music, University of Miami, Coral Gables
University of Miami
Master’s Thesis submitted to the faculty of University of Miami in partial fulfillment of the requirements of the degree of Master of
Science
HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS
Approved:

_____________________________
Ken C. Pohlmann
Director of Music Engineering

_____________________________
Colby N. Leider
Assistant Professor of Music Engineering

_____________________________
Dr. Edward P. Asmus
Associate Dean of Graduate Studies
VENKATASUBRAMANIAN, ARVIND (M.S., Music Engineering)
High-Fidelity, Analysis-Synthesis Data Rate Reduction for Audio Signals (May 2005)

Abstract of the master's research project at the University of Miami.
Research project supervised by Assistant Professor Colby Leider.
Number of pages in text: 146

Powerful music-compression algorithms have facilitated greater audio data reduction
today. This thesis describes a basic communication system that synthesizes the part of the
audio information to which the human auditory system is relatively insensitive and adds it
to the audio information to which the ear is sensitive, thus maintaining the originality of
the signal in the sensitive region. One of the biggest advantages available when writing
data-reduction algorithms today is the ability to avoid coding music data that humans do
not hear. The Fletcher-Munson curves depict the loudness-perception behavior of the
human auditory system: the auditory system is much more sensitive to the mid-frequency
region than to the low-frequency and high-frequency regions of the human hearing range.
In the proposed coder, the mid-frequency PCM information from the audio data is
transmitted unaltered through the channel. This involves modulation at the transmitter end
to reduce the sampling rate over the channel and demodulation at the receiver end to
recover the message. A sinusoidal model is used to synthesize the audio data
corresponding to the low-frequency and high-frequency regions. The sinusoidal model
involves a short-time Fourier analysis that extracts meaningful parameters, which are fed
into an oscillator at the synthesis end to reconstruct the sound. Modifications are therefore
possible before resynthesis.
This method is called the two-filter method, and the application is an alternative to existing
audio data-reduction algorithms that rely on psychoacoustic models and perceptual coding.
Because no approximation is made to the signal that lies in the region of greatest
sensitivity, the application tends to be perceptually transparent. Low frequencies have
poorer spectral resolution than high frequencies. The two-filter model was therefore
improved by downsampling the input, causing a pitch shift. This enables the SMS
system [2] to track clearer sinusoids, and because the input is downsampled, the data and
computation costs are halved. Modifications are applied to recover the original time
length.
Another model, called the four-filter method, which outperforms the partial
analysis/synthesis system (PSMS), was simulated based on the duplex theory of pitch
perception by using a variable time-frame length for different spectral bands. The four-
filter method was then improved further: in high-frequency regions the ear follows the
amplitude envelope of the frequency spectrum and disregards its phase content, and this
psychoacoustic evidence was exploited in the sinusoidal model by discarding the
high-frequency phase parameters that would otherwise be fed into the oscillator. The
results showed that the four-filter method is better than the two-filter method both in
quality and in data-rate reduction. High frequencies synthesized without phase information
sounded the same as high frequencies synthesized with phase, showing that the phase of
high-frequency signals has little perceptual importance.
DEDICATION
To human empiricism that protects humanity, accepting the pluralism, as a mark of respecting the uncertainty that rules my humbleness & humility.
ACKNOWLEDGEMENT

I would like to thank my family, relatives, and friends. I would like to extend a word of
thanks to Dr. Srinivasan Narasimhan. I acknowledge the thesis committee members for
their time. I would like to thank my academic mentor, Ken Pohlmann, for being patient,
giving me enough time, encouraging me at the right times, and for his understanding. I
would like to thank and appreciate my thesis supervisor, Colby Leider, for introducing me
to intellectual music and for guiding me through this research by contributing his ideas.
His knowledge of sound synthesis carried me to this final report; without his guidance
this work would not have been possible, and it kept me on the right track until the end.
I extend my thanks to Dr. Asmus for his help during summer 2003. My sincere thanks to
all my high school friends and teachers, undergraduate friends and teachers, Joe, my
Music Engineering friends, and the Indian graduate friends at UM. I appreciate those who
participated in the listening tests. I am thankful to Girish and to the Department of
Psychiatry and the Department of Gerontology at UM, who gave me a job. I thank Ali
Habashi for his help. I would like to thank Dr. Shariar Negadaripour (EE), Dr. Murat
Dogruel (EE), Dr. Micheal Scodrilis (EE), Dr. Don Wilson (Composition), Dr. Modestino
(EE), Dr. Mermelstein (EE), and Dr. Moeiz Tapia (EE), under whom I took courses and
completed class projects at UM. These courses and projects have indirectly helped me in
this thesis.
Table of Contents:
Chapter 1: Introduction ...................................................................................................1
Chapter 2: Literature Review..........................................................................................7
• Pitch Perception .......................................................................................7
Psychoacoustics of music...........................................................................7
Ear mechanisms and human auditory system............................................9
Duplex theory of pitch perception............................................................14
Missing fundamental effect ......................................................................15
Virtual and Spectral pitch ........................................................................17
• Fletcher-Munson Curves.......................................................................23
• Basic Communication System...............................................................27
• Analysis-Resynthesis..............................................................................31
Fourier Philosophy: Fast Fourier Transform Analysis...........................32
Classical theory of timbre ........................................................................36
Overview of sound synthesis techniques ..................................................37
Spectral modeling synthesis .....................................................................43
MQ-Synthesis ...........................................................................................49
Bandwidth Enhanced Sinusoidal Modeling .............................................56
• Phase Synthesis.......................................................................................59
• Perceptual Coding V Partial Synthesis Based Data Reduction.........63
Chapter 3: The Research: A Partial synthesis based audio data reduction..............75
• On the use of pitch perception in data reduction................................75
Fletcher and Munson Curves: Data Sets .................................................78
• The General Procedure: Two-filter method.........................................83
Modulation and Demodulation: Mid-frequency band .............................85
FFT Analysis of low-frequency and high-frequency bands ......................89
Peak detection ...........................................................................................94
Peak Continuation ....................................................................................99
Additive Partial Sinusoidal Synthesis (PSMS)........................................108
Cubic Spline Interpolation ......................................................................109
Fusion of the sensitive and less sensitive data........................................112
• Downsampling method .........................................................................113
• Duplex theory of pitch perception: Applications ...............................115
Improving Partial SMS: Four-filter method ...........................................115
Discarding HF Phase .............................................................................119
Possibilities in modifications ..................................................................122
• Advantages: Experimenting with a sine plus noise model ...............124
Chapter 4: Results: .......................................................................................................131
• Listening tests and results ...................................................................132
• Data reduction and results ..................................................................135
Chapter 5: Conclusion..................................................................................................139
• Future extension of the project...........................................................139
Pros and Cons: Perceptual coding V PSMS ........................................140
BIBLIOGRAPHY..........................................................................................................142
APPENDIX.....................................................................................................................144
List of Figures:
Figure 1.1: Overview of the proposed coder
Figure 2.1: The second dimension of pitch: chroma
Figure 2.2: The missing fundamental effect
Figure 2.3: Fletcher-Munson curves
Figure 2.4: Stages of communication
Figure 2.5: Basic communication system
Figure 2.6: Types of communication
Figure 2.7: Overview of general analysis and synthesis technique
Figure 2.8: (a-c) Nyquist sampling theorem
Figure 2.9: Additive synthesis
Figure 2.10: The amplitude progression of the partials of a trumpet tone
Figure 2.11: SMS: Block diagram of the analysis process [2]
Figure 2.12: SMS: Block diagram of the synthesis process [2]
Figure 2.13: McAulay-Quatieri sinusoidal analysis-synthesis system [5]
Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]
Figure 2.15: Peak detection in the MQ approach [5]
Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
Figure 2.17: Lemur
Figure 2.18: Lemur graphical tool
Figure 2.19: Bandwidth-enhanced sinusoidal modeling
Figure 2.20: MPEG audio compression and decompression
Figure 3.1: Perceptual coding approach vs. synthesis-based approach
Figure 3.2: Fletcher-Munson original curves (Fig. 3 in [1])
Figure 3.3: Figure 2 in [1]
Figure 3.4: Figure 3 in [1]
Figure 3.5: MATLAB plot of Figure 3.4
Figure 3.6: The schematic block diagram employed in our synthesis-based data reduction (two-filter method)
Figure 3.7: The two-filter method: band-pass and band-elimination filters; violin spectrum, FS = 44100, mono, 16 bit
Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16 bit, mono at 705 kbps)
Figure 3.9: Original and windowed short-time signals, Fourier analysis (Hanning window, 75% overlap)
Figure 3.10: Magnitude and phase spectrum of LF and HF bands
Figure 3.11: Peak detection in LF and HF bands
Figure 3.12: Missed peaks
Figure 3.13: Peaks below threshold
Figure 3.14: Peak continuation process
Figure 3.15: Crack removal: (top) with cracks; (bottom) without cracks
Figure 3.16: Cubic spline interpolation: crack removal
Figure 3.17: Spectral fusion (a more general model)
Figure 3.18: The downsampling method
Figure 3.19: High spectral resolution four-filter method
Figure 3.20: A schematic picture explaining how the analysis frame length is changed over the frequency scale
Figure 3.21: The four-filter method
Figure 3.22: (Top) High spectral resolution; (bottom) low spectral resolution: analysis of spectral resolution of low-frequency sounds (0-700 Hz)
Figure 3.23: Synthesizing the HF band with and without phase parameters
Figure 3.24: Plots explaining the synthesis of the different bands that have various analysis frame lengths and their final fusion (four-filter method)
Figure 3.25: Possibilities of modifications
Figure 3.26: Cross effect using PSMS modifications
Figure 3.27: Demonstration: advantages of the sine-plus-noise model in a two-filter method
Figure 4.1: Qualitative results: music genres: two-filter method
Figure 4.2: Qualitative results: tonal instruments: two-filter method
Figure 4.3: Qualitative results: percussion instruments: two-filter method
Figure 4.4: Qualitative results: music genres: four-filter method
Figure 4.5: Qualitative results: tonal instruments: four-filter method
Figure 4.6: Qualitative results: percussion instruments: four-filter method
Appendix figures:
Rock music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Country music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Speech, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Chinese pipa, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Gottuvadhyam and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Sitar and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
List of Tables:
Table 2.1: Critical bandwidth as a function of center frequency and critical band rate [8]
Table 3.1: Fletcher Munson curve: data sets
Table 4.1: Bit rates and audio compression ratio (Two-filter and four-filter method)
Table 5.1: Pros and cons: Perceptual coding V PSMS
CHAPTER 1
INTRODUCTION
The aim of this research project is to present a communication system that encodes less
information yet decodes all of the necessary information, while at the same time improving
the fidelity of existing synthesis-based data-reduction algorithms. The primary goal of
data reduction is achieved by following a simple tactic.
The human auditory system is most sensitive to the mid-frequency range (1-5 kHz) of the
spectrum. Experiments show that critical bands are much narrower at low frequencies
than at high frequencies; three-fourths of the critical bands lie below 5 kHz, so the ear
receives more information from low and mid frequencies and less from high frequencies.
We transmit through the communication channel the audio PCM data that represent the
mid-frequency spectrum above the human threshold of hearing. A simple modulation
technique is used: the mid-frequency passband is modulated down to baseband and
downsampled in time. At the receiver end the modulated data are upsampled and then
demodulated so that the message is recovered. This facilitates data reduction.
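To make this transmitter/receiver step concrete, the sketch below (written in Python with NumPy/SciPy purely for illustration; the band edges, filter order, and decimation factor are assumed values, not the settings used in this thesis) band-passes the sensitive region, shifts it down to baseband with a single-sideband mix, decimates it for the channel, and reverses the process at the receiver.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

    def tx_midband(x, fs, f_lo=1000.0, f_hi=5000.0, decim=4):
        """Isolate the sensitive mid band, shift it down so it starts at 0 Hz, and decimate."""
        bp = butter(6, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
        mid = sosfiltfilt(bp, x)                        # keep only the 1-5 kHz region
        t = np.arange(len(mid)) / fs
        analytic = hilbert(mid)                         # single-sideband (positive-frequency) copy
        baseband = np.real(analytic * np.exp(-2j * np.pi * f_lo * t))   # band now sits near 0 Hz
        return resample_poly(baseband, 1, decim)        # fewer samples cross the channel

    def rx_midband(y, fs, f_lo=1000.0, decim=4):
        """Upsample the received baseband data and shift it back up to the mid band."""
        up = resample_poly(y, decim, 1)
        t = np.arange(len(up)) / fs
        analytic = hilbert(up)
        return np.real(analytic * np.exp(2j * np.pi * f_lo * t))        # back in the 1-5 kHz band

Because the shifted band occupies only a few kilohertz of baseband, the decimated stream needs only a fraction of the original sample rate, which is where the reduction in transmitted data comes from.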
The sinusoidal model of Xavier Serra [2] is used to synthesize the low-frequency and
high-frequency bands of the spectrum. The audio for the low-frequency and high-
frequency regions of the human auditory range is synthesized at the receiver end of the
communication system. A band-elimination filter removes the mid-frequency band to
which humans are most sensitive. The outputs of the band-elimination filter, comprising
the less-sensitive low and high ends of the
spectrum, become inputs to the sinusoidal model. A band-pass filter is used to isolate the
sensitive frequencies of the mid spectrum. Since two filters are used, this method is
called the two-filter method. Sinusoidal modeling involves a short-time Fourier
analysis of overlapping time frames. Each time frame is windowed before analysis. The
short-time Fourier transform gives the spectral details of the current frame. A simple
peak-detection algorithm detects the vital peaks, i.e., local maxima in the spectrum. The
amplitudes of the peaks and their corresponding frequencies and phases form the
inputs of the oscillator at the synthesis end. Before synthesis, a peak-continuation
algorithm connects the spectral points (peaks) to form the sinusoids, commonly called
tracks. In spectral modeling synthesis and the McAulay-Quatieri algorithm the connected
tracks are smoothly interpolated from frame to frame. In this thesis, however, that
interpolation of frequency-domain parameters is replaced by a smooth cubic-spline
interpolation of the abruptly changing voltage levels between frames in the time domain.
This helps reduce the computational expense.
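As an illustration of this analysis stage, the Python sketch below (NumPy assumed; the window, frame length, and threshold are illustrative choices rather than the settings used in this work) windows one frame, computes its short-time spectrum, and keeps local maxima above a threshold as the peak candidates whose amplitudes, frequencies, and phases would drive the oscillator bank.

    import numpy as np

    def frame_peaks(frame, fs, threshold_db=-20.0):
        """Return (frequency, magnitude in dB, phase) for spectral peaks in one frame."""
        n = len(frame)
        window = np.hanning(n)
        spectrum = np.fft.rfft(frame * window)            # short-time Fourier transform of the frame
        mag = np.abs(spectrum) / (window.sum() / 2.0)     # normalize so a full-scale sine is ~0 dB
        mag_db = 20.0 * np.log10(mag + 1e-12)
        phase = np.angle(spectrum)

        peaks = []
        for k in range(1, len(mag_db) - 1):
            # a vital peak is a local maximum that rises above the detection threshold
            if mag_db[k] > threshold_db and mag_db[k] > mag_db[k - 1] and mag_db[k] >= mag_db[k + 1]:
                peaks.append((k * fs / n, mag_db[k], phase[k]))
        return peaks

    # Example: a 1 kHz tone analyzed in a 1024-sample frame at 44.1 kHz
    fs = 44100
    t = np.arange(1024) / fs
    print(frame_peaks(np.sin(2 * np.pi * 1000.0 * t), fs))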
The stochastic signals in the sensitive mid spectral band are transmitted as PCM
data. Therefore, the sinusoidal-plus-noise model was replaced with a purely sinusoidal
model in which even noise is modeled into tracks, although a typical SMS is a
sine-plus-noise model. In SMS, shorter tracks are usually deleted and analyzed
stochastically by a linear-prediction method. A convincing nth-order polynomial fit of the
stochastic frequency response is possible at the synthesis end if the linear-prediction
coefficients are sent through the channel; white noise can then be used as the excitation
of the linear-prediction filter to obtain a stochastic approximation of the noise.
Stochastic modeling is not
included in this project. It is generally better to build a deterministic-plus-stochastic model
because, in the real world, physical music signals are made up of sinusoids and musical
noise (excitation). Moreover, human beings do not follow the exact phase of noisy
transient sounds; in most cases it does not carry perceptually meaningful information.
Even though the stochastic-modeling approach is not followed in this thesis, the
advantages of including stochastic analysis in the two-filter method are demonstrated
briefly.
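Although stochastic modeling is not implemented in this project, the residual modeling just described can be sketched as follows in Python (NumPy/SciPy assumed; the prediction order and single-frame handling are simplifying assumptions): linear-prediction coefficients are fitted to a noise residual, and white noise is then filtered through the resulting all-pole filter to approximate it.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.linalg import solve_toeplitz

    def lpc(residual, order=12):
        """Estimate linear-prediction coefficients of a noise residual (autocorrelation method)."""
        r = np.array([residual[: len(residual) - k] @ residual[k:] for k in range(order + 1)])
        a = solve_toeplitz((r[:order], r[:order]), r[1 : order + 1])   # normal equations
        return np.concatenate(([1.0], -a))                             # error filter A(z)

    def resynthesize_noise(coeffs, n_samples, gain=1.0):
        """Drive the all-pole filter 1/A(z) with white noise to approximate the residual."""
        excitation = np.random.randn(n_samples)
        return gain * lfilter([1.0], coeffs, excitation)

    # Example: model one short frame of a noise-like residual and regenerate it
    frame = np.random.randn(2048)
    coeffs = lpc(frame)
    approx = resynthesize_noise(coeffs, len(frame))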
Figure 1.1: Overview of the proposed coder
A logical note about this system is that, according to the duplex theory of pitch
perception, low frequencies have poor frequency resolution compared to mid and high
frequencies. We use the SMS technique to synthesize the LF region; false sinusoidal-
trajectory connections and trajectory breaks are therefore expected either not to show up
or not to be audible as artifacts in the low frequencies of the output sound. Though this is
a logical conclusion, engineering is all about improving models. Hence, this research work
also introduces a new four-filter method that models the sound better, offering greater
promise of high fidelity and effective data reduction.
The four-filter method involves two filters in the low-frequency region, one in the mid
(sensitive) region, and one in the high-frequency region. Based on the frequency location
of each spectral band, the analysis frame length is changed in the time domain to yield an
appropriate frequency resolution in the frequency domain. Low frequencies have poor
spectral resolution, and the auditory system follows the time impulses at low frequencies.
Hence the analysis frame length is set to a small value there to capture time resolution,
and the value increases for each filter as we move toward the high-frequency end. This
not only results in smoother sinusoidal connections but also reduces the number of
parameters that must be sent through the channel for proper sound reconstruction.
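A minimal sketch of this per-band analysis follows, in Python with NumPy; the band edges and frame lengths below are illustrative assumptions, not the values used in the four-filter implementation.

    import numpy as np

    # Illustrative band edges (Hz) and analysis frame lengths (samples) at fs = 44.1 kHz.
    # The sensitive 1-5 kHz band is transmitted as PCM and therefore not analyzed here;
    # the frame length grows from the low-frequency filters toward the high-frequency one.
    BANDS = [((20.0, 350.0), 512), ((350.0, 1000.0), 1024), ((5000.0, 20000.0), 4096)]

    def analyze_bands(x, fs, hop_ratio=0.25):
        """Short-time Fourier analysis using a frame length chosen per frequency band."""
        results = []
        for (f_lo, f_hi), n in BANDS:
            hop = int(n * hop_ratio)
            window = np.hanning(n)
            frames = []
            for start in range(0, len(x) - n, hop):
                spectrum = np.fft.rfft(x[start:start + n] * window)
                freqs = np.arange(len(spectrum)) * fs / n
                keep = (freqs >= f_lo) & (freqs < f_hi)   # only this band's bins are retained
                frames.append((start / fs, spectrum[keep]))
            results.append({"band": (f_lo, f_hi), "frame_length": n, "frames": frames})
        return results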
The four-filter method was then improved further. In HF regions the ear follows
the amplitude envelope of the frequency spectrum and disregards its phase content. This
psychoacoustic evidence was exploited in our sinusoidal model by discarding the HF phase
parameters that would otherwise be fed into the oscillator.
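The synthesis side of this idea can be sketched as a small additive oscillator bank in Python (NumPy assumed); when no phase parameters are supplied, each high-frequency partial simply starts at zero phase instead of a transmitted value.

    import numpy as np

    def synthesize_partials(freqs, amps, duration, fs, phases=None):
        """Additive synthesis of one frame as a sum of oscillators.
        When phases is None the transmitted phase parameters are discarded (zero phase)."""
        t = np.arange(int(duration * fs)) / fs
        if phases is None:
            phases = np.zeros(len(freqs))
        out = np.zeros_like(t)
        for f, a, p in zip(freqs, amps, phases):
            out += a * np.cos(2 * np.pi * f * t + p)
        return out

    # Three high-frequency partials rendered without any phase information
    hf_frame = synthesize_partials([6000.0, 7500.0, 9100.0], [0.2, 0.1, 0.05], 0.05, 44100)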
The other way to improve the system is to shift the low-frequency region up the
frequency scale by downsampling in time by a factor of two. The sinusoids then have
better resolution, and the trajectory tracking becomes smooth and reliable. At the
same time the duration of the music is halved. This reduces the number of
parameters by a factor of two. After the parameters are obtained, the frequency parameters
are divided by two and sent through the channel. Modifications are applied to time-
stretch the signal back to its original length. If the downsampling factor becomes larger,
the listener may hear a "warbling" effect, which is an undesirable artifact in this case.
Chapter 3 contains more information on sinusoidal trajectory tracking.
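A rough sketch of this bookkeeping in Python (NumPy/SciPy assumed; analyze is a hypothetical placeholder for any frame-based partial estimator, not a routine from this thesis):

    import numpy as np
    from scipy.signal import decimate

    def lf_parameters_via_downsampling(x_lf, fs, analyze):
        """Analyze the LF band from a 2:1 decimated signal, then restore the frequency scale.

        `analyze` is a placeholder returning, per frame, a list of (freq, amp, phase) tuples."""
        y = decimate(x_lf, 2)     # half the samples and half the analysis frames
        params = analyze(y, fs)   # treated at the old rate, the LF content sits an octave higher
        corrected = [[(f / 2.0, a, p) for (f, a, p) in frame] for frame in params]
        # At synthesis time each frame is rendered over twice its analysis hop,
        # time-stretching the output back to the original duration.
        return corrected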
In this complex world perfection is never achieved, and that applies to this project too. The
success of this algorithm is judged by how close the system's audio output is to the
original audio. The complexity of real physical sounds is very difficult to represent, even
when we break it into the separate dimensions of time, frequency, and amplitude; to
represent it we need both the magnitude response and the phase response of the system.
Therefore the success of this synthesis-based music data reduction rests entirely on how
close the synthesized music is to the original music.
The synthesis method used here is musical sinusoidal-plus-noise modeling based
on the past research of Xavier Serra [2]. Sounds produced by musical instruments
and other physical systems can be modeled as a sum of deterministic and stochastic parts,
or as a sum of sinusoids plus a noise residual. The sinusoids are produced by a harmonic
vibrating system; the residual contains the energy produced by the excitation
mechanisms and other components that are not the result of periodic vibration [3]. This
synthesis method is applicable mainly to musical material. A more general scheme that
fits any sound, sometimes even noise, is the McAulay-Quatieri algorithm [5]. Our system
in this research works for any type of sound, from tonal to non-tonal to noisy
transient sounds. Our system allocates more bits to the signals to which humans are
sensitive. A perceptual coder must analyze a short-time signal to adaptively allocate more
bits to the meaningful music signal, fewer bits to less meaningful non-musical signal, and
no bits to useless information. If low-bit-rate coding could be used to code the
sensitive mid-frequency signals, this scheme could avoid all adaptive bit-allocation
computation, because the switch from sensitive to less-sensitive signal becomes a direct,
one-step external decision by the engineer. Hence, if low-bit-rate coding could be applied
to the sensitive frequencies, the opportunities to avoid computational cost increase. At the
same time, high fidelity could be attained because the algorithm focuses on the sensitive
frequencies, granting them bits liberally from the bit pool. However, this work is reserved
for the future. The third chapter describes the research project in more detail.
CHAPTER 2
LITERATURE REVIEW
PSYCHOACOUSTICS OF MUSIC
Psychoacoustics is the study of human auditory perception, ranging from the biological
design of the ear to the brain’s interpretation of aural information. Sound is only an
academic concept without our perception of it. Psychoacoustics is the branch of study
that explains the subjective response to everything we hear. It is only our response to
sound that fundamentally matters. Psychoacoustics seeks to reconcile acoustical stimuli
and all the scientific, objective, and physical properties that surround them, with the
physiological and psychological responses evoked by them.
Psychoacoustics can be defined simply as the psychological study of hearing. The aim of
psychoacoustic research is to find out how hearing works. In other words, the aim is to
discover how sounds entering the ear are processed by the ear and the brain in order to
give the listener useful information about the world outside.
The ear and its associated nervous system is an enormously complex, interactive system.
The physiology of the human hearing system has evolved incredible powers of
perception. At the same time it has its limitations. The ear is astonishingly acute in its
ability to detect nuance or defect in a signal. It is also tolerant of portions of the signal that do
not have perceptual importance. The accuracy of a coded signal can be very low, but this
accuracy is very frequency-dependent and time-dependent.
The ear is a highly developed physical organ (the eye, for example, can only receive
frequencies over one octave), but the ear is useful only when coupled to the interpretative
powers of the brain. Those mental judgments form the basis for everything we
experience from sound and music. The left and right ears do not differ physiologically in
their capacity for detecting sound, but their respective right and left brain halves do. The
two halves loosely divide the brain’s functions. [8]
PITCH AND PITCH PERCEPTION
Pitch refers to the tonal height of a sound object, e.g. a musical tone or the human voice.
The use of the term pitch is, however, often inconsistent in that the term is used both for a
stimulus parameter (i.e., synonymous with frequency) and for an attribute of auditory
sensation. People concerned with speech processing mostly use the term in the
former sense, meaning the fundamental frequency (oscillation frequency) of the glottal
oscillation (vibration of the vocal folds). In psychoacoustics (and so in the present
discussion) the term is used throughout in the latter sense, i.e., meaning an auditory
(subjective) attribute. The ANSI definition of psychoacoustical terminology says that
“pitch is that auditory attribute of sound according to which sounds can be ordered on a
scale from low to high”. To date this definition still is a useful basis, though it must be
complemented by taking account of certain additional aspects. [12]
EAR MECHANISMS IN PITCH PERCEPTION
The ear performs the transformation from acoustical energy to mechanical energy and
ultimately to the electrical impulses sent to the brain, where information contained in
sound is perceived. The outer ear collects sound and its intricate folds help us to assess
directionality. The ear canal resonates at about 3 kHz, providing extra sensitivity in the
frequency range critical for speech intelligibility. The eardrum transduces acoustical
energy into mechanical energy; it reaches maximum excursion at about 120 dB SPL,
above which it begins to distort the waveform. Three bones in the middle ear,
colloquially known as hammer, anvil and stirrup provide impedance-matching to
efficiently convey sounds in air to the fluid filled inner ear.
The coiled basilar membrane detects the amplitude and frequency of sound; those
vibrations are converted to electrical impulses and sent to the brain as neural information
along a bundle of nerve fibers. The brain decodes the period of stimulus and point of
maximum stimulation along the basilar membrane to determine frequency activity in
local regions surrounding the stimulus.
Examination of the basilar membrane shows that the ear contains roughly 30,000 hair
cells arranged in multiple rows along the basilar membrane, roughly 32 mm long; this is
the organ of Corti. The cells detect local vibrations of the basilar membrane and convey
audio information to the brain via electrical impulses. Frequency discrimination is such
that at low frequencies, tones a few hertz apart can be distinguished; however, at high
frequencies, tones must differ by hundreds of hertz. In any case, hair cells respond to the
strongest stimulation in their local region; this is called a critical band, a concept
introduced by Harvey Fletcher. Experiments show that critical bands are much narrower
at low frequencies than at high frequencies; three-fourths of the critical bands are below 5
kHz; the ear receives more information from low frequencies and less from high
frequencies. Critical bands are approximately 100 Hz wide for frequencies from 20 to
400 Hz and approximately 1/5 octave in width for frequencies from 1 to 7 kHz. Previous
research shows that critical bands can be approximated with the equation:
Critical Bandwidth in Hertz = 24.7(4.37F + 1)
where F = center frequency in kHz. [12]
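For example, at a center frequency of F = 1 kHz the formula gives 24.7(4.37 + 1) ≈ 133 Hz, while at F = 0.1 kHz it gives 24.7(0.437 + 1) ≈ 35 Hz.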
The bark is the unit of perceptual frequency; a critical band has a width of one bark;
1/100 of a bark equals 1 Mel. The bark scale relates absolute frequency (in Hertz) to
perceptually measured frequencies such as pitch or critical bands. Using a bark scale, the
physical spectrum can be converted to a psychological spectrum. In this way, a pure tone
(a single spectrum line) can be represented as a psychological masking curve.
The pitch place theory further explains the action of the basilar membrane. Carried by
the surrounding fluid, a sound wave travels the length of the membrane and peaks at
particular places along the membrane, where the greatest vibration of the membrane
occurs, corresponding to different frequencies. Specifically, high
frequencies are sensed at the membrane near the middle ear while low frequencies are
sensed at the farther end. The wave excited by a high-frequency sound does not reach the
far end of the basilar membrane. However, a low-frequency sound will pass through all
high frequency places to reach the far end. Because hair cells tend to vibrate at the
frequency of strongest stimulation, they will convey that frequency in a critical band,
ignoring lesser stimulation. This excitation curve is described by the cochlear spreading
function, an asymmetrical contour. Critical bands are important in perceptual coding
because they show that the ear discriminates between energy in the band, and energy
outside the band; in particular, this promotes masking. [8]
Perception of pitch is a complicated issue. Definitions of pitch as a sensory
(subjective) attribute tend toward a concept that includes not only the aspect of perceived
height but in addition one or even more aspects of tones that are relevant in music. The
most prominent of these additional aspects is octave equivalence: the notion that tones an
octave apart are somehow similar and so, in certain musical respects, "equivalent". Pitch
must be regarded as a two-dimensional attribute such that height is only one of two
dimensions. The second "dimension" is ordinarily termed chroma. According to this
concept a pitch is said to have both a certain height and a certain musical-categorical
value (chroma), e.g. "c-ness", "d-ness", etc. This is often illustrated by Roger Shepard’s
helical model. In that model, pitches are represented by points on an ascending helix such
that the vertical height of their position reflects pitch height, while the rotational angle of
the position corresponds to chroma. On the helix, pitches with one and the same chroma
(in music denoted by the same letter, c, d, e, etc.) are situated vertically above or below
one another. Practically any sound of real life including the tones of musical instruments
evokes several pitches at a time, though often (in particular for the harmonic complex
tones produced by conventional musical instruments) one of them is most prominent and
then is said to be the pitch. So the weak point of the ANSI definition is that there is no
guarantee that any sound having pitch indeed can unambiguously be positioned on the
low-high dimension. The auditory system also becomes confused when a C and a G
are played simultaneously, one tone perceived by the right ear and one by the left. [12]
When one listens to a pair of successive musical tones, one can ordinarily tell whether or
not the tones are equal in pitch; or if the first is higher in pitch than the second; or vice
versa. However, even for ordinary musical tones there is octave equivalence, which
means that tones may be confused with one another although their oscillation frequencies
differ by a factor of two. This implies that for harmonic complex tones there exists a
certain ambiguity of pitch which naturally emerges from the multiplicity of pitch.
The ambiguity of pitch can be much amplified by suppressing certain harmonics from the
Fourier spectrum of a "natural" harmonic complex tone. Shepard has described
observations on harmonic complex tones whose Fourier spectrum consisted only of
harmonics that were in an octave relationship, i.e., the 1st, 2nd, 4th, 8th, 16th, etc. While
the musical pitch class (the chroma) of such tones is well defined, the absolute height of
pitch is quite ambiguous; that is, octave confusions are very likely to occur.
Figure 2.1: The second dimension of pitch: Chroma [12]
This is particularly true when the frequency of the first harmonic is near the lower limit
of the hearing range while the upper part tones extend up to the high end of the hearing
range. In that case, there is indeed little if any information available to the ear about
what the fundamental frequency (oscillation frequency) actually is.
When, for instance, the oscillation frequency of the above type of tone is 10 Hz and the
number of part tones chosen is 11, the listener is exposed to a spectrum of part tones with
the frequencies 10, 20, 40, 80, ... ,10240 Hz. When that tone is followed by another
having twice the oscillation frequency of the first, the listener is exposed to 20, 40, 80,
up to 20480 Hz, and it is not surprising that one will not perceive much of a difference, if
any. So, under these conditions there is "perfect" octave equivalence.
From these notions it is easy to understand that when the ratio between the oscillation
frequencies of the two tones is 1.414, the listener at first sight cannot be expected to be
able to tell whether the second tone is higher in pitch than the first or vice versa. The
tritone paradox originates from the observation that listeners in fact do make fairly
consistent decisions on which of the two tones is higher in pitch, i.e., whether they heard
an upward or downward step of pitch. However, while the responses of individual
listeners are fairly consistent and reproducible, different listeners may give opposite
responses. Moreover, the responses of individual listeners turn out to be dependent on the
absolute height of the oscillation frequencies. That is, when the listening experiment is
made with a base frequency of, e.g., 12 Hz instead of 10 Hz, the individual responses
may systematically change. This was regarded as a particularly "paradoxical" outcome.
The basic aspects of the tritone paradox can be fairly well explained by the theory of
virtual pitch. However, the theory cannot account for the observed individual
differences, as the factors governing those differences as yet are unknown. [12]
DUPLEX THEORY OF PITCH PERCEPTION
Generally the ear mechanism works differently for low-frequency and high-frequency
pitch perception. At very low frequencies, we may hear successive features of a
waveform so that it is not heard as having just one pitch. The ear follows the energy
envelope in the LF region on the time scale. It takes into account the number of time
bursts per second and hence pitch sensation is based on periodicity. For high-frequency
contents the ear takes the position of vibration along basilar membrane of the cochlea into
account. In HF regions, the ear follows the amplitude envelope of the frequency
spectrum and leaves out its phase content. The two mechanisms appear to be about
equally effective at a frequency around 640 Hz. This is popularly called the Duplex
theory of pitch perception. [7] For frequencies well above 1000 Hz, the pitch frequency is
heard only when the fundamental is actually present.
THE MISSING FUNDAMENTAL EFFECT
When two single-frequency tones are present in the air at the same time, they will
interfere with each other and produce a beat frequency. The beat frequency is equal to the
difference between the frequencies of the two tones and if it is in the mid-frequency
region, the human ear will perceive it as a third tone, called a “subjective tone" or
"difference tone".
When two sound waves of different frequency approach the ear, the alternating
constructive and destructive interference causes the sound to be alternately soft and
loud – a phenomenon that is called "beating". The beat frequency is equal to the absolute
value of the difference in frequency of the two waves.
The subjective tones, which are produced by the beating of the various harmonics of the
sound of a musical instrument, help to reinforce the pitch of the fundamental frequency.
Most musical instruments produce a fundamental frequency plus several higher tones
that are whole-number multiples of the fundamental. The beat frequencies between the
successive harmonics constitute subjective tones that are at the same frequency as the
fundamental and therefore reinforce the sense of pitch of the fundamental note being
played. If the fundamental is 50 Hz and two of its successive harmonics, 150 Hz and 200 Hz,
beat with each other, the resultant 50 Hz equals the fundamental and hence
reinforces the pitch. If the lower harmonics are not reproduced because of poor fidelity
or filtering of the sound reproduction equipment, humans still hear the tone as having the
pitch of the non-existent fundamental because of the presence of these beat frequencies.
This is called the missing fundamental effect. It plays an important role in sound
reproduction by preserving the sense of pitch (including the perception of melody) when
reproduced sound loses some of its lower frequencies. The presence of the beat
frequencies between the harmonics gives a strong sense of pitch.
Figure 2.2: The missing fundamental effect
Fletcher in his first paper proposed that the missing fundamental was re-created by non-
linearities in the mechanism of the ear. He soon abandoned this false conclusion and in
his second paper described experiments saying that “a tone must include three successive
partials in order to be heard as a musical tone, a tone that has the pitch of the
fundamental, whether or not the fundamental is present”. [7]
For fundamental frequencies of up to about 1400 Hz, the pitch of a complex tone is
determined by the second and higher harmonics and not by the fundamental, whereas
beyond this frequency the opposite holds; this is the case both for tones with harmonics
of equal amplitude and for tones with harmonics of which the amplitudes fall by 6
dB/octave. For fundamental frequencies of up to about 700 Hz, the pitch is determined
by the third and higher harmonics; for frequencies up to about 350 Hz, by the fourth and
higher harmonics [18].
SPECTRAL PITCH AND VIRTUAL PITCH
The pitch of sine tones is, with high probability, a "place pitch", i.e., it depends on the place
of maximal excitation of the cochlear partition and is ultimately a result of peripheral
auditory Fourier analysis. On the other hand, it is evident that the pitch of many types
of complex tone cannot be explained by that principle, in particular the pitch of harmonic
complex tones whose fundamental Fourier component is weak or entirely missing.
Attempts have been made to resolve this conflict by searching for a parameter of sound
and a mechanism that accounts both for the pitch of sine tones and for that of complex
tones. That search has not succeeded to this day. Besides the pitch of sine tones there is
another type of pitch, namely virtual pitch. Both spectral pitch and virtual pitch ultimately
depend on aural Fourier analysis; the conceptual distinction is that, while any spectral
pitch is conceived as immediately corresponding to a spectral singularity, virtual pitch is
modeled as being deduced from a set of spectral pitches at a further stage of auditory
processing. The relationship between spectral pitch
and virtual pitch is analogous to that between primary and virtual visual contour in many
respects.
There is hardly any sound that does not elicit any spectral pitch at all. The harmonic
complex tones of real life, i.e., voiced speech and musical tones, are aurally represented
by a number of spectral pitches that correspond to the lower harmonics. The formants of
speech vowels elicit corresponding spectral pitches. Even random sound signals often are
either "colorized" by spectral irregularities, which will give rise to steady spectral pitches,
or there can occur instantaneous irregularities in the short-term Fourier spectrum that
elicit spectral pitches of which both the height and the instant of occurrence are random.
When any real-life sound (e.g., a footstep, a knock at the door, splashing water, the sound of a
car's engine, and fricative phoneme of speech) can be identified by ear, one can be sure
that - besides temporal structure - spectral pitch is involved. Spectral pitch is the most
important carrier of auditory information, as it is an element of higher-order, Gestalt-like
types of auditory percepts such as, e.g., the pitch of musical tones, the strike note of bells,
the root of musical chords, and the quality of a particular vowel.
The telephone channel does not distort the pitch of speech, although transmission is
confined to the frequency range from about 300 to about 3400 Hz. When we suppress
bass reproduction, we will notice that the fundamental becomes inaudible, while the
speaker's pitch continues to be well reproduced. The kind of pitch of the fundamental
that we may hear if the fundamental is strong enough, is the pitch of a sine tone; it is of
the spectral pitch type. The pitch that we ordinarily hear, however, does not depend on
the fundamental being audible; it is extracted by the auditory system from a range of
the Fourier spectrum that extends above the fundamental. The latter type of pitch is
termed virtual pitch.
A procedure was described for the automatic extraction of the various pitch percepts
which may be simultaneously evoked by complex tonal stimuli. The procedure is based
on the theory of virtual pitch, and in particular on the principle that the whole pitch
percept depends both on analytic listening (yielding spectral pitch) and on holistic
perception (yielding virtual pitch). The more or less ambiguous pitch percept governed
by these two pitch modes is described by two pitch patterns: the spectral-pitch pattern,
and the virtual-pitch pattern. Each of these patterns consists of a number of pitch (height)
values and associated weights, which account for the relative prominence of every
individual pitch. The spectral-pitch pattern is constructed by spectral analysis, extraction
of tonal components, evaluation of masking effects (masking and pitch shifts), and
weighting according to the principle of spectral dominance. The virtual-pitch pattern is
obtained from the spectral-pitch pattern by an advanced algorithm of sub-harmonic
coincidence assessment.
It can be concluded that, as an attribute of auditory sensation, virtual pitch is
fundamentally different in type from spectral pitch. This conclusion is strongly suggested
by the fact that one can hear both types of pitch at a time, having the same height.
Evidently, it is possible to communicate one and the same pitch (in terms of pitch height)
through two drastically different perceptual "channels": Spectral pitch is communicated
immediately, i.e., by a Fourier component's frequency, while virtual pitch is
communicated by providing to the auditory system information about the oscillation
frequency of a complex signal that is implied in the Fourier spectrum as a whole.
Formation of virtual pitch can essentially be said to be a process of subharmonic
matching. The tonal aspects of any sound are primarily represented by a set of spectral
pitches, and pertinent virtual pitches are "inferred" on the basis of the presumption that in
any case they must be subharmonic to the spectral pitches. The virtual pitch mechanism
deals with both "harmonic" and "inharmonic" sounds as well, though internally it strictly
sticks to the presumption that each and every virtual-pitch candidate must be a
subharmonic of a spectral pitch.
Where the partials in a sound are harmonically related, but with the first member of the
series missing (for example, a sound with partials at 500 Hz, 750 Hz, 1000 Hz, 1250 Hz,
etc.), a virtual pitch can be heard at 250 Hz - the missing fundamental. Where the partials
are not exactly harmonic, a virtual pitch is still heard at about the same place, but the
exact frequency turns out to be determined in quite a complicated way by the frequencies
of the individual partials. No comprehensive rule for determining virtual pitch is yet
known. Extensive research has been done on virtual pitch and it has been proved that it is
not due to a simple explanation such as difference tones, but rather a side effect of the
human hearing mechanism. There is no doubt that the strike note of a bell is a virtual
pitch, as will be explained below. Virtual pitch effects often dominate spectral pitches,
for example, in bells the strike note is about an octave below the nominal even if the
tierce, only a minor third away, is very strong.
Frequencies of partials present in sounds can be measured with scientific instruments, or
spectrum analyzers. Pitches cannot be measured with instruments. They exist only in our
perception of a sound. Only a human listener can tell us the pitch of a sound - and
different listeners may disagree on the perceived pitch. [12]
Spectral pitch is defined as an elementary auditory object that immediately represents a
spectral singularity. The simplest and most prominent example is the pitch of a sine tone.
A virtual pitch is characterized by the presence of harmonics or near-harmonics. A
spectral pitch corresponds to individual audible pure-tone components. Most pitches
heard in normal sounds are virtual pitches and this is true whether the fundamental is
present in the spectrum or not. The crossover from virtual to spectral pitch is thought to be
at about 800 Hz, but this depends on the selection of clear sinusoidal components in the
spectrum. This follows the duplex theory of pitch perception. [7]
A procedure for the schematic and automatic extraction of "fundamental pitch" from
complex tonal signals, such as voiced speech and music was developed by Ernst
Terhardt. While the aurally relevant "fundamental" of a complex signal cannot be defined
in purely mathematical terms, an existent model of virtual-pitch perception provided a
suitable basis [13]. The procedure comprised the formation of determinant spectral
pitches (which correspond to the frequencies of certain signal components), and the
deduction of virtual pitch (or "fundamental frequency") from those spectral pitches. The
latter deduction was accomplished by a principle of sub-harmonic matching, for whose
realization a simple, universal, and efficient algorithm was found. While the calculation
may be confined to the determination of "nominal" virtual pitch, certain typical auditory
phenomena, such as the influence of SPL, partial masking, and interval stretch, were
accounted for, whereby the 'true' virtual pitch was obtained [14]. An algorithm for
extraction of pitch and pitch salience from complex tonal signals is mentioned in [13].
The core idea behind the project presented here is to make use of this natural
phenomenon of virtual pitch in audio data reduction. Human sound perception is not
sensitive to the detailed spectral shape or phase of non-periodic sounds. The sinusoidal
model used in this research work takes advantage of the human inability to perceive the
exact spectral shape of signals. Even the phase of transient noisy signals has limited
perceptual importance. [3]
In pure tones, because the frequency composition is so simple, no distinction can be made
between three different properties of a tone: its fundamental frequency, its pitch, and its
spectral balance. The case is different with complex tones. Let us take the three
properties in turn. The fundamental frequency of a tone is the frequency of the repetition
of the waveform. If the tone is composed of harmonics of a fundamental, this repetition
rate will be at the frequency of the fundamental regardless of whether the fundamental is
actually present. In harmonic tones, the perceived pitch of the tone is determined by the
fundamental of the series, even when the fundamental itself is not present. This phenomenon is
sometimes called “the perception of the missing fundamental”. Finally, the spectral
balance is the relative intensity of the higher and lower harmonics. This feature controls
our perception of the brightness of the tone [6].
We will make great use of these pitch-related concepts, especially the missing
fundamental effect and the duplex theory of pitch perception, in this research project in an
effective manner, as described in Chapter 3. While synthesizing the low-frequency and
high-frequency spectrum, our system might not reproduce all of the exact harmonics and
their corresponding energy levels. The missing fundamental effect explained above is
therefore usefully applied here to mask the absence of missing partials in music and the
absence of formants in speech applications. Minor errors in tracking the sinusoids
would hence be perceived less; only major tracking errors in SMS will clearly show up.
THE FLETCHER-MUNSON CURVES, 1933
Figure 2.3: The Fletcher-Munson equal-loudness curves (1933)
In 1933, Fletcher and Munson decided to gather some information about how we
perceive different frequencies at different amplitudes. They came up with the Equal
Loudness Contours or the Fletcher and Munson Curves. These curves give information
on the threshold of hearing at different frequencies and the apparent levels of equal
loudness at different frequencies.
The ear is not equally sensitive to all frequencies, particularly in the low and high
frequency ranges. The response to frequencies over the entire audio range has been
charted, originally by Fletcher and Munson in 1933, with later revisions by other authors,
like Robinson and Dadson, as a set of curves showing the Sound Pressure Levels (SPL)
of pure tones that are perceived as being equally loud. The curves are plotted for each 10
dB rise in level with the reference tone being at 1 kHz.
The curves are lowest in the range from 1 to 5 kHz, with a maximum dip around 3300
Hz, indicating that the ear is most sensitive to frequencies in this range. The intensity
level of higher or lower tones must be raised substantially in order to create the same
impression of loudness. The phon scale was devised to express this subjective impression
of loudness, since the decibel scale alone refers to actual sound pressure or sound
intensity levels.
Historically, the A, B, and C weighting networks on a sound level meter were derived as
the inverse of the 40, 70, and 100 dB Fletcher-Munson curves and used to determine
sound level. The lowest curve represents the threshold of hearing, the highest the
threshold of pain. The actual data sets and their requirement in this research will be
explained in detail in Chapter 3.
Dynamic Range
The instruments used to measure the magnitudes of sounds respond to changes in air
pressure. However, sound magnitudes are often specified in terms of intensity, which is
the sound energy transmitted per second (i.e., the power) through a unit area in a sound
field. For a medium such as air, there is a simple relationship between the pressure
variations of a plane sound wave in a free field (i.e., in the absence of reflected sound) and
the acoustic intensity; intensity is proportional to the square of the pressure variation. [17]
The dynamic range of an acoustic or electro-acoustic system is the difference in sound-pressure
level between the saturation or overload level and the background noise level, measured in decibels.
This range may be expressed as a signal-to-noise ratio for maximum output. For a sound
or a signal, its dynamic range is the difference between the loudest and quietest portions.
The human hearing system has a dynamic range of about 120 dB between the threshold
of hearing and the threshold of pain.
The intensity level where a sound becomes just audible is the threshold of hearing. For a
continuous tone of between 2000 and 4000 Hertz, heard by a person with good hearing
acuity under laboratory conditions, this is a sound pressure of 0.0002 dyne/cm2 and is given
the reference level of 0 dB.
While 0 dB is the reference employed, the threshold of hearing varies considerably with
lower and higher frequencies. This curve is also called minimum audible field (MAF).
Alternate units for this reference level are: 2 x 10^-4 microbar (µbar); 2 x 10^-5 newton/m2
(N/m2); 2 x 10^-5 pascal (Pa); 20 micropascal (µPa).
The threshold of pain is the intensity level of a loud sound that causes pain in the ear,
usually between 115 and 140 dB.
The square of the sound pressure is proportional to sound intensity. SPL can be
calculated in the same manner and is measured in decibels:
SPL = 10 log10 (r/rref)^2 = 20 log10 (r/rref)
where r is the given sound pressure and rref is the reference sound pressure.
Decibel is the unit of a logarithmic scale of power or intensity called the power level or
intensity level. The decibel is defined as one tenth of a bel where one bel represents a
difference in level between two intensities I1, I0 where one is ten times greater than the
other.
Intensity level = 10 log10 (I1 /I0) (dB)
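For example, a sound of intensity 10^-6 watt/m2, measured against the 10^-12 watt/m2 reference, has an intensity level of 10 log10(10^6) = 60 dB.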
Because of the very large range of sound intensity which the ear can accommodate, from
the loudest (1 watt/m2) to the quietest (10^-12 watt/m2), it is convenient to express these
values as a function of powers of 10. The result of this logarithmic basis for the scale is
that increasing a sound intensity by a factor of 10 raises its level by 10 dB; increasing it
by a factor of 100 raises its level by 20 dB; by 1,000, 30 dB and so on. When two sound
sources of equal intensity or power are measured together, their combined intensity level
is 3 dB higher than the level of either separately. 0 dB is defined as the threshold of
hearing, and it is with reference to this internationally agreed upon quantity that decibel
measurements are made.
The phon is a unit used to describe the loudness level of a given sound or noise. The system is
based on the equal-loudness contours, where 0 phon at 1,000 Hz is set at 0 decibels, the
threshold of hearing at that frequency. The hearing threshold of 0 phon then lies along
the lowest equal-loudness contour. If the intensity level at 1,000 Hz is raised to 20 dB, the
second curve is followed.
For the purpose of measuring sounds of different loudness, the sone scale of subjective
loudness was invented. One sone is arbitrarily taken to be 40 phons at any frequency, i.e.,
at any point along the 40 phon curve on the graph. Two sones are twice as loud, e.g. 40 +
10 phons = 50 phons. Four sones are twice as loud again, e.g. 50 + 10 phons = 60 phons.
The relationship between phons and sones is shown in the chart, and is expressed by the
equation:
Phon = 40 + 10 log2 (Sone)
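Worked example: substituting into this formula, 1 sone gives 40 + 10 log2(1) = 40 phons, 4 sones give 40 + 10 log2(4) = 60 phons, and 16 sones give 40 + 10 log2(16) = 80 phons; each doubling of loudness in sones adds 10 phons.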
BASIC COMMUNICATION SYSTEM BLOCK
The research project described in Chapter 3 is basically a transmitting and receiving
communication system, so an overview of a very basic functional communication
block is given here. Today communication has entered our lives in so many different
forms that it is very difficult to lead a life without the various appliances and tools born out
of it. Communication is the process of conveying something from one point to another. It
can be classified according to whether the transmitter and receiver are within line of sight
of each other or separated by a greater distance.
If the two points are beyond the line of sight, then the relevant branch of communication
engineering comes into the picture, known as telecommunication engineering.
Figure 2.4: Stages of communication
In communication engineering, physical messages such as sound, words, pictures, etc.,
are converted into equivalent electrical values, called signals. This electrical signal is
conveyed to a distant place through a communication medium, and at the receiving end
the electrical signal is reconverted into the original message.
Figure 2.5: Basic communication system
Source
The message produced by the source is not necessarily electrical in nature; it may be a voice signal, a picture signal, and so on. An input transducer is therefore required to convert the original physical message into a time-varying electrical signal. These signals are called baseband signals, message signals, or modulating signals. At the destination another transducer converts the electrical signal back into the appropriate message.
Transmitter
The transmitter, comprising electrical and electronic components, converts the message signal into a form suitable for propagation over the communication medium. This is often achieved by modulating a carrier signal (i.e., a high-frequency signal that carries the modulating or message signal), which may be an electromagnetic wave. The resulting wave is referred to as the modulated signal.
Modulation and Demodulation
Modulation is the process by which some characteristic of a high-frequency carrier signal is varied in accordance with the instantaneous value of another signal, called the modulating or message signal. The signal containing the information or intelligence to be transmitted is known as the modulating or message signal; it is also known as the baseband signal. The term baseband designates the band of frequencies representing the signal supplied by the source of information. Usually the frequency of the carrier is much greater than that of the modulating signal. The signal resulting from the process of modulation is called the modulated signal. Demodulation is the process of recovering the original message signal from the modulated carrier, and it is performed at the receiving end.
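As an illustration of these ideas (and not of the specific coder described in Chapter 3), the following Python sketch modulates a low-frequency message onto a carrier and recovers it by coherent demodulation followed by a crude moving-average low-pass filter; all signal parameters are arbitrary example values.

import numpy as np

fs = 8000                                   # example sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)
message = np.cos(2 * np.pi * 20 * t)        # 20 Hz baseband message
carrier = np.cos(2 * np.pi * 1000 * t)      # 1 kHz carrier

modulated = message * carrier               # modulation shifts the message up around the carrier

# Coherent demodulation: multiply by the carrier again, then low-pass filter so that
# the baseband term survives and the component at twice the carrier frequency is removed.
mixed = modulated * carrier
kernel = np.ones(40) / 40                   # crude moving-average low-pass (5 ms)
recovered = 2 * np.convolve(mixed, kernel, mode="same")

# Away from the edges the recovered waveform matches the message to within a few percent.
print("max recovery error:", float(np.max(np.abs(recovered[100:-100] - message[100:-100]))))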
Channel
The transmitter and the receiver are usually separated in space. The channel provides the connection between the source and the destination. Regardless of its type, the channel degrades the transmitted signal in a number of ways, producing signal distortion. This occurs because of the imperfect bandwidth response of the channel and the contamination of the signal by channel noise.
Receiver
The main function of the receiver is to extract the message signal from the degraded version of the transmitted signal. The transmitter and receiver are carefully designed to avoid distortion and to minimize the effect of noise, so that faithful reproduction of the message emitted by the source is possible. The receiver operates on the received signal so as to reconstruct a recognizable form of the original message signal and deliver it to the user at the destination.
Communication types
Figure 2.6: Types of communication
This research focuses on digital communication of the mid-frequency-band PCM samples, as indicated in the figure above, plus transmission of the spectral parameters of the low-frequency and high-frequency bands.
ANALYSIS/RESYNTHESIS
Analysis-resynthesis is a technique in which the input signal is analyzed over short time segments and its spectrum is computed. The musician makes the needed modifications, and the desired sound is resynthesized in the final stage. One of the major application tools that uses this technique is the phase vocoder [4].
The analysis of a sound, to identify the harmonics that occur in the sound signal, is performed through the estimation of its power spectrum. Samples of musical tones are analyzed using short-time Fourier analysis to determine the time-varying frequency characteristics. The analysis is carried out on short segments of the input signal through a technique called windowing, with window widths based on the amplitude-envelope parameters obtained by analyzing the amplitude of the waveform with respect to time. The Fast Fourier Transform (FFT) is then applied to these discrete sections to obtain a spectrum of the sound signal. This data is then used for resynthesis of the original sound. The spectral modeling synthesis employed in this research works in the same way.
Figure 2.7 Overview of general analysis and synthesis technique
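A minimal sketch of the short-time analysis step is given below, assuming a Hann window, a 1024-sample frame, and a 512-sample hop; these values are illustrative choices rather than the settings used in this thesis.

import numpy as np

def stft_frames(signal, frame_size=1024, hop=512):
    """Yield (frame_index, magnitude_spectrum, phase_spectrum) for each windowed frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_size] * window
        spectrum = np.fft.rfft(frame)
        yield i, np.abs(spectrum), np.angle(spectrum)

fs = 44100
t = np.arange(int(0.5 * fs)) / fs
tone = np.sin(2 * np.pi * 440 * t)                      # a 440 Hz test tone
i, mag, phase = next(stft_frames(tone))
peak_bin = int(np.argmax(mag))
print(f"frame {i}: spectral peak near {peak_bin * fs / 1024:.1f} Hz")   # close to 440 Hz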
THE FOURIER PHILOSOPHY: DISCRETE FOURIER TRANSFORM
Continuous
For a continuous function of one variable f(t), the Fourier Transform F(f) will be defined
as:
F(f) = ∫_{-∞}^{∞} f(t) e^{-j2πft} dt
And the inverse transform as
f(t) = ∫_{-∞}^{∞} F(f) e^{j2πft} df
where j is the square root of -1 and e denotes the natural exponential,
e^{jφ} = cos(φ) + j sin(φ)
Discrete
Consider a complex series x(k) with N samples of the form
x0, x1, x2, …, xk, …, xN-1
where each xk is a complex number,
xk = xreal + j ximag
Further, assume that the series outside the range 0 to N-1 is extended N-periodically, that is, xk = xk+N for all k. The Fourier transform of this series will be denoted X(k); it also has N samples. The forward transform is defined as
X(n) = (1/N) Σ_{k=0}^{N-1} x(k) e^{-j2πkn/N}, for n = 0 … N-1
and the inverse transform is defined as
x(n) = Σ_{k=0}^{N-1} X(k) e^{j2πkn/N}, for n = 0 … N-1
Of course although the functions here are described as complex series, real-valued series
can be represented by setting the imaginary part to 0. In general, the transform into the
frequency domain will be a complex valued function, that is, with magnitude and phase.
Magnitude = |X(n)| = (Xreal² + Ximag²)^0.5
Phase = tan⁻¹(Ximag / Xreal)
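The transform pair and the magnitude/phase computation above can be implemented directly. The following Python sketch follows the convention used here (the 1/N factor on the forward transform), which differs from the convention in some FFT libraries; it is meant only as a numerical illustration.

import numpy as np

def dft(x):
    """Forward DFT with the 1/N factor, as in the definition above."""
    N = len(x)
    n = np.arange(N)
    return np.array([(x * np.exp(-2j * np.pi * k * n / N)).sum() / N for k in range(N)])

def idft(X):
    """Inverse DFT matching the forward definition above (no 1/N factor here)."""
    N = len(X)
    k = np.arange(N)
    return np.array([(X * np.exp(2j * np.pi * k * n / N)).sum() for n in range(N)])

x = np.random.default_rng(0).standard_normal(16)   # a real-valued test series
X = dft(x)
print("round-trip error:", float(np.max(np.abs(idft(X) - x))))   # numerically tiny
print("magnitude of first bins:", np.abs(X[:3]))
print("phase of first bins:", np.angle(X[:3]))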
The Nyquist Criterion and Sampling Theorem
The sampling theorem (often called "Shannon's sampling theorem") states that a continuous signal must be sampled at a rate of at least twice the highest frequency present in the signal.
More precisely, a continuous function f(t) is completely defined by samples every 1/fs (fs
is the sample frequency) if the frequency spectrum F(f) is zero for f > fs/2. Fs/2 is called
the Nyquist frequency and places the limit on the minimum sampling frequency when
digitizing a continuous signal.
Normally the signal to be digitized would be appropriately filtered before sampling to
remove higher frequency components. If the sampling frequency is not high enough the
high frequency components will wrap around and appear in other locations in the discrete
spectrum, thus corrupting it.
The key features and consequences of sampling a continuous signal can be shown
graphically as follows.
Consider a continuous signal in the time and frequency domain.
Figure 2.8 (a) Fourier transform (continuous)
Sample this signal with a sampling frequency fs; the time between samples is 1/fs. This is equivalent to convolving the spectrum in the frequency domain with a delta-function train with a spacing of fs.
Figure 2.8 (b) Fourier transform (Discrete)
If the sampling frequency is too low, the replicated frequency spectra overlap and the result becomes corrupted. This is called aliasing.
Figure 2.8 (c) Aliasing
Another way to look at this is to consider a sine function sampled twice per period
(Nyquist rate). There are other sinusoid functions of higher frequencies that would give
exactly the same samples and thus can't be distinguished from the frequency of the
original sinusoid.
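The following small numerical example illustrates this point: a 7 kHz sine sampled at 10 kHz (below its Nyquist rate of 14 kHz) produces the same samples, up to a sign, as a 3 kHz sine. The frequencies are arbitrary illustrative values.

import numpy as np

fs = 10_000                                  # sampling rate (Hz)
n = np.arange(32)                            # sample indices
high = np.sin(2 * np.pi * 7_000 * n / fs)    # 7 kHz tone, above fs/2 = 5 kHz
alias = np.sin(2 * np.pi * 3_000 * n / fs)   # its alias at fs - 7 kHz = 3 kHz

# The 7 kHz tone yields exactly the sample values of an inverted 3 kHz tone,
# so the two cannot be distinguished after sampling.
print("max sample difference:", float(np.max(np.abs(high + alias))))   # ~0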
CLASSICAL THEORY OF TIMBRE
An overview of timbre definitions and the theory behind them is provided here because timbre modifications are possible in this research project. Chapter 3 contains details of creating various sound effects through such modifications.
International Standards Organization & American National Standards Institute:
"Timbre is that attribute of auditory sensation in terms of which a listener can judge that
two sounds similarly presented and having the same loudness and pitch are dissimilar."
DIMENSIONS OF TIMBRE
A considerable amount of effort has been devoted to finding the perceptual dimensions of timbre, the 'color' of a sound. Often these studies have involved multidimensional scaling experiments, in which a set of sound stimuli is presented to subjects, who then rate their similarity or dissimilarity. On the basis of these judgments a low-dimensional space that best accommodates the similarity ratings is constructed, and a perceptual or acoustic interpretation is sought for these dimensions.
Two of the main dimensions described in these experiments have usually been spectral
centroid and rise time. The first measures the spectral energy distribution in the steady
state portion of a tone, which corresponds to perceived brightness. The second is the time
between the onset and the instant of maximal amplitude.
The psychophysical meaning of the third dimension has varied, but it has often related to
temporal variations or irregularity in the spectral envelope. These available results
provide a good starting point for the search of features to be used in musical instrument
recognition systems [21]. Since this research project does not require a detailed treatment of timbre theory except where modifications are concerned, timbre is not discussed further here. In a pure sinusoidal model, one has access to each individual frequency component in the form of a "track"; a track can be modified, and new timbres can be created. In this project, there are opportunities for modifying tracks to create partial timbre modifications. Examples of this kind are provided in the third chapter.
AN OVERVIEW OF SOUND-SYNTHESIS TECHNIQUES
When generating musical sound on a digital computer, it is important to have a good
model whose parameters provide a rich source of meaningful sound transformations.
Three basic model types are in prevalent use today for musical sound generation:
instrument models, spectrum models, and abstract models. Instrument models attempt
to parametrize a sound at its source, such as a violin, clarinet, or vocal tract. Spectrum
models attempt to parametrize a sound at the basilar membrane of the ear, discarding
whatever information the ear seems to discard in the spectrum. Abstract models, such as
FM, attempt to provide musically useful parameters in an abstract formula. The following
passages will be an overview of widely used sound synthesis techniques for music
purposes.
Additive Synthesis
The philosophy behind all sound-synthesis methods is Euler's idea that any physical sound we hear can be broken down into a collection of sinusoids, viz. sine waves and cosine waves. These building blocks can be subjected to mathematical operations, and desired sounds can be synthesized from them. This is exactly what the Fourier transform exploits in its time-to-frequency mapping. Additive synthesis, one of the oldest and most heavily researched synthesis techniques, is likewise based on the summation of elementary waveforms to create more complex waveforms. This technique is regarded as the most powerful and flexible spectral modeling technique. It was among the first synthesis techniques in computer music and was described extensively in the very first article of the very first issue of the Computer Music Journal. Additive synthesis assumes that any periodic waveform can be modeled as a sum of sinusoids with time-varying amplitude envelopes and time-varying frequencies. Basically, it puts together a number of different wave components, which can be partials or harmonics, to arrive at a particular sound. The figure below shows additive synthesis summing sinusoids to form specific waveforms. Additive synthesis allows more control than any other kind of synthesis, as it permits fine control over individual frequency components. Moreover, a synthesis may interpolate between the frequency spectra of two or more different sounds. Additive synthesis is more effective at modeling steady-state sounds than portions of sound such as the transients in the attack part of the sound.
Figure 2.9: Additive Synthesis
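A minimal additive-synthesis sketch follows, assuming simple linear amplitude envelopes for each partial; the partial frequencies and envelope values are illustrative only and do not correspond to any analyzed instrument.

import numpy as np

def additive(partials, duration, fs=44100):
    """Sum sinusoidal partials; each partial is (frequency_hz, start_amplitude, end_amplitude)."""
    t = np.arange(int(duration * fs)) / fs
    out = np.zeros_like(t)
    for freq, a0, a1 in partials:
        envelope = np.linspace(a0, a1, len(t))      # simple linear time-varying amplitude
        out += envelope * np.sin(2 * np.pi * freq * t)
    return out / max(len(partials), 1)              # crude normalization

# A tone built from three decaying harmonics of 220 Hz (illustrative values only).
tone = additive([(220, 1.0, 0.6), (440, 0.7, 0.2), (660, 0.4, 0.1)], duration=1.0)
print(tone.shape, float(np.max(np.abs(tone))))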
The phase factor: Phase is a trickster. Depending on the context, it may or may not be a significant factor in additive synthesis. For example, if one changes the starting phases of the frequency components of a fixed waveform and resynthesizes the tone, the change makes no difference to the listener, yet it may have a significant effect on the visual appearance of the waveform. Phase relations become apparent in the perception of the brilliant but short life of attacks, grains, and transients. The ear is also sensitive to phase relationships in complex sounds where the phases of certain components shift over time.
Addition of partials is limited in that it succeeds only in creating a more interesting fixed-waveform sound. Since the spectrum in fixed-waveform synthesis is constant over the course of a note, partial addition can never accurately reproduce the sound of an acoustic instrument; it approximates only the steady-state portion of an instrumental tone. Research has shown that the attack portion of a tone, where the frequency mixture changes on a millisecond-by-millisecond timescale, is far more useful for identifying traditional instrument tones than the steady-state portion. In any case, a time-varying timbre is usually more tantalizing to the ear than a constant spectrum (Grey 1975) [9].
Time varying Additive synthesis
By changing the mixture of sine waves over time, one obtains more interesting synthetic timbres and more realistic instrumental tones. In the trumpet note shown in the figure below, it takes 12 sine waves to reproduce the initial attack portion of the event; after 300 ms, only three or four sine waves are needed. [9]
Figure 2.10: The amplitude progression of the partials of a trumpet tone
Subtractive synthesis
Subtractive synthesis implies the use of filters to shape the spectrum of a sound source by
subtracting unwanted partials of its spectrum, while favoring the resonation of others. As
the source signal passes through a filter, the filter boosts or attenuates selected regions of
the frequency spectrum. If the original source is spectrally rich and the filter is flexible,
subtractive synthesis can shape close approximations of many natural sounds, as well as a
wide variety of new and unclassified timbres. This technique has been used successfully
to model percussion-like instruments and the human voice. [9]
Subtractive synthesis is often referred to as analogue synthesis because most analogue
synthesizers (i.e., non-digital) use this method of generating sounds. In its most basic
form, subtractive synthesis is a very simple process as follows:
OSCILLATOR ---------> FILTER ---------> AMPLIFIER
• An Oscillator is used to generate a suitably bright sound. This is routed through a
Filter.
• A Filter is used to cut-off or cut-down the brightness to something more suitable.
This resultant sound is routed to an Amplifier.
• An Amplifier is used to control the loudness of the sound over a period of time so as to emulate a natural instrument.
A filter can be literally any operation on a signal (Rabiner et al. 1972), but the most common use of the term describes devices that boost or attenuate regions of a sound spectrum, and that is the usage here. Filters can be implemented by one of these methods:
• Delaying a copy of an input signal slightly and combining the delayed input
signal with the new input signal.
• Delaying a copy of the output signal and combining it with the input signal. [9]
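The two delay-based structures just listed correspond, respectively, to a feed-forward (FIR) filter and a feedback (IIR) filter. The sketch below shows a one-sample-delay version of each; the gain value is an arbitrary example.

import numpy as np

def fir_one_zero(x, g=0.5):
    """y[n] = x[n] + g * x[n-1]: a delayed copy of the input combined with the new input."""
    y = x.astype(float).copy()
    y[1:] += g * x[:-1]
    return y

def iir_one_pole(x, g=0.5):
    """y[n] = x[n] + g * y[n-1]: a delayed copy of the output fed back and combined with the input."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - 1] if n > 0 else 0.0)
    return y

impulse = np.zeros(8)
impulse[0] = 1.0
print("FIR impulse response:", fir_one_zero(impulse))   # finite: 1, 0.5, 0, 0, ...
print("IIR impulse response:", iir_one_pole(impulse))   # decaying: 1, 0.5, 0.25, ...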
Chapter 3, which contains the research project, involves a great deal of band-elimination and band-pass filtering. The bandwidth is a measure of the selectivity of a filter and is equal to the difference between the upper and lower cutoff frequencies. The response of a band-pass filter is often described by terms such as sharp (narrow) or broad (wide), depending on the actual width. The passband sharpness is often quantified by means of the quality factor (Q). When the cutoff frequencies are defined at the -3 dB points, Q is given by
Q = f0/BW
where BW is the bandwidth [25]. Therefore a high Q denotes a narrow bandwidth. Bandwidth may also be described as a percentage of the center frequency.
SPECTRUM MODELING SYNTHESIS
The proposed coder in this research synthesizes the signals to which the ear is less sensitive. While many synthesis techniques are available, sinusoidal modeling synthesis is preferred here because sound can be modeled as a set of sinusoids; moreover, the sinusoidal model can sometimes be used in denoising applications.
The main advantage of this group of techniques is the existence of analysis procedures
that extract the synthesis parameters out of real sounds, thus being able to reproduce and
modify actual sounds. Our particular approach is based on modeling sounds as stable
sinusoids (partials) plus noise (residual component), thereby analyzing sounds with this
model and generating new sounds from the analyzed data. The analysis procedure detects
partials by studying the time-varying spectral characteristics of a sound and represents
them with time-varying sinusoids. These partials are then subtracted from the original
sound, and the remaining "residual" is represented as a time-varying filtered white noise
component. The synthesis procedure is a combination of additive synthesis for the
sinusoidal part and subtractive synthesis for the noise part.
This analysis/synthesis strategy can be used either for generating sounds (synthesis) or for transforming pre-existing ones (sound processing). To synthesize sounds we generally want to model an entire timbre family (i.e., an instrument), and that can be done by analyzing single tones and isolated note transitions performed on an instrument and building a database that characterizes the whole instrument or any desired timbre family, from which new sounds are synthesized. In the case of the sound-processing application, the goal is to manipulate any given sound, that is, without being restricted to isolated tones and without requiring a previously built database of analyzed data. [2]
Some of the intermediate results from this analysis/synthesis scheme, and some of the
techniques developed for it, can also be applied to other music-related problems, e.g.,
sound compression, sound-source separation, musical acoustics, music perception, and
performance analysis.
The Deterministic Plus Stochastic Model
A sound model assumes certain characteristics of the sound waveform or the sound-
generation mechanism. In general, every analysis/synthesis system has an underlying
model. The sounds produced by musical instruments, or by any physical system, can be
modeled as the sum of a set of sinusoids plus a noise residual. The sinusoidal, or
deterministic, component normally corresponds to the main modes of vibration of the
system. The residual comprises the energy produced by the excitation mechanism that is
not transformed by the system into stationary vibrations plus any other energy component
that is not sinusoidal in nature. For example, in the sound of wind-driven instruments, the
deterministic signal is the result of the self-sustained oscillations produced inside the
bore, and the residual is a noise signal that is generated by the turbulent streaming that
takes place when the air from the player passes through the narrow slit. In the case of
bowed strings, the stable sinusoids are the result of the main modes of vibration of the
strings, and the noise is generated by the sliding of the bow against the string, plus by
other non-linear behavior of the combined bow-string-resonator system. This type of
separation can also be applied to vocal sounds, percussion instruments and even to non-
musical sounds produced in nature.
A deterministic signal is traditionally defined as anything that is not noise (i.e., an analytic, perfectly predictable part, predictable from measurements over any continuous interval). However, in the present discussion the class of deterministic signals considered is restricted to sums of quasi-sinusoidal components (sinusoids with slowly varying amplitude and frequency). Each sinusoid models a narrowband component of the original sound and is described by an amplitude function and a frequency function.
A stochastic signal is fully described by its power spectral density, which gives the
expected signal power versus frequency. When a signal is assumed stochastic, it is not
necessary to preserve either the instantaneous phase or the exact magnitude details of
individual FFT frames.
Therefore, the input sound is modeled by
s(t) = Σ_{r=1}^{R} A_r(t) cos[θ_r(t)] + e(t)
where A_r(t) and θ_r(t) are the instantaneous amplitude and phase of the r-th sinusoid, respectively, and e(t) is the noise component at time t (in seconds).
The model assumes that the sinusoids are stable partials of the sound and that each one has a slowly changing amplitude and frequency. The instantaneous phase is then taken to be the integral of the instantaneous frequency ω_r(t), and therefore satisfies
θ_r(t) = ∫_0^t ω_r(τ) dτ
where ω_r(t) is the frequency in radians per second and r is the sinusoid number.
By assuming that e(t) is a stochastic signal, it can be described as filtered white noise,
e(t) = ∫_0^t h(t, τ) u(τ) dτ
where u(τ) is white noise and h(t, τ) is the response of a time-varying filter to an impulse at time t. That is, the residual is modeled by the convolution of white noise with a time-varying, frequency-shaping filter. [2]
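As a toy illustration of this deterministic-plus-stochastic model (not the analysis system itself), the sketch below renders a few slowly varying partials, obtaining each phase by accumulating the instantaneous frequency, and adds low-pass-filtered white noise as the residual; every parameter value is an arbitrary example.

import numpy as np

fs = 44100
t = np.arange(int(0.5 * fs)) / fs

# Deterministic part: a few partials with slowly varying amplitude and frequency.
deterministic = np.zeros_like(t)
for base_freq, amp in [(220.0, 0.6), (440.0, 0.3), (661.0, 0.15)]:
    freq = base_freq * (1.0 + 0.002 * np.sin(2 * np.pi * 3 * t))   # slight vibrato
    phase = 2 * np.pi * np.cumsum(freq) / fs                       # running integral of the frequency
    deterministic += amp * np.exp(-2.0 * t) * np.cos(phase)        # slowly decaying amplitude

# Stochastic part: white noise shaped by a simple smoothing (low-pass) filter.
noise = np.random.default_rng(0).standard_normal(len(t))
kernel = np.ones(32) / 32
stochastic = 0.05 * np.convolve(noise, kernel, mode="same")

signal = deterministic + stochastic
print("peak amplitude of the modeled signal:", float(np.max(np.abs(signal))))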
Analysis/Synthesis Process: Sinusoids + Noise
The deterministic plus stochastic model has many possible implementations, and we will
present a general one while giving indications on variations that have been proposed.
Both the analysis and synthesis are frame-based processes with the computation done one
frame at a time. Throughout this description, we will consider that we have already
processed a few frames of the sound and we are ready to compute the next one.
Figure 2.11: Block diagram of the analysis process ([2])
The figure above shows the block diagram for the analysis. First, we prepare the next
section of the sound to be analyzed by multiplying it with an appropriate analysis
window. Its spectrum is obtained by the Fast Fourier Transform (FFT), and the prominent
spectral peaks are detected and incorporated into the existing partial trajectories by means
of a peak-continuation algorithm. The relevance of this algorithm is that it detects the
magnitude, frequency, and phase of the partials present in the original sound (the
deterministic component). When the sound is pseudo-harmonic, a pitch-detection step
can improve the analysis by using the fundamental frequency information in the peak
continuation algorithm and in choosing the size of the analysis window (pitch-
synchronous analysis). [2]
The stochastic component of the current frame is calculated by first generating the
deterministic signal with additive synthesis and then subtracting it from the original
waveform in the time domain. This is possible because the phases of the original sound
are matched, and therefore the shape of the time-domain waveform is preserved. The
stochastic representation is then obtained by performing a spectral fitting of the residual
signal.
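The residual computation can be pictured with the toy example below, in which the partial parameters are simply assumed known rather than estimated by the peak-continuation step; subtracting the resynthesized partial in the time domain leaves a residual whose level is close to the added noise.

import numpy as np

fs = 44100
t = np.arange(int(0.2 * fs)) / fs
rng = np.random.default_rng(1)

# An "original" consisting of one stable partial plus low-level noise.
original = 0.8 * np.cos(2 * np.pi * 440 * t + 0.3) + 0.02 * rng.standard_normal(len(t))

# Pretend the analysis stage estimated this partial's parameters (assumed here, not estimated).
est_amp, est_freq, est_phase = 0.8, 440.0, 0.3
deterministic = est_amp * np.cos(2 * np.pi * est_freq * t + est_phase)

residual = original - deterministic     # time-domain subtraction leaves the stochastic part
print("residual RMS:", float(np.sqrt(np.mean(residual ** 2))))   # close to the 0.02 noise level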
Figure 2.12: Block diagram of the synthesis process [2]
The deterministic signal, i.e., the sinusoidal component, results from the magnitude and frequency trajectories, or their transformation, by generating a sine wave for each trajectory (i.e., additive synthesis). This can be implemented either in the time domain with the traditional oscillator-bank method or in the frequency domain using the inverse-FFT approach.
The synthesized stochastic signal is the result of generating a noise signal with the time-varying spectral shape obtained in the analysis (i.e., subtractive synthesis). As with the deterministic synthesis, it can be implemented in the time domain by a convolution or in the frequency domain by creating a complex spectrum (i.e., magnitude and phase spectra) for every spectral envelope of the residual and performing an inverse FFT. [2]
M-Q Synthesis
The spectrum-modeling synthesis described above is mainly targeted toward musical
signals. A similar application but a slightly different algorithm was proposed by Robert
McAulay and Thomas Quatieri for voice and speech signals [5, 26].
In 1986, Robert McAulay and Thomas Quatieri proposed a new method of
analysis/synthesis for discrete-time speech signals that attempted to develop a
reconstruction process that would result in a best possible approximation of the original
signal. They modeled speech signals as two components. The first was an excitation
signal which consisted of a sum of sinusoids with time-varying amplitudes and
frequencies, as well as an initial phase offset. The second component is the vocal tract, modeled as a time-variant filter with time-varying magnitudes and phases. These
two components are combined and expressed as
s(t) = Σ_{l=1}^{L(t)} A_l(t) e^{jΨ_l(t)}
where A_l(t) combines the time-varying magnitude response of the vocal tract and the amplitude of the excitation signal, and the phase Ψ_l(t) of the exponential includes the time-varying phase of the vocal tract as well as the initial phase offset of the excitation signal.
Figure 2.13: McAulay-Quatieri Sinusoidal Analysis-Synthesis system: [5]
To find expressions for these sinusoids, they derived a new technique to analyze the
signal. Using overlapping-windowing methods similar to standard short-time analysis,
the MQ method computes Fourier transforms of the individual windows. The peak
frequencies of each window (the partials) are found, and their amplitudes and phases are
extracted. The partials for each window are linked to those in the following window to
develop a trend in the progression of frequencies (their amplitude and phases). We call
each progression a track. The birth of a track occurs when there is no partial in the previous window with which to connect a peak in the current window. Conversely, the death of a track occurs when there is no partial in the following window with which to connect one in the current window.
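A highly simplified peak-continuation sketch is given below: each peak in the current frame is linked to the nearest unmatched peak in the previous frame within a frequency tolerance, and unmatched peaks yield births and deaths. The tolerance and matching rule are illustrative simplifications of the published algorithms.

def continue_tracks(prev_peaks, curr_peaks, max_hz_jump=30.0):
    """Link peaks of two adjacent frames; prev_peaks and curr_peaks are lists of frequencies in Hz."""
    links = []                                   # (previous index or None, current index)
    unmatched_prev = set(range(len(prev_peaks)))
    for j, f in enumerate(curr_peaks):
        candidates = [i for i in unmatched_prev if abs(prev_peaks[i] - f) <= max_hz_jump]
        if candidates:
            i = min(candidates, key=lambda i: abs(prev_peaks[i] - f))
            links.append((i, j))                 # continuation of an existing track
            unmatched_prev.discard(i)
        else:
            links.append((None, j))              # birth: no matching partial in the previous frame
    deaths = sorted(unmatched_prev)              # death: no continuation in the current frame
    return links, deaths

# 440 and 1320 Hz continue, 2000 Hz is born, and the 880 Hz track dies.
print(continue_tracks([440.0, 880.0, 1320.0], [443.0, 1325.0, 2000.0]))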
Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]
The MQ model gives outstanding results and reproduces signals that are audibly indistinguishable from the originals when applied to a wide variety of quasi-harmonic sounds. Perhaps its greatest advantage is the small amount of data required. To reproduce a signal using standard Fourier techniques, a great many coefficients must be retained; to reconstruct perfectly, an infinite number would be needed. With the MQ method, only the information about several time-varying sinusoids must be stored, and little else.
One of the flaws in the MQ method is how it represents noise. Noise shows up as tracks
that span only a small number of windows. It is difficult to represent these short tracks
using sinusoids, so other methods must be developed (see section entitled Bandwidth-
Enhancement).
In the analysis stage, the amplitudes, frequencies, and phases of the model are estimated
on a frame-by-frame basis, while in the synthesis stage these parameter estimates are
interpolated to allow for continuous evolution of parameters at all the sample points
between the frame boundaries.
The Sine Wave Speech Model
In the speech production model, the speech waveform s(t) is assumed to be the output of passing a vocal cord (glottal) excitation waveform through a linear system representing the characteristics of the vocal tract. The excitation function is usually represented as a periodic pulse train during voiced speech, where the spacing between consecutive pulses corresponds to the pitch of the speaker. Alternatively, the binary voiced/unvoiced excitation model can be replaced by a sum of sine waves.
Figure 2.15: Peak detection in the MQ approach [5]
The motivation for this sine-wave representation is that voiced excitation, when perfectly
periodic, can be represented by a Fourier series decomposition in which each harmonic
component corresponds to a single wave. Passing this sine wave representation of the
excitation through the time-varying vocal tract results in the sinusoidal representation of
the speech waveform, which, on a given analysis frame is described by
s(n) = Σ_{l=1}^{L} A_l cos(ω_l n + φ_l)
where A_l and φ_l represent the amplitude and phase of each sine wave component associated with the frequency track ω_l, and L is the number of sine waves. [5]
Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
Spectral Models Related to the Sinusoidal Model:
Additive synthesis is a traditional sound synthesis method that is very close to the
sinusoidal model. It has been used in electronic music for several decades [Roads 1995].
Like the sinusoidal model, it represents the original signal as a sum of sinusoids with
time-varying amplitudes, frequencies, and phases [Moorer 1985]. However, it does not differentiate between harmonic and inharmonic components. Representing non-harmonic components requires a very large number of sinusoids, so the best results are obtained for harmonic input signals. Vocoders are another group of spectral models. They represent
the input signal at multiple parallel channels, each of which describes the signal at a
particular frequency band. Vocoders simplify the spectral information and therefore
reduce the amount of data. The Phase vocoder is a special type of vocoder that uses a
complex short-time spectrum, thus preserving the phase information of the signal. The
phase vocoder is implemented with a set of band pass filters or with a short-time Fourier
transform. The phase vocoder allows time and pitch scale modifications, like the
sinusoidal model does [4].
The sinusoidal model was originally proposed by McAulay-Quatieri for speech coding
[5] purposes and by Smith and Serra [2, 11] [McAulay-Quatieri 1986; Smith & Serra
1987] for the representation of musical signals. Even though the systems were developed
independently, they were quite similar. Some parts of the systems such as the peak
detection were slightly different, but both systems had all the basic ideas needed for the
sinusoidal analysis and synthesis: the original signal was windowed into frames, and the
short-time spectrum was examined to obtain the prominent spectral peaks. The
frequencies, amplitudes and phases of the peaks were estimated and the peaks were
tracked into sinusoidal tracks. The tracks were synthesized using linear interpolation for
amplitudes and cubic polynomial interpolation for frequencies and phases.
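The per-sample interpolation between frame boundaries can be sketched as follows. For simplicity this example interpolates amplitude and frequency linearly and accumulates the interpolated frequency to obtain phase; the cited systems instead use cubic phase interpolation so that the measured phases are matched as well.

import numpy as np

def synth_segment(a0, a1, f0, f1, phase0, hop, fs=44100):
    """Render `hop` samples of one partial between two analysis frames."""
    n = np.arange(hop)
    amp = a0 + (a1 - a0) * n / hop                     # linear amplitude interpolation
    freq = f0 + (f1 - f0) * n / hop                    # linear frequency interpolation
    phase = phase0 + 2 * np.pi * np.cumsum(freq) / fs  # accumulate frequency to get phase
    return amp * np.cos(phase), phase[-1]              # return the end phase to start the next hop

samples, end_phase = synth_segment(0.5, 0.4, 440.0, 442.0, phase0=0.0, hop=512)
print(samples.shape, float(end_phase))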
Serra [1989] was the first to decompose the signal into deterministic and stochastic parts, adding a stochastic model to the sinusoidal model. Since then, this decomposition has been used in several systems. The majority of noise-modeling systems use one of two approaches: the spectrum is characterized either by a time-varying filter or by the short-time energies within certain frequency bands [3].
Pitch-Synchronous Analysis
The estimation of the sinusoidal modeling parameters is a difficult task in general. Most
of the problems are related to the analysis window length. If the input signal is
monophonic or consists of harmonic voices that do not overlap in time, it is advantageous
to synchronize the analysis window length to the fundamental frequency of the sound.
Usually, the frequencies of the harmonic components of voiced sounds are integral
multiples of the fundamental frequency. The advantage of the pitch-synchronous analysis
is most easily seen in the frequency domain: the frequencies of the harmonic components
correspond exactly to the frequencies of the DFT coefficients. The estimation of the
parameters is very easy, since no interpolation is needed, and the amplitudes and phases
can be obtained directly from the complex spectrum. Also, pitch-synchronous analysis
allows the use of window lengths as small as one period of the sound, while non-
synchronized windows must be 2-4 times the period depending on the estimation method.
This means that a much better time resolution is gained by using the pitch-synchronous
analysis.
Unfortunately, pitch-synchronous analysis cannot be used when several sounds with different fundamental frequencies occur simultaneously. In general, monophonic recordings represent only a small minority of musical signals, and therefore pitch-synchronous analysis typically cannot be applied. To keep the complexity of the system low, pitch-synchronous analysis was not included in our system.
Adaptive window length has been successfully used in modern audio coding systems, but
in a quite different manner: a long window is used for stationary parts of the signal, and
when rapid changes occur, the window is switched into a shorter one. This enables good
frequency resolution for the stable parts and a good time resolution in rapid changes.
Bandwidth-Enhanced Sinusoidal Modeling
The Reassigned Bandwidth-Enhanced Method [15], developed by Kelly Fitz, resolves the noise-modeling problems associated with the MQ method. Using the MQ method, signals are represented by a collection of sinusoidal components, and the peaks in the spectrum of each window are linked together (short-time analysis). If the signal being represented has clear, prominent peaks or a consistent trend in the frequencies from window to window, this analysis provides an accurate reconstruction. However, if a signal has significant energy outside the peaks, or very high-frequency noise, the MQ method does not represent the signal adequately. Such signals are said to be noisy, and the energy that cannot be represented is called noisy energy because it consists of frequencies with rapidly varying amplitude.
Figure 2.17: Lemur
These types of signals require many sinusoids to be represented adequately. The sinusoids that do represent them become tracks of short-duration partials with rapidly varying amplitudes and frequencies. It is difficult to distinguish noisy tracks caused by unwanted external noise from the short, jittery tracks that belong to the desired sound representation; the sinusoidal model does not provide a way of distinguishing noisy components from deterministic components. In addition, the representation of such a noisy signal is very fragile: time and frequency manipulation changes phase, which destroys the properties of the sound and introduces errors in the reconstructed signal.
Figure 2.18: Lemur Graphical tool
To provide a better way of representing noise, the Reassigned Bandwidth-Enhanced
Method uses Bandwidth-Enhanced Oscillators, which spread spectral energy away from
the partial’s center frequency. The partial’s energy is increased while the bandwidth also
increases relative to its spectral amplitude. The center frequency stays the same so that
frequency is spread evenly on both sides. By removing the noisy tracks and increasing
the bandwidth of neighboring tracks, the energy in the signal is conserved and a closer
representation to the original signal can be constructed.
Figure 2.19: Bandwidth-Enhanced sinusoidal modeling
These Bandwidth-Enhanced Oscillators can now be used to synthesize a sound signal from components that have varying frequencies, amplitudes, and concentrations of noise and sinusoidal energy. A greater variety of sounds can be represented with greater accuracy while still using the sinusoidal model, with longer, better-defined tracks. Bandwidth-Enhanced partials allow us to manipulate noise representations without discarding the desired noise content. The Bandwidth-Enhanced sinusoidal model thus provides an appropriate representation of noise and a way to distinguish the non-sinusoidal noise that must be removed.
PHASE SYNTHESIS
A summary of the importance of phase in audibility is given here; in this research we examine the importance of the phase parameters in the high-frequency region of the spectrum. Models of pitch perception found in the literature often discard the phase of the frequency components. Such models contradict time-domain models, in which the pitch of a complex tone is given as a function of the time interval between peaks in the waveform in "some dominant region of the basilar membrane." To verify whether the relative phase of the harmonics of a complex tone is important to the perception of pitch, Moore conducted a number of experiments. It was concluded that phase did in fact have an effect on perceived pitch in some cases; most often, however, it only affected the strength of the perceived pitch. Cariani and Delgutte later verified this.
Terhardt considered models of the pitch of complex tones in which it is assumed that only the frequency spectrum of the stimulus is important in determining pitch, the relative phase of the frequency components being irrelevant. In the words of Schouten, the pitch of a complex tone is given by the reciprocal of the time interval between corresponding peaks in the fine structure of the waveform evoked at some dominant region of the basilar membrane. The fine structure of a waveform may be influenced by changes in the relative phase of the components, and thus under some circumstances pitch ought to be affected by relative phase. Bennen said that relative phase can affect pitch, but that the effect is not mediated by changes in the temporal structure of the waveform. According to the temporal model, two types of change in pitch perception might occur with changes in the relative phase of the components: a change in pitch value, and a change in the clarity of pitch. [17]
"The frequency-domain representation of periodic sounds was studied by the scientists
Ohm, Helmholtz, and Hermann in particular. Ohm stated that changes in the phase
spectrum, although they altered the wave shape, did not affect its aural effect. Helmholtz
developed a method of harmonic analysis with acoustic resonators. According to these
studies, the ear is phase-deaf, and timbre is determined exclusively by the spectrum. Such
conclusions are still considered essentially valid for periodic sounds only because Fourier
series analysis-synthesis works only for those.
It was Ohm who first postulated in the early nineteenth century that the ear was, in
general, phase deaf. This view was a gross simplification. There are many instances in
which phase plays an extremely important role in the perception of sound stimuli. In fact,
it was Helmholtz who noted that Ohm's law didn't hold for simple combinations of pure
tones. However, for non-simple tones Ohm's law seems to be well supported by psycho-
acoustic literature. The importance of phase in perceiving musical sounds was
demonstrated by Clark, who clearly showed that in the absence of phase information,
acoustic waveforms sounded unrealistic.
Effect of phase on the timbre
One of the dimensions that govern the quality, or timbre, of a musical instrument is the directionality of the sound. Directionality, whether binaural or monaural, is determined by the phase of the sound signal. The literature of the past generally pays little attention to the phase of the signal; considering its directional and spatialization properties, phase plays a major role, but when we are prepared to ignore some dimensions while synthesizing, phase does not appear to be of much importance.
Changes in timbre are not distinct enough to be observed after the few seconds required to alter the phases; in any case these changes are too small to carry over from one vowel to another. Harmonics beyond the sixth to eighth give dissonances and beats, so it is not excluded that, for these higher harmonics, a phase effect exists. The maximum effect of phase on timbre is the difference between a complex tone in which the harmonics are in phase and one in which alternate harmonics differ in phase by 90°. The effect of lowering each successive harmonic by 2 dB is greater than the maximum phase effect described above. The effect of phase on timbre appears to be independent of sound level and of the spectrum.
Phase is perceptually important in many situations. However, it remains an open question to what extent phase matters when modeling naturally occurring sounds with spectral sound models. For non-periodic sound parts such as transients, the phase of the frequency components is of greater importance to the perception of the sound than it is for steady parts. [10]
The sound synthesized using the STFT is
s(t) = Σ_{n=0}^{k} a_n(t) cos[θ_n(t)]
where k is the number of partials, a_n(t) is the time-varying amplitude, and θ_n(t) is the time-varying phase.
For synthesis without phase information, θ_n(t) is simply obtained by integrating the measured frequency values over time [24]:
θ_n(t) = ∫_0^t ω_n(τ) dτ
Phase is an important parameter when performing additive analysis/synthesis of binaural
recordings. If the phase is left out, the ability to perceive spatial qualities of sounds is
substantially degraded. The phase is important in all incident positions, except
front/back, whereas spectral envelope is mainly influential in the lateral positions.
Perceptually important cues for use in localization are expected to be less present in
sounds synthesized without phase than with phase information. [24]
Perceptual Coding vs. the Synthesis-Based Approach
Perceptual coding is a digital audio coding technique that reduces the amount of data
needed to produce high-quality sound. Perceptual digital audio coding takes advantage of
the fact that the human ear screens out a certain amount of sound that is perceived as
noise. Reducing, eliminating, or masking this noise significantly reduces the amount of
data that needs to be provided. With perceptual coding of the record, physical identity is
waived in favor of perceptual identity. Using a psychoacoustical model of the human
auditory system, the codec identifies imperceptible signal content (to remove irrelevancy)
as bits are allocated. The signal is then coded efficiently (to avoid redundancy) in the
final bit stream. These steps reduce the quantity of data needed to represent an audio
signal. The intent is to hide quantization noise below signal-dependent thresholds of
hearing and then code as efficiently as possible. The method asks how much noise can
be introduced to the signal without becoming audible.
In the view of many observers, compared to new perceptual coding methods, pulse-code modulation is a powerful but inefficient dinosaur; because of its appetite for bits, PCM coding is limited in its usefulness. The desire to achieve lower bit rates through perceptual coding is appealing because it opens new applications for digital audio (and video) with acceptable signal degradation. Through psychoacoustics, we can understand how information is perceived by the ear [8].
Masking and Perceptual Coding
The world presents us with a multitude of sounds simultaneously. We automatically
accomplish the task of distinguishing each of the sounds and attending to the ones of
greatest importance. It is often difficult to hear one sound when a much louder sound is
present. This process seems intuitive, but on the psychoacoustic and cognitive levels it
becomes very complex. The term for this process is masking. In order to gain a broad
and thorough understanding of masking phenomenon, we can survey the definition and
its accompanying explanation from several views. Masking as defined by the American
Standards Association (ASA) is the amount (or the process) by which the threshold of
audibility for one sound is raised by the presence of another (masking) sound. For
example, a loud car stereo could mask the car's engine noise. The term was originally
borrowed from studies of vision, meaning the failure to recognize the presence of one
stimulus in the presence of another at a level normally adequate to elicit the first
perception.
The purpose of any data-reduction system is to decrease the data rate, the product of the
sampling frequency and the word length. This can be accomplished by decreasing the
sampling frequency; however, the Nyquist theorem dictates a corresponding decrease in
high-frequency audio bandwidth. Another approach uniformly decreases the word
length; however, this reduces the dynamic range of the audio signal by 6 dB/bit, thus
increasing the quantization noise. A more enlightened approach uses psychoacoustics.
Perceptual coders maintain sampling frequency but selectively decrease word length;
word-length reduction is done dynamically based on signal conditions [8].
Perceptual coders analyze the frequency and amplitude content of the input signal and
compare it to a model of human auditory perception. Using the model, the encoder
removes the irrelevancy and statistical redundancy of the audio signal. In theory,
although the method is lossy, the human perceiver will not hear degradation in the
decoded signal. Considerable data reduction is possible. For example, a perceptual coder
might reduce a channel’s bit rate from 768 kbps to 128 kbps; a word length of 16
bits/sample is reduced to an average of 2.67 bits/sample, and the data quantity is reduced
by about 83%. A well-designed perceptually coded recording, with a conservative level
of reduction, can rival the sound quality of a conventional recording because the data is
coded in a much more intelligent fashion, and quite simply, because we do not hear all of
what is recorded anyway. Perceptual coders are so efficient that they require only a
fraction of the data needed by a conventional system.
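The arithmetic behind these figures is straightforward; the short sketch below reproduces the 768 kbps example quoted above for a 16-bit, 48 kHz mono channel coded at 128 kbps.

def data_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Data rate in kilobits per second: sampling frequency times word length times channels."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

pcm = data_rate_kbps(48_000, 16)     # 768 kbps for one 16-bit, 48 kHz channel
coded = 128.0                        # coded rate in kbps, as in the example above
print("reduction ratio:", pcm / coded)                    # 6.0, i.e. a 6:1 coder
print("average bits per sample:", 16 * coded / pcm)       # about 2.67
print("data saved: %.0f%%" % (100 * (1 - coded / pcm)))   # about 83%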
Part of this efficiency stems from the adaptive quantization used by most perceptual
coders. With PCM, all signals are given equal word lengths. Perceptual coders assign
bits according to audibility. A prominent tone is given a large number of bits to ensure
audible integrity. Conversely, fewer bits can be used to code soft tones. Inaudible tones
are not coded at all. Together, bit rate reduction is achieved. A coder’s reduction ratio is
the ratio of input bit rate to output bit rate. Reduction ratios of 4:1, 6:1, or 12:1 are
common. Perceptual coders have achieved remarkable transparency, so that in many
applications reduced data is audibly indistinguishable from linearly represented data.
Tests show that ratios of 4:1 or 6:1 can be transparent. [8]
Critical Bands
To determine this threshold of audibility, an experiment must be performed. A typical
masking experiment might proceed as follows. A short (about 400 ms) pulse of a 1,000 Hz sine wave acts as the target, the sound the listener is trying to hear. Another
sound, the masker, is a band of noise centered on the frequency of the target (the masker
could also be another pure tone). The intensity of the masker is increased until the target
cannot be heard. This point is then recorded as the masked threshold. Another way of
proceeding is to slowly widen the bandwidth of the noise without adding energy to the
original band. The increased bandwidth gradually causes more masking until a certain
point is reached, at which no more masking occurs. This bandwidth is called the critical
band [6]. We can keep extending the masker until it is full bandwidth white noise, and it
will have no more effect than at the critical band.
Bark Band | Center Frequency (Hz) | Critical Bandwidth (Hz) | Low Frequency Cutoff (Hz) | High Frequency Cutoff (Hz)
1 | 50 | -- | -- | 100
2 | 150 | 100 | 100 | 200
3 | 250 | 100 | 200 | 300
4 | 350 | 100 | 300 | 400
5 | 450 | 110 | 400 | 510
6 | 570 | 120 | 510 | 630
7 | 700 | 140 | 630 | 770
8 | 840 | 150 | 770 | 920
9 | 1000 | 160 | 920 | 1080
10 | 1170 | 190 | 1080 | 1270
11 | 1370 | 210 | 1270 | 1480
12 | 1600 | 240 | 1480 | 1720
13 | 1850 | 280 | 1720 | 2000
14 | 2150 | 320 | 2000 | 2320
15 | 2500 | 380 | 2320 | 2700
16 | 2900 | 450 | 2700 | 3150
17 | 3400 | 550 | 3150 | 3700
18 | 4000 | 700 | 3700 | 4400
19 | 4800 | 900 | 4400 | 5300
20 | 5800 | 1100 | 5300 | 6400
21 | 7000 | 1300 | 6400 | 7700
22 | 8500 | 1800 | 7700 | 9500
23 | 10500 | 2500 | 9500 | 12000
24 | 13500 | 3500 | 12000 | 15500
25 | 18775 | 6550 | 15500 | 22050
Table 2.1: Critical Bandwidth as a function of center frequency and critical band rate [8]
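For reference, the band edges in Table 2.1 can be used directly as a lookup, as in the sketch below; the mapping simply returns the critical-band number whose range contains a given frequency.

# Upper band-edge frequencies (Hz) taken from Table 2.1, bands 1 through 25.
BARK_EDGES_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
                 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500,
                 12000, 15500, 22050]

def bark_band(freq_hz):
    """Return the 1-based critical-band (Bark) number whose range contains freq_hz."""
    for band, upper in enumerate(BARK_EDGES_HZ, start=1):
        if freq_hz <= upper:
            return band
    return len(BARK_EDGES_HZ)          # clamp anything above 22.05 kHz to the top band

for f in (80, 440, 1000, 4000, 16000):
    print(f"{f} Hz -> Bark band {bark_band(f)}")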
Critical bands grow larger as we ascend the frequency spectrum. Conversely, we have
many more bands in the lower frequency range, because they are smaller there. Critical bands seem to be formed at some level by an auditory filter bank. Critical bands and their center
frequencies are continuous, as opposed to having strict boundaries at specific frequency
locations. Therefore, the filters must be easily variable. Use of the auditory filter bank
may be the unconscious equivalent of our willfully focusing on a specific frequency
range.
Non-Simultaneous Masking
The ASA definition of masking does not address non-simultaneous masking. Sometimes
a signal can be masked by a sound preceding it, called forward masking, or even by a
sound following it, called backward masking. Forward masking results from the
accumulation of neural excitation, which can occur for up to 200 msec. In other words,
neurons store the initial energy and cannot receive another signal until after they have
passed it, which may be up to 200 msec. Forward masking effects are slight because
maskers need to be within the same critical band and even then do not have the broad
masked audiograms of simultaneous masking. Likewise, backward masking only occurs
under tight tolerances.
Central Masking and Other Effects
Another way to approach masking is to question at what level it occurs. Studies in
cognition have shown that masking can occur at or above the point where audio signals
from the two ears combine. The threshold of a signal entering monaurally can be raised
by a masker entering in the other ear monaurally. This phenomenon is referred to as
central masking, because the effect occurs between the ears.
Spatial location can have a profound effect on the effectiveness of a masker. Many
studies have been performed in which unintelligible speech can be understood once the
source is separated in space from the interference. The effect holds whether the sources
are actually physically separated or perceptually separated through the use of interaural
time delay. Asynchrony of the onsets of two sounds has been shown to help prevent masking, as long as the onset difference does not fall within the realm of non-simultaneous masking. Each 10
msec increase in the inter-onset interval was perceived as being equal to a 10 dB increase
in the target's intensity [6].
Fusion
The concept of fusion must be included in any intelligent discussion of masking, because
the two are similar and often confused. In both cases, the distinct qualities of a sound are
lost, and both phenomena respond in the same manner to the same variables. In fusion,
like in masking, the target sound cannot be identified, but in fusion the masker takes on a
different quality. The typical masking experiment does not necessarily provide a
measure of perceptual fusion. In a fusion experiment, on the other hand, listeners are
asked whether they can or cannot hear the target in the mixture or, even better, to rate
how clearly they can hear the target there. What we want to know is whether the target
has retained its individual identity in the mixture. [6]
Fusion takes into consideration interactive global effects of two sound sources on each
other, instead of trying to reduce the situation to two separate and distinct entities.
Masking experiments are concerned with finding the threshold at which the target cannot
be identified, ignoring the effect of the target on the masker. Use of psychoacoustic
principles for the design of audio recording, reproduction, and data reduction devices
makes perfect sense. Audio equipment is intended for interaction with humans, with all
their abilities and limitations of perception. Traditional audio equipment attempts to
produce or reproduce signals with the utmost fidelity to the original. A more
appropriately directed, and often more efficient, goal is to achieve the fidelity perceivable
by humans. This is the goal of perceptual coders.
The core of a perceptual coder is the psychoacoustic model. Generally a psychoacoustic model performs a time-to-frequency mapping, determines maximum SPL levels, determines the threshold in quiet, identifies tonal and non-tonal components, decimates the maskers, calculates masking thresholds, determines global and minimum masking thresholds, and calculates signal-to-mask ratios.
Figure 2.20: MPEG audio compression and decompression
Although one main goal of digital audio perceptual coders is data reduction, this is not a
necessary characteristic. Perceptual coding can be used to improve the representation of
digital audio through advanced bit allocation.
Not all data reduction schemes are perceptual coders. Some systems, the DAT 16/12 scheme for example, achieve data reduction simply by reducing the word length, in this case cutting off four bits from the least-significant side of the data word and achieving a 25% reduction. The data reduction scheme in the present research work, however, uses a different approach based on partial sound synthesis that relies on the auditory phenomenon of varying human sensitivity to different frequencies. The mid-frequencies are modulated to baseband and the signal is downsampled to a lower sampling rate; later, an upsampling process interpolates the in-between samples, and the signal is demodulated back to its original spectral location. This research is not a perceptual-coding scheme, although it takes advantage of perceptual phenomena.
Out of a desire for simplicity, the first digital audio systems were wide-band systems,
tackling the entire audio spectrum at once. Presently, perceptual coders are multiband
systems, dividing up the spectrum in a fashion that mimics the critical bands of
psychoacoustics. By modeling human perception, perceptual coders can process signals
much the way humans do, and take advantage of phenomena such as masking.
When using adaptive differential pulse-code modulation (ADPCM), the frequency spectrum is divided into four bands to remove imperceptible material. Once a determination is made
as to what can be discarded, the remainder is allocated the available number of bits. This
process is called dynamic bit allocation.
History of Synthesis-Based Audio Data Reduction
Synthesis-based music data reduction is a relatively young area of research. Xavier Serra's sinusoidal plus stochastic residual noise model [2, 11] was an effective synthesis-based data reduction system. A similar idea was MQ synthesis by McAulay and Quatieri [5, 20, 26], where the application was data reduction for digital speech processing. This research work is about improving on or replacing these two powerful music and speech data reduction approaches; the third chapter forms the research project. Eric Scheirer developed a synthesis-based data reduction approach, the Structured Audio Orchestra Language (SAOL), as part of MPEG-4. Unlike these historical synthesis-based data reduction schemes, which employ complete synthesis techniques, our data reduction scheme is a partial synthesis scheme. There is no connection to SAOL in this research work; however, we review it briefly as part of the history of synthesis-based approaches to data reduction.
AN OVERVIEW OF SAOL (Structured Audio Orchestra Language)
SAOL is a powerful, flexible language for describing music synthesis, and integrating
synthetic sound with "natural" (recorded) sound in an MPEG-4 bit stream. MPEG-4
integrates the two common methods of describing audio on the internet today: streaming
low-bit rate coding and structured audio descriptions (like MIDI files).
SAOL lives within the MPEG-4 paradigm of streaming data and decoding processes.
Thus, the Structured Audio toolset is not only a method of synthesis, but a streaming
format appropriate for internet-based (or any other channel) transmission of audio data.
The saolc package contains a program for encoding scores and orchestras into the streaming format, and a facility for decoding this format.
MPEG-4 Structured Audio has its roots in another Media Lab project called Netsound,
developed by Michael Casey and other members of the Machine listening group at the
MIT Media Lab in 1995-1996. NetSound has similar concepts to MPEG-4 Structured
Audio but uses Csound developed by Barry Vercoe for synthesis.
There are five major elements to the Structured Audio toolset:
• The Structured Audio Orchestra Language (or SAOL) is a digital-signal processing
language that allows for the description of arbitrary synthesis and control algorithms
as part of the content bit stream. The syntax and semantics of SAOL are standardized
here in a normative fashion.
• The Structured Audio Score Language, (or SASL) is a simple score and control
language which is used in certain profiles to describe the manner in which sound-
generation algorithms described in SAOL are used to produce sound.
• The Structured Audio Sample Bank Format (or SASBF). This format allows for the transmission of banks of audio samples to be used in wavetable synthesis, along with the description of simple processing algorithms to use with them.
• A normative scheduler description. The scheduler is the supervisory run-time element
of the Structured Audio decoding process. It maps structural sound control, specified
in SASL or MIDI, to real-time events dispatched using the normative sound-
generation algorithms.
• Normative reference to the MIDI standards, standardized externally by the MIDI
Manufacturers Association. MIDI is an alternate means of structural control which
can be used in conjunction with or instead of SASL. Although less powerful and
flexible than SASL, MIDI support in this standard provides important backward-
compatibility with existing content and authoring tools. [16]
Our research does not focus on synthesis methods that use an audio/music description language, nor does it include a package for writing music scores; such a project is reserved as a future extension of this work. Our method transmits PCM samples from the band to which we are most sensitive perceptually and synthesizes the spectral regions to which we are less sensitive. Details of the project implementation follow in the next chapter.
CHAPTER-3
THE RESEARCH PROJECT
A High-Fidelity Audio Data Reduction Scheme Using Partial Sinusoidal Modeling Synthesis (PSMS)
While perceptual coding is based on the energy levels of our perception and the masking phenomenon, the synthesis-based data reduction approach presented here is based on exploiting our complex pitch perception. In other words, the former considers the vertical amplitude scale of hearing, coding the bits we hear according to a loudness criterion, whereas the latter considers the horizontal frequency scale, making use of the complex pitch-perception mechanisms that our auditory system somehow manages to perform. This project takes advantage of the auditory system's complexity in order to engineer the music product.
A real-world example
We hear music inside a room. When we come out of the room and close the door, we still hear the music, with an obvious attenuation in the perceived energy level. The door acts as an attenuator; it also filters out some frequency components. Since our auditory system is sensitive to mid frequencies, it still perceives most of the mid-frequency content. We hear the pitch (fundamental) but not necessarily the timbre. The music we perceive after the door is closed is essentially the mid-frequency spectrum. In the proposed scheme we transmit the mid-frequency content, while the low and high frequencies are synthesized. The physical spectrum of the synthesized parts may differ widely or slightly over each short time period. However, this variation in spectral shape does not mean that the sound will be perceived as different from the original: the synthesized content will sound very close to the original because our auditory system does not follow spectral shapes exactly. Hence we make use of our inability to follow the complex pitch, the missing-fundamental concept, and the other concepts discussed in detail in Chapter 2.
Another example is our daily conversation over telephones and mobile phones, where we hear only a limited band of the speech signal, roughly 300 Hz to 3.4 kHz; this limited band is what makes the transmission practical. The aim of this project is not to carry out fundamental research on the complex topic of pitch perception in the auditory system, but to make use of those established ideas to engineer a music compression and synthesis system. Frequency is a literal measurement; pitch is not. Pitch is a subjective, complex characteristic based on frequency as well as other physical quantities such as waveform and intensity. For example, if a 200-Hz sine wave is sounded at a soft and then a louder level, most listeners will agree that the louder sound has a lower pitch. In fact, a 10% increase in frequency might be necessary to maintain a listener's subjective evaluation of constant pitch at low frequencies. On the other hand, in the ear's most sensitive region, 1 to 5 kHz, there is almost no change in pitch with loudness. Also, with musical tones, the effect is much less pronounced [8].
Figure 3.1: Perceptual coding approach vs. synthesis-based approach
Humans can identify pitch defects very easily in the sensitive mid-frequency region, but in this work we make no approximations or alterations to the mid-frequency samples. We transmit them as they are, thereby avoiding the possibility of creating a defective signal. This chapter covers four major topics:
1. the two-filter method;
2. the frequency-resolution downsampling method;
3. the four-filter method;
4. the advantages of using a sine-plus-noise model in the two-filter method.
FLETCHER AND MUNSON CONTOURS: DATA SETS
In order to analyze the success of the four methods listed above, visual Fletcher-Munson plots are useful. The full data set is therefore included in the figures below.
Figure 3.2: Fletcher Munson original curves (Fig 3 [1])
Figure 3.3: Figure 2 mentioned in [1]
From the curves in Figure 3.3, loudness-level contours can be drawn. The first set of loudness-level contours is plotted with levels above the reference threshold as ordinates. For example, the zero-loudness-level contour corresponds to the points where the curves of Figure 3.3 intersect the abscissa. The number of decibels above these points is plotted as the ordinate in the loudness-level contours shown in Figure 3.4 [1].
Figure 3.4: Figure 3 in [1]
In Figure 3.2 similar sets of loudness level contours are shown using intensity levels as
ordinates.
Figure 3.5: Matlab plot of figure 3.4
The formulae for computing the Figure 3 plots of the 1933 Fletcher-Munson paper are given in Table BI of the paper cited in [22]. These are nonlinear regression formulas, fit to the raw data of the 1933 paper rather than to the published curves, which are heavily smoothed.
y = C3*x^3 + C2*x^2 + C1*x + C0,
where C0, C1, C2, and C3 are polynomial coefficients:
y(1)  = 7.46169*(10^-5)*x^3 - 0.00984189*x^2 + 0.74629*x + 0.425879;   (62 Hz)
y(2)  = 5.2594*(10^-5)*x^3 - 0.00654132*x^2 + 0.800557*x + 0.295663;   (125 Hz)
y(3)  = 0.00124457*x^2 + 0.720323*x + 0.780066;   (250 Hz)
y(4)  = 0.00209933*x^2 + 0.761911*x + 0.467849;   (500 Hz)
y(5)  = x;   (1000 Hz)
y(6)  = -0.0011956*x^2 + 1.14141*x - 0.622967;   (2000 Hz)
y(7)  = -0.00240718*x^2 + 1.2314*x - 0.393083;   (4000 Hz)
y(8)  = -0.00272458*x^2 + 1.24014*x + 0.481007;   (5650 Hz)
y(9)  = -0.00232339*x^2 + 1.20659*x + 0.0426691;   (8000 Hz)
y(10) = -0.002439*x^2 + 1.24474*x - 1.51871;   (11300 Hz)
y(11) = -0.000566296*x^2 + 1.03446*x - 1.82771;   (16000 Hz)
Phon | 62 Hz | 125 Hz | 250 Hz | 500 Hz | 1 kHz | 2 kHz | 4 kHz | 5.65 kHz | 8 kHz | 11.3 kHz | 16 kHz
  0  |  48   |  32    |  20    |   5    |   0   |  -24  |  -32  |   -20    |  -12  |  -10     |   8
 10  |  52   |  40    |  28    |  14    |  10   |    7  |    2  |    10    |   20  |   22     |  30
 20  |  60   |  48    |  35    |  25    |  16   |   20  |   12  |    20    |   30  |   32     |  38
 30  |  65   |  52    |  42    |  32    |  28   |   30  |   22  |    30    |   40  |   42     |  50
 40  |  68   |  60    |  48    |  42    |  40   |   40  |   35  |    42    |   52  |   50     |  58
 50  |  71   |  64    |  55    |  49    |  50   |   50  |   46  |    55    |   60  |   65     |  66
 60  |  75   |  68    |  65    |  60    |  60   |   60  |   56  |    62    |   72  |   76     |  79
 70  |  80   |  72    |  70    |  71    |  70   |   70  |   63  |    72    |   82  |   83     |  85
 80  |  85   |  82    |  81    |  80    |  81   |   80  |   75  |    82    |   89  |   90     |  95
 90  |  90   |  91    |  90    |  90    |  88   |   90  |   83  |    90    |   96  |   99     | 108
100  | 100   | 100    | 102    | 104    |  98   |  100  |   90  |    99    |  104  |  105     | 111
110  | 111   | 110    | 111    | 113    | 108   |  110  |   98  |   105    |  110  |  112     | 118
Table 3.1: Fletcher-Munson data set: sound pressure levels (dB) for a given loudness level (phons); columns follow the frequency order of the formulas above
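As an illustration of how these regression formulas can be evaluated, the following MATLAB sketch restates the coefficients above and plots one curve per frequency. The range of the independent variable and the plot labels are assumptions for illustration only; x and y are to be interpreted as in [22].

    % Sketch: evaluate the regression polynomials above (one per frequency).
    freqs = [62 125 250 500 1000 2000 4000 5650 8000 11300 16000];
    C = [ 7.46169e-5  -0.00984189   0.74629    0.425879;    % 62 Hz
          5.2594e-5   -0.00654132   0.800557   0.295663;    % 125 Hz
          0            0.00124457   0.720323   0.780066;    % 250 Hz
          0            0.00209933   0.761911   0.467849;    % 500 Hz
          0            0            1          0;           % 1000 Hz
          0           -0.0011956    1.14141   -0.622967;    % 2000 Hz
          0           -0.00240718   1.2314    -0.393083;    % 4000 Hz
          0           -0.00272458   1.24014    0.481007;    % 5650 Hz
          0           -0.00232339   1.20659    0.0426691;   % 8000 Hz
          0           -0.002439     1.24474   -1.51871;     % 11300 Hz
          0           -0.000566296  1.03446   -1.82771];    % 16000 Hz
    x = 0:1:110;                                 % assumed range of the independent variable
    y = zeros(numel(freqs), numel(x));
    for k = 1:numel(freqs)
        y(k,:) = polyval(C(k,:), x);             % y = C3*x^3 + C2*x^2 + C1*x + C0
    end
    plot(x, y); grid on; xlabel('x'); ylabel('y');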
THE GENERAL PROCEDURE
Two-Filter Method
Figure 3.6: The schematic block diagram employed in our synthesis-based data reduction (two-filter method).
We are most sensitive to mid frequencies around 3300 Hz. The input audio signal is filtered with a band-pass filter with cutoff frequencies of 0.9 kHz and 6.1 kHz, which passes these mid frequencies. A band-elimination filter with cutoff frequencies of 1.1 kHz and 5.9 kHz removes the mid band from the input signal. A 200-Hz overlap is set between the transition bands to avoid phase distortion and spectral leakage. The output of the band-pass filter is the most sensitive data that we hear. The spectral content above the threshold of hearing is modulated to baseband and downsampled in time. The baseband signal is transmitted over the communication channel, and at the receiver it is translated back to the original pass band. The band-pass filter (0.9 kHz and 6.1 kHz) and band-elimination filter (1.1 kHz and 5.9 kHz) can be designed according to one's needs; the cutoff frequencies here were chosen based on the Fletcher-Munson curves. Moreover, the Fletcher-Munson curves provide information about the threshold of hearing that helps the peak-detection algorithm.
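A minimal MATLAB sketch of this two-filter split follows. The Butterworth designs, the filter order, and the 44.1-kHz sampling rate are illustrative assumptions, not the exact filters used in the implementation.

    % Sketch: split the input into the sensitive mid band and the residual LF+HF band.
    fs = 44100;                                   % assumed sampling rate (Hz)
    x  = randn(fs, 1);                            % placeholder input; replace with audioread('input.wav')
    [bBP, aBP] = butter(4, [900 6100]/(fs/2), 'bandpass');  % 0.9-6.1 kHz passed
    [bBS, aBS] = butter(4, [1100 5900]/(fs/2), 'stop');     % 1.1-5.9 kHz eliminated
    midBand  = filter(bBP, aBP, x);               % transmitted as PCM after modulation
    restBand = filter(bBS, aBS, x);               % LF + HF content for the sinusoidal model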
The output of the band-rejection filter forms the less sensitive data, which is the input to a sinusoidal/spectral model [2]. A short-time Fourier transform is applied to windowed time frames with a 75% frame overlap. The window length is chosen so that it captures more than one period of the signal. The 75% overlap is set to avoid signal leakage: normally a minimum of 75% overlap for Hamming windows and 50% overlap for rectangular windows is needed, because the main-lobe width of the Hamming window is four bins and that of the rectangular window is two bins. The hop size is the analysis frame length divided by the main-lobe width in bins. In any analysis-synthesis method the choice of window is critical; in this case the Kaiser and Hanning windowing schemes turned out to be more successful than the others. The different windowing schemes are discussed in detail later in this chapter.
The short-time Fourier spectrum is computed and the prominent peaks in the power spectrum are picked. A peak is defined here as a local maximum. Occasionally two peaks may lie very close to each other; in such cases the larger of the two peaks is kept and the other is discarded. This helps maintain the frequency characteristic of a sinusoid when connecting the peaks into frequency-distinct tracks.
Figure 3.7: The two-filter method: band-pass and band-elimination filters. Violin spectrum, Fs = 44.1 kHz, mono, 16-bit.
Modulation and Demodulation: MF Band
The output of the band-pass filter carries the most sensitive data, which is transferred through the channel as PCM audio samples. The sensitive data is modulated to baseband with a sampling frequency of twice the highest frequency in the baseband, and the baseband signal is downsampled in time by a factor of four. This reduces the data sent through the channel. At the receiver, the data is upsampled by a factor of four to recover the original length, translated back to the original pass band at the sampling rate of the input mid band, and fused with the less sensitive data that forms the output of the oscillator at the synthesis end. Extreme care must be taken with the sensitive data, because any artifact in it will be clearly audible.
Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16-bit, mono, 705 kbps)
In Figure 3.8, a downsampling factor of two is used; with a factor of four, greater data reduction can be obtained.
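The sketch below illustrates the frequency-translation idea in MATLAB for a factor of four, reusing midBand from the earlier sketch. It uses a complex (analytic) baseband for clarity, whereas the thesis transmits real PCM samples, and the shift frequency and interpolation filter are assumptions.

    % Sketch: shift the sensitive mid band to baseband, downsample it for
    % transmission, then undo both steps at the receiver.
    fs = 44100; M = 4; f0 = 900;                 % assumed rate, factor, and shift frequency
    n  = (0:length(midBand)-1).';
    z  = hilbert(midBand) .* exp(-1j*2*pi*f0*n/fs);   % analytic signal moved down to ~0-5.2 kHz
    tx = z(1:M:end);                             % downsampled baseband samples sent over the channel

    rx = zeros(size(z)); rx(1:M:end) = tx;       % receiver: upsample by zero insertion
    [bLP, aLP] = butter(6, 5500/(fs/2));         % interpolation low-pass (assumed cutoff)
    rx = M * filter(bLP, aLP, rx);
    midRecovered = real(rx .* exp(1j*2*pi*f0*n/fs));  % back to the 0.9-6.1 kHz pass band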
Channel
The modulated mid-frequency content is transmitted over the communication channel. During transmission, channel noise, which consists of transmission noise and reception noise, will corrupt the original data.
Partial Sinusoidal Modeling Synthesis (PSMS): LF and HF band
The main advantage of spectrum-modeling techniques is the existence of analysis procedures that extract the synthesis parameters from real sounds, making it possible to reproduce and modify actual sounds. SMS is based on modeling sounds as stable sinusoids (partials) plus noise (a residual component), analyzing sounds with this model and generating new sounds from the analyzed data. Before the analysis begins, the band-elimination filter removes the mid-frequency spectral data to which we are most sensitive. The analysis procedure detects partials by studying the time-varying spectral characteristics of a sound and represents them with time-varying sinusoids. The synthesis procedure is an additive synthesis method in which the instantaneous amplitudes, frequencies and phases are fed into separate oscillators and all sinusoids are summed frame by frame. In audio-signal spectrum modeling, the aim is to transform a signal into a more easily applicable form, removing information that is irrelevant to perception; sufficient time and frequency resolution are difficult to achieve simultaneously. The standard pulse-code-modulated (PCM) signal, which basically describes the sound-pressure levels reaching the ear, is not a good representation for the analysis of sounds. A general approach is to use spectrum modeling, a suitable middle-level representation, to transform the signal into a form that can be generated easily from the PCM signal but from which higher-level information can also be obtained more easily. The sinusoids-plus-noise model is one such representation. The sinusoidal part exploits the physical properties of general resonating systems by representing the resonating components with sinusoids; the noise model exploits the inability of humans to perceive the exact spectral shape or phase of stochastic signals. The sinusoids-plus-noise model is able to remove irrelevant data and encode signals at a lower bit rate, and it has been used successfully in audio and speech coding [3]. This project focuses on a purely sinusoidal model; PSMS using a sine-plus-noise model for the less sensitive data will be a future extension of the project.
A short theoretical overview of the process and the results obtained during the step-by-step implementation follow. First, the input signal is analyzed to obtain the time-varying amplitudes, frequencies and phases of the sinusoids; then the sinusoids are synthesized. In the parametric domain, modifications can be made to produce effects such as pitch shifting or time stretching. The analysis of the sinusoids is the most complex part of the system. First, the input signal is divided into partly overlapping, windowed frames. Second, the short-time spectrum of each frame is obtained by taking a discrete Fourier transform (DFT/FFT). The spectrum is analyzed, prominent spectral peaks are detected, and their parameters (amplitudes, frequencies, and phases) are estimated. Once the parameters of the detected sinusoidal peaks are estimated, the peaks are connected to form interframe trajectories. A peak-continuation algorithm tries to find the appropriate continuations for existing trajectories among the peaks of the next frame. The resulting sinusoidal trajectories contain all the information required for resynthesis: the sinusoids can be synthesized by interpolating the parameters of the trajectories and summing the resulting waveforms in the time domain.
The next phase of partial SMS includes magnitude- and phase-spectra computation, peak detection, peak continuation, cubic-spline interpolation between time frames, modification of the analysis data, synthesis, and fusion of the sensitive and less sensitive data. The first step is the computation of the magnitude and phase spectra.
FFT Analysis of Low and High Frequency Bands
The computation of the magnitude and phase spectra of the current frame is the first step in the analysis. The control parameters for the STFT (window size, window type, FFT size, and frame rate) have to be set in accordance with the sound to be processed. First of all, good spectral resolution is needed, since the process that tracks the partials has to be able to identify the peaks.
If a PSMS involves a sine-plus-noise model, the phase information is particularly important for subtracting the deterministic component to find the residual; in that case we should use an odd-length analysis window, and the windowed data should be centered at the origin of the FFT buffer to obtain a phase spectrum free of the linear phase trend induced by the window ("zero-phase" windowing).
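A minimal MATLAB sketch of the per-frame spectrum computation with zero-phase windowing follows, using restBand from the earlier sketch; the odd window length, the FFT size, and the Hanning window are illustrative assumptions.

    % Sketch: magnitude and phase spectrum of one frame with zero-phase windowing.
    M = 1023;  N = 2048;                       % odd window length, zero-padded FFT size
    w = hanning(M);
    frame = restBand(1:M) .* w;                % one windowed frame of the LF+HF signal
    half = (M-1)/2;
    buf = zeros(N, 1);
    buf(1:half+1)   = frame(half+1:end);       % window centre placed at the FFT origin
    buf(N-half+1:N) = frame(1:half);           % first half of the frame wrapped to the end
    X = fft(buf);
    magSpec   = 20*log10(abs(X(1:N/2)) + eps); % magnitude spectrum in dB
    phaseSpec = angle(X(1:N/2));               % phase, free of the window's linear trend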
Figure 3.9: Original and Windowed short time signals, Fourier analysis (Hanning
window)
The time-frequency compromise of the STFT must be well understood. For the deterministic analysis, it is important to have enough frequency resolution to resolve the partials of the sound. For the stochastic analysis, the frequency resolution is not as important, since we are not interested in particular frequency components and are more concerned with good time resolution. This can be accomplished by using different parameters for the deterministic and stochastic analyses. In this project we use a purely sinusoidal model rather than a sine-plus-noise model; the reason is discussed in a later section.
Figure 3.10: Magnitude and Phase spectrum of LF and HF bands
The computation of the spectra is carried out by the short-time Fourier transform
technique.
Window choice
The window choice plays an important role in any analysis-synthesis system. The choice depends on the problem at hand, the overlap rate, the precision required, and the computational cost. Popular windows such as the Hamming require a 75% overlap, i.e., the hop size is 25% of the analysis frame length; the rectangular window needs at least 50% overlap. For steady-state sounds we should use long windows (several periods) with good side-lobe rejection (for example, Blackman-Harris, 92 dB) for the deterministic analysis. This gives good frequency resolution and therefore a good measure of the frequencies of the partials. For harmonic sounds, the actual size of the window changes as the pitch changes, in order to assure a constant time-frequency trade-off for the whole sound.
The choice of analysis window determines the trade-off of time versus frequency resolution, which affects the smoothness of the spectrum and the detectability of the different sinusoidal components. The most commonly used windows are the rectangular, Hamming, Hanning, Kaiser, Blackman and Blackman-Harris windows.
All the standard windows are real and symmetric and have a frequency spectrum with a sinc-like shape. For the purposes of SMS, and in general for any sound analysis/synthesis application, the choice of window is mainly determined by two of its spectral characteristics: the width of the main lobe, defined for present purposes as the number of bins between the zero crossings on either side of the main lobe when the DFT length equals the window length, and the highest side-lobe level, which measures the gain of the highest side lobe relative to the main lobe. Ideally, we want a narrow main lobe (i.e., good frequency resolution) and a very low side-lobe level; the choice of window determines this trade-off. The rectangular window has the narrowest main lobe (two bins), but its first side lobe is very high, only 13 dB below the main-lobe peak. The Hamming window has a wider main lobe, four bins, and its highest side lobe is 43 dB down. A very different window, the Kaiser, allows control of the trade-off between main-lobe width and highest side-lobe level: if a narrower main lobe is desired, the side-lobe level will be higher, and vice versa. Since control of this trade-off is valuable, the Kaiser window is a good general-purpose choice. The window length must be sufficient to resolve the most closely spaced sinusoidal frequencies; a nominal choice for periodic signals is about four periods [11].
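The Kaiser trade-off can be explored directly, as in the short MATLAB sketch below; the window length and beta values are illustrative.

    % Sketch: Kaiser window trade-off between main-lobe width and side-lobe level.
    M = 511;  N = 4096;
    for beta = [4 6 9]                            % larger beta: lower side lobes, wider main lobe
        w = kaiser(M, beta);
        W = 20*log10(abs(fft(w, N)) / sum(w));    % normalised window spectrum in dB
        plot(0:N/2-1, W(1:N/2)); hold on;
    end
    xlabel('FFT bin'); ylabel('Magnitude (dB)');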
DFT Computation
Once a section of the waveform has been windowed, the next step is to compute its spectrum using the DFT. For practical purposes the FFT should be used whenever possible, but this requires the length of the analyzed signal to be a power of two. This can be accomplished by taking any desired window length and "zero padding," i.e., filling with zeros out to the length required by the FFT. This not only allows use of the FFT algorithm, but also computes a smoother spectrum: zero padding in the time domain corresponds to interpolation in the frequency domain.
The size of the FFT, N, is normally chosen to be the first power of two that is at least twice the window length M, with the difference N - M filled with zeros. If B is the number of samples in the main lobe when the zero-padding factor is 1 (N = M), then a zero-padding factor of N/M gives B*N/M samples across the same main lobe (and the same main-lobe bandwidth). The zero-padding (interpolation) factor N/M should be large enough to enable an accurate estimation of the true maximum of the main lobe. That is, since the window length is not an exact number of periods for every sinusoidal frequency, the spectral peaks do not, in general, occur at FFT bin frequencies (multiples of Fs/N); the bins must therefore be interpolated to estimate the peak frequencies. Zero padding is one type of spectral interpolation [11].
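Under this convention the FFT size for a given window length can be chosen as in the brief sketch below, reusing the windowed frame and sampling rate from the earlier sketch.

    % Sketch: pick the FFT size and zero-pad the windowed frame.
    M = 1023;                              % analysis window length (assumed)
    N = 2^nextpow2(2*M);                   % first power of two >= 2*M (2048 here)
    X = fft(frame, N);                     % fft zero-pads the frame out to length N
    binHz = fs/N;                          % spacing of the FFT bin frequencies (Fs/N)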
Choice of Hop Size
Once the spectrum has been computed at a particular frame of the waveform, the STFT hops along the waveform and computes the spectrum of the next section of the sound. This hop size H is an important parameter, and its choice depends very much on the purpose of the analysis. In general, more overlap gives more analysis points and therefore smoother results across time, but the computational expense is proportionally greater. A general and valid criterion is that successive frames should overlap in time in such a way that all the data are weighted equally. A good choice is the window length divided by the main-lobe width in bins; for example, a practical value for the Hamming window is a hop size equal to one fourth of the window size [11].
Peak Detection in PSMS
The input sound has already been filtered around the region of greatest sensitivity of the human ear, and the remaining low- and high-frequency bands are analyzed here. Once the spectrum of the current frame is computed, the next step is to detect its prominent magnitude peaks. Theoretically, a sinusoid that is stable both in amplitude and in frequency (a partial) has a well-defined frequency representation: the transform of the analysis window used to compute the Fourier transform.
It should be possible to take advantage of this characteristic to distinguish partials from
other frequency components. However, in practice, this is rarely the case, since most
natural sounds are not perfectly periodic and do not have nicely spaced and clearly
defined peaks in the frequency domain.
Figure 3.11: Peak detection in LF and HF bands
There are interactions between the different components, and the shapes of the spectral
peaks cannot be detected without tolerating some mismatch. Only some instrumental
sounds (e.g., the steady-state part of an oboe sound) are periodic enough and sufficiently
free from prominent noise components that the frequency representation of a stable
sinusoid can be recognized easily in a single spectrum. A practical solution is to detect as many peaks as possible and delay the decision of what is a deterministic, or "well-behaved," partial to the next step in the analysis: the peak-continuation algorithm [2]. In this project, however, we track all the available trajectories, as the McAulay-Quatieri algorithm does.
A "peak" is defined as a local maximum in the magnitude spectrum, and the only
practical constraints to be made in the peak search are to have a frequency range and a
magnitude threshold. In fact, we should detect more than what we hear and get as many
sample bits as possible from the original sound, ideally more than 16. The measurement
of very soft partials, sometimes more than 80dB below maximum amplitude, will be
difficult and they will have little resolution. These peak measurements are very sensitive
to transformations, because as soon as modifications are applied to the analysis data,
parts of the sound that could not be heard in the original can become audible. The
original sound should be as clean as possible and have the maximum dynamic range, and
then the magnitude threshold can be set to the amplitude of the background noise floor.
Due to the sampled nature of the spectra returned by the FFT, each peak is accurate only to within half a sample. A spectral sample represents a frequency interval of Fs/N Hz, where Fs is the sampling rate and N is the FFT size. Zero padding in the time domain increases the number of spectral samples per Hz and thus increases the accuracy of simple peak detection. However, to obtain frequency accuracy on the level of 0.1% of the distance from the top of an ideal peak to its first zero crossing (in the case of a rectangular window), a zero-padding factor of about 1000 would be required. A more efficient scheme is to zero-pad only enough that quadratic (or other simple) spectral interpolation, using only the samples immediately surrounding the maximum-magnitude sample, suffices to refine the estimate to 0.1% accuracy.
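A minimal MATLAB sketch of this quadratic refinement, applied to a local maximum of the dB magnitude spectrum computed earlier, is given below; variable names follow the earlier sketches.

    % Sketch: parabolic interpolation of a spectral peak (magnitudes in dB).
    [~, k] = max(magSpec);                 % bin index of a detected local maximum (1 < k < N/2)
    a = magSpec(k-1); b = magSpec(k); c = magSpec(k+1);
    p = 0.5*(a - c)/(a - 2*b + c);         % fractional bin offset, -0.5 <= p <= 0.5
    peakFreq = (k - 1 + p)*fs/N;           % refined peak frequency in Hz
    peakAmp  = b - 0.25*(a - c)*p;         % refined peak magnitude in dB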
Figure 3.12: Missed Peaks
In real cases, some peaks might lie below the threshold of hearing; partials that were not audible before modifications may become clearly perceivable after modifications. SMS recommends going 80 dB below the threshold of hearing. However, the sinusoidal model used in this project is not a sine-plus-noise model, and the peak-detection algorithm only finds the prominent peak around a local maximum separated from other peaks by a specified distance. We therefore went 10 or 20 dB below the threshold of hearing to pick the peaks that were missed. In this project we do not deal much with modifications, because this is a data reduction scheme.
Figure 3.13: Peaks below threshold. Peaks picked and their corresponding phase matches; some peaks below the threshold were missed, a problem corrected by setting the threshold 10 dB below the actual threshold.
Peak Continuation
Once the spectral peaks corresponding to the low frequency and high frequency bands of
the current frame have been detected, the peak continuation algorithm adds them to the
incoming peak trajectories. The basic idea of the algorithm is that a set of "guides"
advances in time through the spectral peaks, looking for the appropriate ones (according
to the specified constraints) and forming trajectories out of them. Thus, a guide is an
abstract entity which is used by the algorithm to create the trajectories and the trajectories
are the actual result of the peak continuation process. The instantaneous state of the
guides, their frequency, phase and magnitude, are continuously updated as the guides are
turned on, advanced, and finally turned off. The schemes used in the sinusoidal model
(McAulay and Quatieri, 1984; 1986) [5] find peak trajectories both in the noise and
deterministic parts of a waveform, thus obtaining a sinusoidal representation for the
whole sound. These schemes are unsuitable when we want the trajectories to follow just
the partials. For example, when the partials change in frequency substantially from one
frame to the next, these algorithms easily switch from the partial that they were tracking
to another one which at that point is closer. In this project, the user supplies some parameters as input to the algorithm, so the process of tracking the musical and noise components is not fully automatic. The specifications of these parameters are given in the following sections.
Initial Guides
With this parameter, the user specifies the approximate frequency of the partials that are
known to be present in the sound, thus reserving guides for them. The algorithm adds
new guides to this initial set as it finds them. When no initial guides are specified, the
algorithm creates all of them. Another method is to create initial guides at equal intervals and allow the algorithm to update the guides using the data set being tracked.
Maximum Peak Deviation
Guides advance through the sound, selecting peaks. This parameter controls the maximum allowable frequency distance from a peak to the guide that selects it. It is useful to make this parameter a function of frequency in such a way that the allowable distance is larger for higher frequencies than for lower ones. The deviation can thus follow a logarithmic scale, which is perceptually more meaningful than a linear frequency scale. However, since the model used in this project tracks both noisy components and meaningful musical components, we do not use a logarithmic scale.
Peak Contribution to Guide
The frequency of each guide does not have to correspond to the frequency of the actual trajectory; it is updated every time it incorporates a new peak. This parameter is a number from 0 to 1 that controls how much the guide frequency changes when a new peak is incorporated. That is, given that the current guide has a frequency f̃, it determines the guide's value when it incorporates a peak with frequency g. If the value of the parameter is 1, the value of the guide f̃ is updated to g, so the peak makes the maximum contribution. If the value of the parameter is smaller, the contribution of the peak is correspondingly smaller: the new value falls between the current value f̃ and g. This parameter is useful, for example, to confine a guide to a narrow frequency band.
Maximum Number of Guides
This is the maximum number of guides used by the peak-continuation process at each particular moment in time. The total number of guides over the whole sound may be larger, because when a guide is turned off a new one can take its place.
Minimum Starting Guide Separation
A new guide can be created at any frame from a peak that has not yet been incorporated
into any existing guide. This parameter specifies the minimum required frequency
separation from a peak to the existing guides in order to create a new guide at that peak.
Consequently, through this parameter peaks which are very close to existing guides can
be rejected as candidates for starting guides.
Maximum Sleeping Time
When a guide has not found a continuation peak for a certain number of frames, the guide
is killed. This parameter specifies the maximum “non active” time, that is, the maximum
number of frames that the guide can be alive while not finding continuation peaks.
Maximum Length of Filled Gaps
Given that a certain sleeping time is allowed, we may wish to fill the resulting gaps. This parameter specifies the length of the largest gap to be filled (a number smaller than or equal to the maximum sleeping time). The gaps are filled by interpolating between the end points of the trajectory.
Minimum Trajectory Length
Once all the trajectories are created, this parameter controls the minimum trajectory
length. All trajectories shorter than this length are deleted.
The last two specifications are optional, and they were not used in this research work. The short trajectories that a minimum-trajectory-length constraint would delete constitute most of the noise-like part of the signal, so these specifications become useful only when a sine-plus-noise model is employed.
To describe the peak-continuation algorithm, assume that the frequency guides have been initialized with the initial guides and that they are currently at frame n. Suppose that the guide frequencies at the current frame are f̃1, f̃2, f̃3, ..., f̃p, where p is the number of existing guides. We want to continue the p guides through the peaks of frame n, whose frequencies are g1, g2, g3, ..., gm, thus continuing the corresponding trajectories. There are three steps in the algorithm: (1) guide advancement, (2) update of guide values, and (3) start of new guides. These steps are described next.
Guide Advancement
Each guide is advanced through frame n by finding the peak closest to its current value: the rth guide claims the frequency gi for which |f̃r - gi| is a minimum. The change in frequency must be less than the maximum peak deviation. The possible situations are as follows:
1. If a match is found within the maximum deviation, the guide is continued (unless there is a conflict to resolve), and the selected peak is incorporated into the corresponding trajectory.
2. If no match is found, it is assumed that the corresponding trajectory must "turn off" entering frame n, and its current frequency is matched to itself with zero magnitude. Since the trajectory amplitudes are linearly ramped from one frame to the next, the terminating trajectory ramps to zero over the duration of one hop size. Whether the guide itself is "killed" or not depends on the maximum sleeping time.
3. If a guide finds a match which has already been claimed by another guide, we
give the peak to the guide that is closest in frequency, and the “loser” looks for
another match. If the current guide loses the conflict, it simply picks the best
available non-conflicting peak which is within the maximum peak deviation. If
the current guide wins the conflict, it calls the assignment procedure recursively
on behalf of the dislodged guide. When the dislodged guide finds the same peak
and wants to claim it, it sees there is a conflict which it loses and moves on. This
process is repeated for each guide, solving conflicts recursively, until all possible
matches are made.
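A simplified MATLAB sketch of the guide-advancement step is given below; it assigns each guide, in turn, to the nearest unclaimed peak within the maximum peak deviation and omits the recursive conflict resolution of step 3, so it is only an approximation of the procedure, with illustrative variable names.

    % Sketch: advance each guide to the nearest unclaimed peak of frame n.
    % guides : current guide frequencies (Hz);  peaks : peak frequencies of frame n (Hz)
    % maxDev : maximum peak deviation (Hz)
    match   = zeros(size(guides));               % 0 means no continuation found
    claimed = false(size(peaks));
    for r = 1:numel(guides)
        [dev, i] = min(abs(peaks - guides(r)) + 1e12*claimed);   % skip claimed peaks
        if dev <= maxDev
            match(r) = i;                        % peak i continues trajectory r
            claimed(i) = true;
        end                                      % otherwise the trajectory ramps to zero
    end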
Update of guide values
Once all the existing guides and their trajectories have been continued through frame n, the guide frequencies are updated. There are two possible situations:
1. If a guide finds a continuation peak, its frequency is updated from f̃r to h̃r according to

       h̃r = α(gi - f̃r) + f̃r,   α ∈ [0, 1],

   where gi is the frequency of the peak that the guide has found at frame n and α is the peak contribution to the guide. When α is 1, the frequency of the peak trajectory is the same as the frequency of the guide, and the difference between guide and trajectory is lost.
2. If a guide does not find a continuation peak for the maximum sleeping time, the guide is killed at frame n; if it is still within the sleeping time, it keeps the same value (the value can be negated in order to remember that it has not found a peak). When the maximum sleeping time is 0, any guide that does not find a continuation peak at frame n is killed. To distinguish the guides that find a continuation peak from those that do not but are still alive, we refer to the former as active guides and the latter as sleeping guides.
Start of New Guides
New guides, and therefore new trajectories, are created from the peaks of frame n that are
not incorporated into trajectories by the existing guides. If the number of current guides is smaller than the maximum number of guides, a new guide can be started.
A guide is created at frame n by searching through the "unclaimed" peaks of the frame for the one with the highest magnitude that is separated from every existing guide by at least the minimum starting guide separation. The frequency of the selected peak becomes the frequency of the new guide. The actual trajectory is started in the previous frame, n-1, where its amplitude is set to 0 and its frequency to the current frequency, thus ramping in amplitude up to the current frame. This process is repeated until there are no more unclaimed peaks in the current frame or the number of guides has reached the maximum number of guides.
In order to minimize the creation of guides with little chance of surviving, a temporary
buffer is used for the starting guides. The peaks selected to start a trajectory are stored
into this buffer and continued by only using peaks that have not been taken by the
“consolidated” guides. Once these temporary guides have reached a certain length they
become “normal guides”.
Figure 3.14: Peak-continuation process. Here, g represents the guides and p the spectral peaks. The magnitude, frequency, and phase information at p form the input to the oscillator.
For harmonic sounds these guides can be created at the beginning of the analysis, setting their frequencies according to the harmonic series of the first fundamental found; for inharmonic sounds each guide is created when it finds the first available peak. When a fundamental has been found in the current frame, the guides can use this information to update their values. The guides can also be modified depending on the last peak incorporated. Therefore, by using the current fundamental and the previous peak we control the adaptation of the guides to the instantaneous changes in the sound. For a very harmonic sound, since all the harmonics evolve together, the fundamental should be the main control; but when the sound is not very harmonic, or the harmonics are not locked to each other and we cannot rely on the fundamental as a strong reference for all the harmonics, the information from the previous peak should carry a greater weight.
However, we do not focus only on harmonic partials; this project uses an algorithm that tracks any sound, harmonic or inharmonic. Each peak is assigned to the guide that is closest to it and that is within a given frequency deviation. If a guide does not find a match, it is assumed that the corresponding trajectory must "turn off." In inharmonic sounds, if a guide has not found a continuation peak for a given amount of time, the guide is killed. New guides, and therefore new trajectories, are created from the peaks of the current frame that are not incorporated into trajectories by the existing guides. If there are killed or unused guides, a new guide can be started; it is created by searching through the "unclaimed" peaks of the frame for the one with the highest magnitude.
The peak continuation algorithm presented is only one approach to the peak continuation
problem. The creation of trajectories from the spectral peaks is compatible with very
different strategies and algorithms; for example, hidden Markov models have been applied. An Nth-order Markov model provides a probability distribution for a parameter in the current frame as a function of its values over the past N frames.
With a hidden Markov model we are able to optimize groups of trajectories according to
defined criteria, such as frequency continuity. This type of approach might be very
valuable for tracking partials in polyphonic sounds and complex inharmonic tones. In
particular, the notion of "momentum" is introduced, helping to properly resolve crossing
fundamental frequencies [3].
Additive Synthesis: PSMS
A short tutorial on additive synthesis was given in Chapter 2. The peak-continuation algorithm returns the values of the prominent peaks organized into frequency trajectories. Each peak is a triad (Ar^l, ωr^l, φr^l), where l is the frame number and r is the number of the track to which it belongs. The synthesis process takes these trajectories, or their modification, and computes one frame of synthesized sound s^l(t) as

    s^l(t) = Σ_{r=1}^{R_l} Ar^l cos(ωr^l t + φr^l),

where R_l is the number of trajectories present in frame l and S is the length of the synthesis frame. A synthesis frame is S samples long and does not correspond to an analysis frame: without time scaling, synthesis frame l goes from the middle of analysis frame l-1 to the middle of analysis frame l, i.e., it corresponds to the analysis hop size. The final sound s(t) results from the juxtaposition of all the synthesis frames [2].
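A minimal MATLAB sketch of this oscillator-bank synthesis for one frame is shown below, with linear interpolation of amplitude and frequency between the previous and current frame values. The variable names and the linear ramps are illustrative assumptions; in this project the frame-edge smoothing is instead handled by the cubic-spline stage described next.

    % Sketch: additive synthesis of one S-sample frame from trajectory parameters.
    % A0, F0 : amplitudes and frequencies (Hz) at the previous frame
    % A1, F1 : amplitudes and frequencies (Hz) at the current frame
    % phi    : running phase (radians) of each trajectory
    S = 512;  fs = 44100;
    t = (0:S-1)/S;                               % 0..1 ramp across the frame
    frameOut = zeros(1, S);
    for r = 1:numel(A1)
        A = A0(r) + (A1(r) - A0(r))*t;           % linear amplitude ramp
        F = F0(r) + (F1(r) - F0(r))*t;           % linear frequency ramp
        phase = phi(r) + 2*pi*cumsum(F)/fs;      % integrate frequency to obtain phase
        frameOut = frameOut + A .* cos(phase);
        phi(r) = phase(end);                     % carry the phase into the next frame
    end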
In an SMS scheme, a peak-interpolation strategy is usually followed to avoid clicks between frames. In this project we replace that peak interpolation with a smooth cubic-spline interpolation stage.
Cubic Spline Interpolation
In any SMS or MQ-synthesis scheme, the peak-continuation stage is followed by a peak-interpolation stage. Peak interpolation smoothly interpolates the amplitude, frequency and phase parameters of each sinusoidal track between frames, and it helps avoid sharp clicks at frame boundaries. A click or crack is an undesirable audio artifact and is normally removed; in digital audio restoration, undesirable noise includes clicks/cracks and steady-state hiss. Peak interpolation is thus an interpolation in the parameter domain. In this research project, however, we replace peak interpolation with a smooth cubic-spline interpolation method that smoothly interpolates the abrupt voltage levels across the frame edges. The usual approach in vinyl-restoration code is to first detect the click and then fix it; our implementation is simpler because the crack is known to exist only at the frame edges.
Figure 3.15: (Top) with cracks; (Bottom) without cracks
At the heart of most click-removal methods is an interpolation scheme that replaces missing or corrupted samples with estimates of their true value. It is usually appropriate to assume that clicks have in no way interfered with the timing of the material, so the task is to fill in the "gap" with appropriate material of identical duration to the click. This amounts to an interpolation problem that makes use of the good data values surrounding the corruption and possibly takes account of signal information buried in the corrupted sections of data. An effective technique will be able to interpolate gap lengths from one sample up to at least 100 samples at a sampling rate of 44.1 kHz [28].
Figure 3.16: Cubic spline interpolation
The process employed in this work is not fully automatic; the user has to specify or change a parameter depending on the sound. A specific number of points is selected around the crack center at the frame edge, i.e., a few points from the previous frame and a few points from the current frame, and these are smoothly interpolated using the cubic-spline technique. The number of points fed to the interpolation algorithm depends on the nature of the sound, in particular its wavelength and zero crossings. For a low-frequency sound a larger number may be specified, and for a high-frequency sound a smaller number, to arrive at a desirable output.
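The frame-edge repair can be sketched as below, where nSide is the user-set, sound-dependent number of samples taken on each side of the frame boundary; the names, values, and choice of anchor points are illustrative.

    % Sketch: smooth a crack at a known frame boundary with a cubic spline.
    % y : synthesized signal with a crack centred at sample index edgeIdx
    nSide = 8;                                   % user-chosen, depends on the sound
    idx   = (edgeIdx - nSide):(edgeIdx + nSide); % samples around the crack centre
    keep  = [idx(1:3), idx(end-2:end)];          % clean anchor points away from the crack
    y(idx) = spline(keep, y(keep), idx);         % re-interpolate across the boundary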
FUSION OF SENSITIVE AND LESS SENSITIVE SPECTRA
Once the PCM samples of the MF region have been demodulated back to the original frequency band, they are fused with the LF and HF bands synthesized from the parameters received over the channel. A model of the fusion is shown below.
Figure 3.17: Spectral Fusion (A more general model)
DISTORTION AT TRANSITION BAND EDGES
Distortion can sometimes be expected at the transition bands. These may be phase distortions or other types of distortion at the transition edges between frequency bands. They can be suppressed by applying the same filters used before the analysis, this time at the synthesis end before the spectral bands are fused to form the final output; this proves successful.
The Downsampling method
This is the second method, discussed after the two-filter method. The downsampling method is applicable only to very low-frequency sounds. According to the duplex theory of pitch perception, low frequencies have poor frequency resolution but better time resolution, while high frequencies have better spectral resolution and poor time resolution. Around 640 Hz, both temporal and pattern-recognition effects are sensed [7].
Figure 3.18: The downsampling method.
This is because the human auditory system follows time impulses in the LF region. The downsampling method targets two things. First, since the LF sinusoids have poor spectral resolution, the low-frequency sound is downsampled in time by a factor of two. This shifts all the tracks up by a factor of two on the frequency scale; the band-rejection filter must therefore be designed with cutoff frequencies twice those that rejected the sensitive mid band in the two-filter method. The pitch increases by a factor of two and the duration decreases by a factor of two, so the spectral resolution of the tracks is increased, which facilitates the peak-tracking process. The second target is data reduction, which is also achieved because the time is compressed by a factor of two when we downsample. Finally, before resynthesis, a modification factor of two is set and the frequency parameters are divided by two to recover the original time base. However, this method is applicable only to very low-frequency sounds, because the sampling rate of the signal must be taken into consideration before the frequency components are shifted. Moreover, the sound is comparatively defective if the modification algorithms are not robust. Although the downsampling method only fits low-frequency instruments such as bass guitar or kick drum, it can be used effectively in the four-filter method that follows.
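A minimal sketch of the idea follows, assuming the low-frequency band (called lowBand here) has already been isolated and is band-limited well below a quarter of the sampling rate.

    % Sketch: downsampling method for the low-frequency band.
    lfFast = lowBand(1:2:end);     % decimate by 2; safe because the band is narrow
    % Analysed at the original sampling rate, lfFast appears pitch-shifted up by a
    % factor of two and lasts half as long, so its partials are easier to track
    % and the analysis data are halved.
    % At the synthesis end, a time-scale factor of two is applied and the frequency
    % parameters are divided by two, recovering the original pitch and duration.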
Duplex theory of pitch perception: application to the four-filter method
Figure 3.19: High Spectral resolution Four-filter method
The four-filter PSMS makes use of the duplex theory of pitch perception discussed in Chapter 2. The dual purpose of the four-filter method is greater spectral resolution, for better tracking and resynthesis, and greater potential for data reduction. The four-filter method improves on the two-filter method in terms of both fidelity and data reduction. It also illustrates a way of analyzing signals more faithfully, in which the trade-off between time and frequency resolution can be compensated to some extent. According to the duplex theory of pitch perception, low frequencies have poor spectral resolution and high time resolution, whereas high frequencies have poor time resolution and high spectral resolution.
Figure 3.20: A schematic picture explaining how the analysis frame length is changed
over the frequency scale.
At a crossover frequency of about 640 Hz, both temporal and pattern-recognition resolution are present. Regardless of which domain requires high resolution, when we synthesize the sound we need spectral frequency parameters that are very close to the original; the spectral resolution (number of bins) therefore cannot be varied. Hence, a variable analysis frame length is employed to achieve a better PSMS system: a narrow analysis frame (35-45 ms) is set for low-frequency sounds and a wider analysis frame (100 ms) is set for high-frequency sounds. The downsampling method is optional because it works well only for low-frequency harmonic signals.
Figure 3.21: The four filter method
The input sound is filtered using four band-pass filters: DC to 0.7 kHz, 0.5 kHz to 1 kHz, 0.9 kHz to 6.1 kHz (mid), and 5.9 kHz to 20 kHz. Each filtered band except the mid band forms an input to the sinusoidal model. Overlapping frames of variable length are set before the pass-band signals are sent into the sinusoidal model: the low-frequency bands use a short analysis frame and the high-frequency band a somewhat longer one. This method is a bit more expensive, but the resolution is comparatively good and it is an effective tactic for enhanced data reduction.
Figure 3.22: (Top) High resolution (Bottom) Low spectral resolution
Analysis of spectral resolution of low frequency sounds (0-700 Hz)
Note: There is no zero padding involved in the above plots
However, the amount of parametric information sent through the channel remains nearly the same for the two-filter and four-filter methods: the two-filter method captures a large number of parameters from two wide spectral bands, whereas the four-filter method captures fewer parameters from each of three narrower spectral bands.
Discarding Phase in HF regions
In HF regions the ear follows the amplitude envelope of the frequency spectrum and disregards its phase content. This psychoacoustic evidence was used in our sinusoidal model by discarding the HF phase parameters before they are fed into the oscillator. In the four-filter method, the fourth filter removes the lower spectral details, leaving only high-frequency spectral energy. The amplitude and frequency parameters corresponding to this HF band are transmitted; the phase parameters are discarded because they have little perceptual importance in the auditory scene. The experiment was conducted with a cymbal-crash sound, and the results are shown in Figure 3.23.
Figure 3.23 (a): Synthesizing HF band with phase parameters
Figure 3.23 (b): Synthesizing HF band without phase parameters
Figure 3.24: Plots explaining the synthesis of different band that have various analysis
frame length and their final fusion (Four filter method)
Hence it was verified experimentally that phase carries little sonic meaning at high frequencies. In this way the four-filter method was improved by discarding data that would otherwise traverse the channel to the synthesis port needlessly. Experiments were conducted both on inharmonic instruments, such as a cymbal crash, and on harmonic instruments, such as a harmonium; the results clearly indicate that phase information is not perceived at higher frequencies. High-frequency phase parameters can therefore be discarded, reducing the high-frequency information through the channel by one third.
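In the sinusoidal model this simply means giving the HF oscillators an arbitrary starting phase instead of a transmitted one, as in the brief sketch below; the zero-or-random initial phase and the variable name FhfCurrent are assumptions, since the thesis only states that the HF phase parameters are not transmitted.

    % Sketch: synthesize the HF band without transmitted phase parameters.
    phiHF = 2*pi*rand(numel(FhfCurrent), 1);   % arbitrary starting phases (assumption)
    % The oscillator bank then integrates the transmitted HF frequency trajectories
    % exactly as before, so only amplitudes and frequencies cross the channel for this band.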
POSSIBILITIES OF MODIFICATION
Figure 3.25: Possibilities of Modifications
The most powerful feature of analysis-synthesis schemes is that the music can be modified after analysis. The modifications can serve different music-production and composition applications, such as traditional time/pitch-scale modification, reverberation, chorusing, and harmonizing. Our synthesis-based method also allows some modifications, although they are somewhat complex and involve more computation than other schemes. The sinusoidal parameters are modified with a modification instruction to obtain the desired effect in the LF and HF regions, while the mid spectrum is modified using a phase vocoder; the two sounds are then added together. The phase vocoder used here is a commercially available tool.
Cross Effects
Figure 3.26: Cross Effect using PSMS modifications
A unique modification possibility in this project is to introduce meaningful cross effects. For example, we may obtain a pitch-shift effect by modifying the sinusoidal parameters while applying a harmonizing effect to the mid spectral samples with a phase vocoder. In the example above, the mid spectrum is given a chorusing effect using a commercial phase vocoder, and the frequency parameters of the LF-HF sinusoids are multiplied by random numbers. The cross effect provides a meaningful "digital effect" as output.
Demonstration: Advantages over sine plus noise model
Fidelity and data reduction both improve when a traditional sinusoidal-plus-noise model is replaced with a purely sinusoidal model, as is done in this thesis (two-filter method). The residual noise generally consists of the musical excitation and other noisy components; in fact, SMS is sometimes used in denoising applications. While a purely deterministic model consists only of clean harmonic tracks, the residual may contain frequencies near the harmonics of the deterministic tracks, and these nearby harmonics contribute a great deal to the music. When these nearby harmonics are modeled with an LPC estimate, the user needs to send a number of LPC coefficients approaching the analysis frame length to obtain the best estimate and minimize the error, which becomes an additional burden on the channel. If fewer LPC coefficients are sent, the nearby harmonics are not modeled properly as stochastic noise when the LPC filter is excited with white noise.
In most real-world cases, the non-musical noise is concentrated in the LF or HF regions, as rumble (LF) or hiss (tape HF). In the two-filter method we send all the sensitive nearby harmonics as PCM samples, and only the less sensitive noise in the LF and HF bands is modeled stochastically. The resulting sound, comprising the mid-band nearby harmonics and musical noise plus the stochastic LF and HF noise, is better than that of a fully stochastic residual model. Moreover, the number of LPC coefficients also decreases by a large factor. This makes it a better model in all respects; a demonstration by plots follows.
Figure 3.27 (a-k): Demonstration of the disadvantages of the sine-plus-noise model in a two-filter method
CHAPTER 4
TESTS AND RESULTS
Tests were conducted to analyze the success of this research project, and the results were plotted. The two-filter and four-filter methods were analyzed qualitatively, and the bit-rate reduction and audio compression ratios were tabulated.
Qualitative analysis
Listening tests were conducted on 12 subjects; five had musically trained ears and the rest did not. The subjects were instructed to rate the quality of the output sound by comparing it to the input sound, and they were specifically asked to judge and take into account any possible distortions in the output. They rated the output on a scale from 1 to 5, 1 being worst and 5 being best; decimal values between these integers were allowed so that subjects could rate the sound as precisely as possible while making critical judgments. The qualitative results for the two-filter method are plotted below:
Figure 4.1: Qualitative Results: Music Genre: Two filter method
Figure 4.2: Qualitative Results: Tonal Instruments: Two filter method
Figure 4.3: Qualitative Results: Percussion Instruments: Two filter method
The qualitative results for the four filter method are plotted below:
Figure 4.4: Qualitative Results: Music Genre: Four filter method
Figure 4.5: Qualitative Results: Tonal Instruments: Four filter method
Figure 4.6: Qualitative Results: Percussion Instruments: Four filter method
Bit rate calculation
The bit rates for all of the above sounds were calculated for both the two-filter and four-filter methods. The mid-frequency sensitive signal that enters the channel as PCM samples was reduced by an audio compression ratio of 4:1; for example, a sound sampled at the CD rate of 44.1 kHz, 705 kbps (mono), is reduced to 176 kbps. The parameters (amplitudes, frequencies, phases) were converted to binary format to calculate the number of bits per second, and the kbps of the LF-HF sinusoidal parameters were added to the kbps of the sensitive PCM samples to find the total bit rate and the audio compression ratio. All audio files were mono.
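As a sketch of how such a bit rate can be tallied, assuming each of the amplitude, frequency, and phase values of a trajectory breakpoint is quantized to 16 bits (the thesis does not state the word lengths, so this and the example counts are assumptions):

    % Sketch: tally the transmitted bit rate for the two-filter method (example values).
    bitsPerParam = 16;                 % assumed word length per parameter
    nBreakpoints = 20000;              % trajectory breakpoints in the clip (example)
    durationSec  = 10;                 % clip length in seconds (example)
    psmsKbps = 3*bitsPerParam*nBreakpoints/durationSec/1000;  % amplitude + frequency + phase
    midKbps  = 44100*16/4/1000;        % mid-band PCM after 4:1 reduction (176.4 kbps)
    totalKbps = psmsKbps + midKbps;
    ratio = (44100*16/1000) / totalKbps;   % compression ratio relative to 705.6 kbps mono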
Rock, 44.1 kHz, 705 kbps
  Two-filter: 343.58 kbps (PSMS) + 176 kbps (mid) = 519.58 kbps (3:2)
  Four-filter: 23.194 kbps (LF1) + 19.405 kbps (LF2) + 176 kbps (mid) + 83.421 kbps (HF) = 302.02 kbps (2:1)
Classical, 44.1 kHz, 705 kbps
  Two-filter: 66.134 kbps (PSMS) + 176 kbps (mid) = 242.134 kbps (3:1)
  Four-filter: 12.635 kbps (LF1) + 13.632 kbps (LF2) + 176 kbps (mid) + 11.824 kbps (HF) = 214.091 kbps (3:1)
Country, 44.1 kHz, 705 kbps
  Two-filter: 101.32 kbps (PSMS) + 176 kbps (mid) = 277.32 kbps (2:1)
  Four-filter: 18.021 kbps (LF1) + 20.453 kbps (LF2) + 176 kbps (mid) + 23.735 kbps (HF) = 238.209 kbps (3:1)
Jazz, 44.1 kHz, 705 kbps
  Two-filter: 141.95 kbps (PSMS) + 176 kbps (mid) = 317.95 kbps (2:1)
  Four-filter: 14.318 kbps (LF1) + 15.04 kbps (LF2) + 176 kbps (mid) + 34.173 kbps (HF) = 239.531 kbps (3:1)
Speech, 38 kHz, 608 kbps
  Two-filter: 84.592 kbps (PSMS) + 152 kbps (mid) = 236.592 kbps (2:1)
  Four-filter: 14.23 kbps (LF1) + 14.21 kbps (LF2) + 152 kbps (mid) + 28.91 kbps (HF) = 209.35 kbps (3:1)
Gottuvadhyam, 44.1 kHz, 705 kbps
  Two-filter: 84.310 kbps (PSMS) + 176 kbps (mid) = 260.310 kbps (3:1)
  Four-filter: 18.515 kbps (LF1) + 20.238 kbps (LF2) + 176 kbps (mid) + 23.447 kbps (HF) = 238.2 kbps (3:1)
Pipa, 48 kHz, 768 kbps
  Two-filter: 61.479 kbps (PSMS) + 192 kbps (mid) = 253.479 kbps (3:1)
  Four-filter: 12.707 kbps (LF1) + 17.179 kbps (LF2) + 192 kbps (mid) + 13.539 kbps (HF) = 235.425 kbps (3:1)
Sitar, 44.1 kHz, 705 kbps
  Two-filter: 187.72 kbps (PSMS) + 176 kbps (mid) = 363.72 kbps (2:1)
  Four-filter: 17.865 kbps (LF1) + 18.578 kbps (LF2) + 176 kbps (mid) + 53.838 kbps (HF) = 266.281 kbps (3:1)
Flute, 44.1 kHz, 705 kbps
  Two-filter: 49.375 kbps (PSMS) + 176 kbps (mid) = 225.375 kbps (3:1)
  Four-filter: 14.146 kbps (LF1) + 13.365 kbps (LF2) + 176 kbps (mid) + 26.649 kbps (HF) = 230.16 kbps (3:1)
Violin, 44.1 kHz, 705 kbps
  Two-filter: 122.15 kbps (PSMS) + 176 kbps (mid) = 298.15 kbps (2:1)
  Four-filter: 15.694 kbps (LF1) + 14.016 kbps (LF2) + 176 kbps (mid) + 30.016 kbps (HF) = 235.726 kbps (3:1)
Piano, 24 kHz, 384 kbps
  Two-filter: 14.835 kbps (PSMS) + 96 kbps (mid) = 110.835 kbps (4:1)
  Four-filter: 5 kbps (LF1) + 3.464 kbps (LF2) + 96 kbps (mid) + 1.815 kbps (HF) = 106.279 kbps (4:1)
Acoustic Guitar, 32 kHz, 512 kbps
  Two-filter: 14 kbps (PSMS) + 128 kbps (mid) = 142 kbps (3:1)
  Four-filter: 5.364 kbps (LF1) + 3.697 kbps (LF2) + 128 kbps (mid) + 0.256 kbps (HF) = 137.317 kbps (4:1)
Electric Bass, 30 kHz, 480 kbps
  Two-filter: 23.276 kbps (PSMS) + 120 kbps (mid) = 143.276 kbps (3:1)
  Four-filter: 13.165 kbps (LF1) + 4.57 kbps (LF2) + 120 kbps (mid) + 0.388 kbps (HF) = 138.123 kbps (3:1)
Low Tom, 44.1 kHz, 705 kbps
  Two-filter: 26.929 kbps (PSMS) + 176 kbps (mid) = 202.929 kbps (3:1)
  Four-filter: 3.554 kbps (LF1) + 5.1 kbps (LF2) + 176 kbps (mid) + 6.155 kbps (HF) = 190.809 kbps (3:1)
High Tom, 44.1 kHz, 705 kbps
  Two-filter: 48.768 kbps (PSMS) + 176 kbps (mid) = 224.768 kbps (3:1)
  Four-filter: 5.572 kbps (LF1) + 3.99 kbps (LF2) + 176 kbps (mid) + 9.861 kbps (HF) = 195.423 kbps (3:1)
Closed Snare, 44.1 kHz, 705 kbps
  Two-filter: 354.21 kbps (PSMS) + 176 kbps (mid) = 530.21 kbps (3:2)
  Four-filter: 18.141 kbps (LF1) + 16.343 kbps (LF2) + 176 kbps (mid) + 76.732 kbps (HF) = 287.216 kbps (2:1)
Open Hi-hat, 44.1 kHz, 705 kbps
  Two-filter: 419.49 kbps (PSMS) + 176 kbps (mid) = 595.49 kbps (3:2)
  Four-filter: 3.843 kbps (LF1) + 1.645 kbps (LF2) + 176 kbps (mid) + 94.578 kbps (HF) = 276.066 kbps (3:1)
Mid-Tom, 44.1 kHz, 705 kbps
  Two-filter: 53.304 kbps (PSMS) + 176 kbps (mid) = 229.304 kbps (3:1)
  Four-filter: 6.532 kbps (LF1) + 3.91 kbps (LF2) + 176 kbps (mid) + 13.507 kbps (HF) = 199.949 kbps (3:1)
Cymbal Crash, 44.1 kHz, 705 kbps
  Two-filter: 431.965 kbps (PSMS) + 176 kbps (mid) = 607.965 kbps (3:2)
  Four-filter: 2.456 kbps (LF1) + 14.2 kbps (LF2) + 176 kbps (mid) + 89.151 kbps (HF) = 281.807 kbps (2:1)
Table 4.1: Bit rates and compression ratios for the two-filter and four-filter methods
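To make the bookkeeping behind Table 4.1 concrete, the following is a minimal sketch of the bit-rate calculation described above; it is illustrative only and is not the code used in this research. The function names are hypothetical, and the sketch assumes 16-bit mono PCM with the 4:1 reduction of the mid-band stream mentioned earlier, so its totals differ slightly from the table, which rounds the mid-band rate to 176 kbps.

```python
# Minimal sketch of the bit-rate bookkeeping described above (illustrative only,
# not the code used in this research). Assumes 16-bit mono PCM and a 4:1
# reduction of the sensitive mid-band PCM stream.

def mid_band_kbps(sample_rate_hz, bits_per_sample=16, pcm_reduction=4):
    """Bit rate (kbps) of the sensitive mid-frequency PCM stream after reduction."""
    return sample_rate_hz * bits_per_sample / pcm_reduction / 1000.0

def total_kbps(sample_rate_hz, sinusoid_param_kbps):
    """Total bit rate: reduced mid-band PCM plus all LF/HF parameter streams."""
    return mid_band_kbps(sample_rate_hz) + sum(sinusoid_param_kbps)

# Example: the "Rock" row of Table 4.1, four-filter method (LF1, LF2, HF rates).
rock_total = total_kbps(44100, [23.194, 19.405, 83.421])
rock_ratio = (44100 * 16 / 1000.0) / rock_total      # source rate is about 705 kbps
print(f"{rock_total:.2f} kbps, roughly {rock_ratio:.1f}:1")
```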
Earlier, in the third chapter, it was mentioned that one needs to search below the threshold of hearing when picking peaks, because peaks that are inaudible in the original signal may become audible after modifications. However, the primary focus of this research is data reduction, and modifications are of lesser significance. Therefore, when good bit-rate reduction is desired, the factor below the threshold of hearing should be set to zero.
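For reference, the following is a minimal sketch of peak picking with a configurable margin below the threshold of hearing; it is not the analysis code used in this research. It assumes a magnitude spectrum calibrated to approximate dB SPL and uses Terhardt's well-known approximation to the threshold in quiet; the function and parameter names (for example margin_db, standing in for the "factor below the threshold of hearing") are assumptions made for illustration.

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation to the absolute threshold of hearing in dB SPL
    (valid for f_hz > 0)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def pick_peaks(mag_db, freqs_hz, margin_db=0.0):
    """Indices of local spectral maxima lying no more than margin_db below the
    threshold in quiet. margin_db = 0 keeps only peaks above the threshold,
    the setting suggested above when data reduction is the primary goal."""
    threshold = threshold_in_quiet_db(freqs_hz) - margin_db
    peaks = []
    for k in range(1, len(mag_db) - 1):
        is_local_max = mag_db[k] > mag_db[k - 1] and mag_db[k] >= mag_db[k + 1]
        if is_local_max and mag_db[k] >= threshold[k]:
            peaks.append(k)
    return peaks
```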
CHAPTER 5
CONCLUSION
Analysis showed that the interpolation of voltage values between time frames was not completely successful, so the output quality was affected to a small extent. However, if a commercial SMS system were used, this partial-synthesis idea could improve on existing schemes in both data reduction and fidelity. On the other hand, the flexibility for modification is reduced by the partial synthesis, since only the less sensitive regions are resynthesized. The bit rates and compression ratios are listed in Table 4.1 of the previous chapter. Finally, the various features of this synthesis scheme are compared with conventional perceptual coding schemes in Table 5.1 below.
Future Extensions
The future extensions of this project include transient modeling of the low- and high-frequency sounds using the sines-plus-transients-plus-noise scheme described in [27]. The entire project could also be implemented for MIDI files. There is a dip at the high-frequency end of the Fletcher-Munson contours; this dip could be exploited in the same way the mid-frequency dip was used in this research. However, that dip is not common to all age groups, since older listeners often cannot hear the highest frequencies. Run-length coding could be applied to the parameters representing frequency tracks by transmitting (frequency parameter, run length) pairs, as mentioned in [23]; a small sketch of this idea follows below. Another approach would be to use loudness-dependent band-pass and band-rejection filters in the two-filter method, although this may not produce a large change in data reduction. Finally, a perceptual coder could be used to code the mid-frequency signal in the future.
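As a rough illustration of the run-length idea (a sketch only; [23] describes the actual scheme), quantized frequency-track values that repeat across consecutive frames can be replaced by (value, run length) pairs. The function names below are hypothetical.

```python
def run_length_encode(track):
    """Encode a frequency track (one quantized value per frame) as
    (value, run_length) pairs, as suggested for stable frequency tracks."""
    pairs = []
    for value in track:
        if pairs and pairs[-1][0] == value:
            pairs[-1][1] += 1          # extend the current run
        else:
            pairs.append([value, 1])   # start a new run
    return [(v, n) for v, n in pairs]

def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Example: a steady 440 Hz partial over four frames followed by a small glide.
track = [440, 440, 440, 440, 441, 442]
print(run_length_encode(track))   # [(440, 4), (441, 1), (442, 1)]
```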
PROS AND CONS: A COMPARATIVE STUDY
Bit allocation
  Perceptual audio coding: Dynamic and adaptive.
  PSMS: No bit pool; bits are assigned according to sensitivity in a single, external decision, with no adaptive techniques involved.
Bit rate
  Perceptual audio coding: Fixed bit rates.
  PSMS: Depends on the sound.
Psychoacoustic base point
  Perceptual audio coding: Loudness perception, with less reliance on pitch perception.
  PSMS: Pitch perception.
Channel noise
  Perceptual audio coding: No.
  PSMS: Possible, since the scheme is a communication system.
Applications
  Perceptual audio coding: Storage-space reduction, television and radio broadcast, satellite transmission, military and musical applications.
  PSMS: Data reduction, television and radio broadcast, satellite transmission, military and musical applications.
Commercial usage
  Perceptual audio coding: Used commercially today.
  PSMS: A possible future commercial product.
Table 5.1: Pros and cons: perceptual coding and PSMS
Instead of the SMS/MQ synthesis technique used here to synthesize the less sensitive LF-HF regions, other synthesis techniques could be tried to determine which one best fits the system.
A low-bit-rate coder could also be used to code the sensitive signal instead of the entire signal. This would allow the sensitive signal to be coded accurately while the less sensitive signals are synthesized using SMS. The synthesized signal and the low-bit-rate coded signal could then be summed, and the resulting quality evaluated.
BIBLIOGRAPHY
1. Fletcher, H., and W. A. Munson, "Loudness, its definition, measurement and calculation," Journal of the Acoustical Society of America 5 (1933): 82-108.
2. Xavier Serra, "A System for Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition" (Ph.D. diss., Stanford University, 1989).
3. Tuomas Virtanen, "Audio Signal Modeling with Sinusoids plus Noise" (M.S. thesis, Tampere University of Technology).
4. Jean Laroche and Mark Dolson, "New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects," Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, New York, 1999): 17-20.
5. R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 4 (1986): 744-754.
6. Albert Bregman, Auditory Scene Analysis (Cambridge: MIT Press, 1994).
7. Perry R. Cook, Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics (Cambridge: MIT Press, 2001).
8. Ken C. Pohlmann, Principles of Digital Audio, 4th ed. (New York: McGraw-Hill, 2000).
9. Curtis Roads, The Computer Music Tutorial (Cambridge: MIT Press, 1996).
10. T. H. Andersen and K. Jensen, "Importance and representation of phase in the sinusoidal model," Journal of the Audio Engineering Society 52, no. 11 (2004): 1157-1169.
11. X. Serra and J. Smith, "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition," Computer Music Journal 14, no. 4 (1990): 12-24.
12. Website: http://www.mmk.ei.tum.de/persons/ter.html
13. E. Terhardt, G. Stoll, and M. Seewann, "Algorithm for extraction of pitch and pitch salience from complex tonal signals," Journal of the Acoustical Society of America 71 (1982): 679-688.
14. Ernst Terhardt, "Calculating virtual pitch," Hearing Research 1 (1979): 155-182.
15. Kelly Fitz, "The Reassigned Bandwidth-Enhanced Method of Additive Synthesis" (Ph.D. diss., University of Illinois at Urbana-Champaign, 1999).
16. Eric D. Scheirer and Barry L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language," Computer Music Journal 23, no. 2 (1999): 31-51.
17. Brian C. J. Moore, An Introduction to the Psychology of Hearing (San Diego: Academic Press, 2003).
18. R. Plomp, "Pitch of complex tones," Journal of the Acoustical Society of America 41, no. 6 (1967): 1526-1533.
19. X. Rodet and P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis," 93rd Convention of the Audio Engineering Society (San Francisco, 1992).
20. R. J. McAulay and T. F. Quatieri, "Speech Transformations Based on a Sinusoidal Representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 6 (1986): 1449-1464.
21. A. Eronen and A. Klapuri, "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000) (Istanbul, 2000): 753-756.
22. J. B. Allen and S. T. Neely, "Modeling the relation between the intensity just noticeable difference and loudness for pure tones and wideband noise," Journal of the Acoustical Society of America 102, no. 6 (1997).
23. Sylvain Marchand, "Compression of sinusoidal modeling parameters," Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, 2000): 273-276.
24. T. H. Andersen and K. Jensen, "Phase models in analysis/synthesis of voiced sounds," in Proceedings of the DSAGM (Copenhagen, 2001).
25. Charles Dodge and Thomas A. Jerse, Computer Music: Synthesis, Composition, and Performance, 2nd ed. (New York: Prentice Hall).
26. Thomas Quatieri, Discrete-Time Speech Signal Processing (New Jersey: Prentice Hall, 2001).
27. T. Verma and T. Meng, "Extended spectral modeling synthesis with transient modeling synthesis," Computer Music Journal 24, no. 2 (2000): 47-59.
28. Simon J. Godsill and Peter J. W. Rayner, Digital Audio Restoration (New York: Springer, 1998).
APPENDIX:
Sample plots obtained from experiments
Rock music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Country music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Speech, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Chinese Pipa, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Gottuvadhyam and Tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Sitar and Tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps