ANALYSIS OF SPEECH AT DIFFERENT SPEAKING RATES USING EXCITATION SOURCE INFORMATION
by
SRI HARISH REDDY MALLIDI
200431008
Master of Science (by Research)
in
Electronics and Communication Engineering
Speech and Vision Lab.
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad, India
To my parents, friends and guide
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Analysis of Speech at Different
Speaking Rates Using Excitation Source Information” by Sri Harish Reddy Mallidi
(200431008), has been carried out under my supervision and is not submitted elsewhere
for a degree.
Date Adviser: Prof. B. Yegnanarayana
Acknowledgements
I would like to express my deepest respect and most sincere gratitude to Prof. B. Yegnanarayana for his constant guidance and encouragement at all stages of my work. I am fortunate to have had numerous technical discussions with him, from which I have benefited enormously. I thank him for the excellent research environment he has created for all of us to learn.
I thank my thesis committee members, Prof. P. R. K. Rao and Dr. Sudhir Madhav Rao, for sparing their valuable time to evaluate the progress of my research work. I am thankful to Dr. Suryakanth, Mr. Kishore Prahallad and Dr. Rajendran for their immense support and help throughout my research work. I am thankful to them for all the invaluable advice on both technical and nontechnical matters. I thank my senior laboratory members for all the cooperation, understanding and help I received from them. I will forever remember the wonderful time I had with my friends.
Needless to mention the love and moral support of my family. This work would not
have been possible but for their support. Finally, I would like to dedicate this thesis to my
family, and my guide, Prof. B. Yegnanarayana.
Sri Harish Reddy Mallidi
Abstract
When humans modify speaking rate, they do not perform a simple expansion or compression of the speech signal. In order to maintain the intelligibility and naturalness of the speech, they modify some of the characteristics of the speech production mechanism in a complex way. This causes the acoustic features extracted from the speech signal to change in a complex way. These changes affect the performance of speech systems such as speech recognition and speaker recognition. Most of the studies on the effect of speaking rate on acoustic features focus on features at the segmental and suprasegmental levels. The present work focuses on analysis of the effects of speaking rate on features at the subsegmental level. Three subsegmental features, namely, instantaneous fundamental frequency, strength of excitation at epoch and perceived loudness, are chosen, and their variation with speaking rate is studied.
It was observed that the instantaneous fundamental frequency increases with an increase in speaking rate, but when the speaking rate is decreased, the change in the instantaneous fundamental frequency is speaker-specific. Similar observations were made for the strength of excitation at epoch: the strength of excitation decreases with an increase in speaking rate, and its change is speaker-specific when the speaking rate is decreased. The effect of speaking rate on the perception of loudness is also studied through perceptual loudness tests. It was observed that fast speech was perceived as louder than normal speech for the majority of speakers, whereas the difference between the perceived loudness of normal and slow speech is speaker-specific. It was also observed that speaking rate does not have a significant effect on an objective loudness measure. A modified measure of loudness for speech at different speaking rates is proposed, and its variations correlate with the results of the perceptual loudness tests.
The variations of subsegmental level features with speaking rate are incorporated in a non-uniform duration modification method. Subjective studies on the synthesized speech showed that incorporating the subsegmental variations improved the quality of speech at different speaking rates.
Keywords: Speaking rate, spontaneous speech, segmental features, instantaneous funda-
mental frequency, strength of excitation, perceived loudness and duration modification.
Contents

Abstract
List of Tables
List of Figures
1 Introduction
  1.1 Sources of speaking rate variation
  1.2 Spontaneous speech - a case study
    1.2.1 Speaking rate variation in spontaneous speech
    1.2.2 Effect on the performance of speech systems
  1.3 Objective and scope of the work
  1.4 Organization of the thesis
2 Speaking rate - A Review
  2.1 Studies based on articulatory dynamics
  2.2 Studies based on acoustic features
    2.2.1 Suprasegmental level
    2.2.2 Segmental level
3 Effect of speaking rate on excitation source features
  3.1 Extraction of excitation source features
  3.2 Speech material
  3.3 Variation of instantaneous F0
  3.4 Variation of strength of excitation
  3.5 Summary
4 Effect of speaking rate on loudness
  4.1 Perceptual evaluation of loudness of speech at different speaking rates
  4.2 Extraction of loudness measure from the Hilbert envelope of the LP residual
  4.3 Variation of loudness measure
  4.4 Proposed measure of perceptual loudness
  4.5 Summary
5 Incorporation of excitation source variations in synthesis of speech at different speaking rates
  5.1 Variation of durations of voiced, unvoiced and silence
    5.1.1 Identification of voiced, unvoiced and silence regions
    5.1.2 Duration variations
  5.2 Synthesis of speech at different speaking rates
  5.3 Evaluation of synthetic speech at different speaking rates
  5.4 Conclusions
6 Summary and Conclusions
  6.1 Summary of the work
  6.2 Major contributions of the work
  6.3 Directions for future work
List of Publications
References
List of Tables

5.1 Percentage deviation in the durations of speech segments for different speaking rates
5.2 Slopes (m) of the straight lines that fit the clusters shown in Fig. 5.3, and root mean squared error (RMSE) of the fits, to illustrate the variation in duration modification of different units
5.3 Mean opinion scores and AB ranking test (in %)
List of Figures

1.1 Sources of speaking rate variation [13]
1.2 Mean subtracted histograms of syllable rate for spontaneous speech
1.3 Mean subtracted histograms of syllable rate for read speech
3.1 (a) Segment of a speech signal from the VOQUAL'03 database [57], (b) zero-frequency filtered signal, (c) differenced EGG signal with epoch locations marked by arrows, (d) strength of excitation calculated from the zero-frequency filtered signal, and (e) contour of instantaneous F0
3.2 Distribution of the measure of speaking rate for fast, normal, and slow utterances
3.3 Distributions of instantaneous F0 for 4 male speakers. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
3.4 Variation of instantaneous F0 with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
3.5 Cumulative probability density function of the difference between mean values of instantaneous F0 of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
3.6 Distributions of strength of excitation (ε) for 4 male speakers. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
3.7 Variation of strength of excitation (ε) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
3.8 Cumulative probability density function of the difference between mean values of strength of excitation at epoch (ε) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
4.1 Evaluation of the effect of speaking rate on the perception of loudness. Results of comparison of (a) fast speech and normal speech, and (b) normal speech and slow speech
4.2 (a) Segment of a speech signal from the VOQUAL'03 speech database [57], (b) 12th order LP residual, (c) Hilbert envelope of the LP residual, and (d) contour of η extracted from the Hilbert envelope
4.3 Overlapping segments of the Hilbert envelope of the LP residual in the vicinity of epoch locations for (a) soft, (b) normal, and (c) loud utterances [63]
4.4 Distributions of loudness measure (η) and proposed measure (ηp) of perceptual loudness for two male speakers, shown in (a), (c) and (b), (d), respectively. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
4.5 Variation of loudness measure (η) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the intra-class and inter-class comparisons, respectively, for the slow vs normal case
4.6 Variation of proposed measure (ηp) of perceptual loudness with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
4.7 Cumulative probability density function of the difference between mean values of the proposed loudness measure (ηp) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
5.1 Illustration of voiced-nonvoiced discrimination using the zero frequency filtered signal. (a) Segment of a speech signal, (b) zero frequency filtered signal, (c) energy of the zero frequency filtered signal, and (d) binary voiced-nonvoiced signal
5.2 Illustration of unvoiced-silence discrimination using the output of the resonator H(z) (equation 5.3). (a) Segment of a speech signal, (b) output of the resonator H(z), (c) energy of the filtered signal, and (d) binary unvoiced-silence signal
5.3 Scatter plots of duration modification factors of a region vs the corresponding utterance. (a), (b) and (c) show the scatter plots of αv vs αS, αu vs αS, and αp vs αS, respectively, in normal to fast conversion. Similarly, (d), (e) and (f) show the scatter plots in normal to slow conversion. In each plot, the straight line that fits the cluster is shown by a solid line
Chapter 1
Introduction
Information conveyed through a speech signal can be classified as linguistic, paralinguistic, and extralinguistic [1]. Linguistic information is related to the language associated with the speech. Paralinguistic information in the speech signal is nonlinguistic and nonverbal. It refers to information about the speaker's current attitudinal and emotional state. It is similar to linguistic information in the sense that ‘both the speaker and the listener are aware of the intended message (communicative behavior)’. It is dissimilar to linguistic information in the sense that ‘it is not necessarily obvious to all human perceivers on some universal basis. It is particular to the culture of the speaker, and its conventional interpretation must be learned’. Extralinguistic information refers to information present in the speech signal, ‘of which the speaker is not aware but the listener perceives it (informative behavior)’ [1]. The identity of the speaker and habitual factors such as voice quality are some of the ingredients that constitute extralinguistic information. Advances in the fields of automatic speech recognition and speech synthesis have helped in estimating and reproducing the linguistic information present in natural speech. However, extraction of the paralinguistic and extralinguistic information from the acoustic speech signal is a challenging task.
Speaking rate is one such paralinguistic feature, which has an important role in human speech communication. It is a quantity proportional to the speed at which the speaker produces speech. Speaking rate is a characteristic of both the speaker and the language, and can be changed consciously or subconsciously by the speaker. Deviation from the normal speaking rate can easily be perceived by listeners. The objective of this work is to analyze the variations of acoustic features with changes in speaking rate, and to incorporate these variations in the synthesis of speech at different speaking rates.
In order to emphasize the significance of speaking rate in human-to-human speech communication, situations that cause the speaking rate to change are briefly described in section 1.1. In section 1.2, a study of the variations of speaking rate in human-to-human conversation and a review of the effect of speaking rate variations on speech systems are presented.
1.1 Sources of speaking rate variation
Several studies have shown that the emotional state of the speaker influences the speaking rate [2, 3]. It was observed that emotions like anger, fear, rage and happiness are associated with high speaking rates, and emotions like boredom, sadness, sorrow and grief are associated with low speaking rates. Speaking rate was observed to correlate strongly with the activeness of the speaker, i.e., the more active the speaker, the higher his/her speaking rate [4, 5]. Similar to emotion, stress was observed to influence the speaking rate [6]. It was observed that high cognitive workload leads to a faster speaking rate [6]. Read speech can be seen as a speech mode in which the ideas to be expressed are completely prepared and formulated before speech production starts. In contrast, in spontaneous speech the formulation process and speech production are simultaneous. It was observed that speaking rate is higher in read speech than in spontaneous speech [7]. The difference in the speaking rates of prose and poetry was observed in [8]. Readings of prose were observed to have higher speaking rates than readings of poetry. The effect of speaking rate on the perception of the competence and benevolence of the speaker was analyzed in [9]. It was observed that nonnormal (fast or slow) speaking rates are associated with competence and a normal speaking rate is associated with benevolence [5].
Frequently speakers unconsciously adapt their speaking rate to their dialogue partner.
It was observed that the information processing ability of the dialogue partner influences
the speaking rate [10]. For example, in a conversation involving infants, elderly people and persons with hearing difficulties, the speakers tend to reduce the speaking rate.

[Fig. 1.1: Sources of speaking rate variation [13]. The diagram relates speaking rate to age, gender, attitude/competence, cultural background, discourse structure, word frequency, emotion, stress, spoken text type, habitual speech rate, language proficiency, dialogue partner, speech planning, information structure, and speech and hearing impairments.]
Speech in noisy conditions is less intelligible than in normal conditions. It was observed that listeners prefer slow speech to fast speech in noisy conditions [11]. The language proficiency of the listener was also observed to affect the perception of speaking rate: learners of a foreign language feel that native speakers use an exceptionally rapid speaking rate [12]. Along with the above mentioned causes, the speaking rate was observed to be influenced by various other factors such as age, gender, and speech and hearing impairments. A comprehensive depiction of the sources of speaking rate variation is shown in Fig. 1.1. The examples presented in this section illustrate the great range of situations and conditions in which a change in speaking rate can occur.
1.2 Spontaneous speech - a case study
Speech produced during human-to-human conversation is rich in paralinguistic information. Even a small deviation from an expected pronunciation is an information-bearing element. Most speech systems assume that the users articulate clear, grammatically correct utterances with orthodox pronunciation. When such a system is exposed to human-to-human speech (spontaneous speech in the present context), the performance degrades greatly. Also, development of systems that can handle human-to-human speech is desirable because of the availability of large amounts of spontaneous speech (in forms such as lecture videos, telephone conversations and news stories). It was observed that speaking rate variation is an important factor responsible for the degradation in the performance of automatic speech recognition (ASR) systems [14]. In this section, the variation of speaking rate in spontaneous speech is studied, and a review of studies on the effect of speaking rate on the performance of speech systems (automatic speech recognition systems in particular) is presented.
1.2.1 Speaking rate variation in spontaneous speech
In order to study the variations of speaking rate in spontaneous speech, speech extracted from classroom lectures was chosen. The speech signals were collected from four speakers using a close-speaking microphone and were sampled at 16000 Hz. The total duration of the speech signals is six hours, with all speakers contributing equally.
General measurement methods
There have been two major approaches to measuring speaking rate, each with its advantages and limitations. The first uses a discrete categorization, such as fast, normal and slow, to describe the speaking rate [15]. Such perceptually chosen classes have been used in applications such as acoustic model selection [16, 17] and HMM normalization [18] in ASR. Even though it matches human intuition, the boundaries between the three categories are fuzzy. Most of the time, human knowledge is required to set the boundaries, and hence it is difficult to devise a completely automated engineering solution. In the second approach, speaking rate is measured quantitatively by counting the number of phonetic elements per second. Words, syllables [16], stressed syllables, and phonemes [19] are all possible candidates. Syllables are a popular choice because of the robustness of syllable boundary estimation and the close resemblance to the
[Fig. 1.2: Mean subtracted histograms of syllable rate for spontaneous speech. Panels (a)-(d) show normalized frequency vs τ − µτ for the four speakers, with σ = 0.73, 0.66, 0.82 and 0.70, respectively.]
speech production characteristics [16, 20, 21]. In the present study, the number of syllables per second was chosen as the quantitative metric for measuring speaking rate. Identification of syllable boundaries was performed using the syllable detection algorithm proposed by Mermelstein [22].
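The syllable-rate computation described above can be sketched in code. Mermelstein's algorithm recursively splits a convex hull fitted over the loudness contour; the function below is a deliberately simplified stand-in that counts short-time energy peaks as syllable nuclei. The function name, frame sizes and threshold are illustrative assumptions, not the implementation used in the thesis.

```python
import numpy as np

def syllable_rate(signal, fs, frame_ms=20, hop_ms=10, rel_thresh=0.3):
    """Crude syllable-rate estimate (syllables/s): count peaks of the
    short-time energy contour that rise above a relative threshold.
    A simplification of convex-hull based syllable nucleus detection."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # short-time energy contour
    energy = np.array([np.sum(signal[i:i + frame] ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    if energy.max() == 0:
        return 0.0
    energy = energy / energy.max()
    # count local maxima above the threshold as syllable nuclei
    peaks = 0
    for i in range(1, len(energy) - 1):
        if (energy[i] > rel_thresh
                and energy[i] >= energy[i - 1]
                and energy[i] > energy[i + 1]):
            peaks += 1
    return peaks / (len(signal) / fs)
```

On a clean signal with well separated energy bursts this count approximates the number of syllable nuclei; on real lecture speech a loudness contour and hull-splitting step, as in [22], would be needed.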
Speaking rate variation
The variations of syllable rate (τ) were analyzed to illustrate the variations of speaking rate in spontaneous speech. Histograms of syllable rate were computed for two types of speech, namely, (i) spontaneous speech and (ii) read speech. Fig. 1.2 shows the histograms
[Fig. 1.3: Mean subtracted histograms of syllable rate for read speech. Panels (a)-(d) show normalized frequency vs τ − µτ for the four speakers, with σ = 0.51, 0.21, 0.44 and 0.33, respectively.]
corresponding to spontaneous speech of four different speakers. Each plot in Fig. 1.2 corresponds to one speaker. Similarly, Fig. 1.3 shows the histograms corresponding to read speech of the four speakers. All the histograms are mean subtracted in order to emphasize the variations rather than the absolute value of the syllable rate. From Figs. 1.2 and 1.3, it can be observed that the histograms corresponding to spontaneous speech are broader than those corresponding to read speech. This can also be seen from the values of standard deviation (σ) of the histograms: the values of σ for spontaneous speech are greater than those for read speech for all the speakers. From these observations it can be inferred that the variations of speaking rate are higher in spontaneous speech than in read speech.
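The comparison above can be reproduced in a few lines. The sketch below mean-subtracts per-utterance syllable rates and compares the standard deviations of the two speech types; the rate values are invented for illustration and are not the thesis's data.

```python
import numpy as np

def spread_of_syllable_rate(rates):
    """Mean-subtract per-utterance syllable rates and return
    (centered rates, standard deviation). The standard deviation
    quantifies how much the speaking rate varies around the
    speaker's own mean, independent of its absolute value."""
    rates = np.asarray(rates, dtype=float)
    centered = rates - rates.mean()
    return centered, float(rates.std())

# invented per-utterance syllable rates (syllables/s), for illustration only
spontaneous = [3.1, 4.8, 2.5, 5.2, 3.9, 4.4]
read = [4.0, 4.2, 3.9, 4.1, 4.0, 4.3]
_, sigma_spontaneous = spread_of_syllable_rate(spontaneous)
_, sigma_read = spread_of_syllable_rate(read)
# spontaneous speech yields the broader (higher-sigma) distribution
```

Plotting histograms of the centered rates, as in Figs. 1.2 and 1.3, makes the difference in spread visible directly.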
1.2.2 Effect on the performance of speech systems
In most speech systems, the models developed during the training phase are built using speech data which is well articulated, at an almost constant speaking rate, without any mispronunciations. Examples of such databases, which are widely used in building speech systems like ASR and speaker recognition, are TIMIT [23], CMU-ARCTIC [24], etc. When such a system is exposed to spontaneous speech, the performance degrades significantly [14]. This section reviews the studies on the effect of speaking rate on the performance of speech systems. Siegler et al. [21] studied the relation between the word error rate of an ASR system and speaking rate variations. Three different speaking rate metrics were used:
(i) Word rate, defined as number of words per minute.
(ii) Phone rate, defined as number of phones per second.
(iii) Phone rate percentile, defined as the cumulative distribution function of the observed phone duration [21].
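Assuming a time-aligned transcription is available (word count, phone count, per-phone durations, and a reference set of phone durations from training data), the three metrics can be sketched as follows. The function names and the mean-percentile interpretation of metric (iii) are illustrative assumptions, not taken from [21].

```python
import numpy as np

def word_rate(n_words, duration_s):
    """Metric (i): words per minute."""
    return 60.0 * n_words / duration_s

def phone_rate(n_phones, duration_s):
    """Metric (ii): phones per second."""
    return n_phones / duration_s

def phone_rate_percentile(phone_durations, reference_durations):
    """Metric (iii), one plausible reading: mean percentile of the
    observed phone durations under the reference duration CDF
    (longer phones -> higher percentile -> slower speech)."""
    ref = np.sort(np.asarray(reference_durations, dtype=float))
    pct = [np.searchsorted(ref, d, side='right') / len(ref)
           for d in phone_durations]
    return float(np.mean(pct))
```

For example, an utterance with 150 words in one minute gives a word rate of 150 words/min, and 30 phones in 10 s gives a phone rate of 3 phones/s.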
The database used was the Wall Street Journal corpus [25]. It was observed that the word error rate increased significantly when the word rate deviated from the mean word rate by more than two standard deviations. Similarly, it was observed that when the phone rate deviated from the mean phone rate by more than one standard deviation, the word error rate increased significantly. Word recognition error of a large vocabulary continuous speech recognition (LVCSR) system for fast speakers was analyzed in [26]. It was observed that the error rate increased by a factor of two to three for the speaker with the highest speaking rate. The possible causes are inherent spectral differences, phone omission and duration reduction [26]. These studies showed that changes in speaking rate impact the performance of speech systems. Therefore, careful adaptation to changes in speaking rate is essential for the performance of speech systems like speech recognition and speaker recognition. Also, adapting the speaking rate to the listener's choice or convenience improves the naturalness of speech synthesis systems.
1.3 Objective and scope of the work
The objective of this work is to analyze the variations of acoustic features with speaking rate. Acoustic features related to the excitation source are used. In order to analyze the variations, speech utterances are collected from various speakers at different speaking rates, namely, fast, normal and slow. The variations of the acoustic features of speech at nonnormal (fast or slow) speaking rates are analyzed relative to those of speech at the normal speaking rate. The effect of speaking rate on the perception of loudness is analyzed by conducting a perceptual loudness test and by analyzing the variations of an acoustic loudness measure. The observed variations are incorporated in a non-uniform duration modification method to synthesize speech at different speaking rates.
1.4 Organization of the thesis
The contents of this thesis are organized as follows:
In chapter 2, we review the studies on the effect of speaking rate variations on acoustic
features.
In chapter 3, we analyze the variations of two features related to excitation source at
different speaking rates. We observe and quantify the difference between the distributions
of the two features.
In chapter 4, we analyze the effect of speaking rate on the perception of loudness. A
perceptual loudness test on speech at different speaking rates was conducted. The results
from the perceptual tests are compared with variations of a loudness measure extracted
from speech signals.
In chapter 5, a method to synthesize speech at different speaking rates is proposed. Variations in the acoustic features and in the durations of some of the sound units are incorporated in a duration modification method.
In chapter 6, we summarize the contributions of the present work, and highlight some
issues arising out of the study.
Chapter 2
Speaking rate - A Review
Depending on the type of data used for analysis, studies on the effects of speaking rate on the speech production mechanism can be broadly classified into two categories: (1) studies based on articulatory dynamics, and (2) studies based on features derived from the acoustic speech signal.
2.1 Studies based on articulatory dynamics
In order to produce a particular type of speech sound, the articulators must follow a particular sequence of movements. Changing the speaking rate involves producing an acoustic output with the same linguistic information in a shorter/longer duration. When the speaking rate is changed, the articulators follow an almost similar path, but in a shorter/longer duration. In order to accommodate both the durational changes and the linguistic information, as mentioned in [27], the speaker ‘may vary the spatial magnitude of articulatory movements [28, 29, 30, 31], or may adjust the speed of transition between successive targets [32, 33], or may modify the overlap between successive articulatory gestures by modifying their phrasing [34, 35, 36]’. ‘These changes may not be mutually exclusive, and can interact with each other’ [37, 33, 38]. The effect of speaking rate variations on the dynamics of the articulators of the vocal tract has been studied extensively in the literature [39, 40, 41, 42]. The effect of speaking rate on tongue movements was studied
using electromyography (EMG). It was observed that the EMG activity associated with tongue body movements during vowel production decreased during fast speech, while the activity associated with the production of both labial and alveolar stop consonants increased with an increase in the speaking rate [40, 43]. The decrease in activity implies either a decrease in the articulatory displacement, or a decrease in the speed of the articulatory movements, or both [40, 44]. The dynamics of six articulators (jaw, both lips, tongue tip, blade, and dorsum) were studied in [41] for different speaking rates. The articulatory data was acquired using an electromagnetic midsagittal articulograph (EMA). Close examination of the EMA data showed that the shapes of the articulatory trajectories became more complex in the case of slow speech [41]. This implies that articulatory trajectories are partially influenced by the speaking rate. It was shown that the tongue diverged less from the ‘centroid’ or ‘rest’ position at fast speaking rates than at normal speaking rates [42]. Variation in the dynamics of the articulators with speaking rate was observed to depend on the type of articulator [42]. It was observed that the variation in tongue movement was the greatest of all the articulatory movements [42].

The above mentioned studies showed that speaking rate affects the dynamics of the articulators. However, in practice, acquisition of data related to articulatory movements is difficult. Hence, analysis of speaking rate using the acoustic speech signal is preferred.
2.2 Studies based on acoustic features
2.2.1 Suprasegmental level
Suprasegmental features used for studying the effect of speaking rate are related to discourse prosody, pauses, and pitch accents. The features related to discourse prosody are the sizes (measured in terms of number of syllables) and durations of syllables (SYLL), prosodic words (PW), minor phrases (MIP), intonation phrases (IP), prosodic groups (PG), turns (TN), and discourse (DI) [45]. Studies showed that the durations of PW were significantly different for the three speaking rates (slow, normal, and fast), while the durations of the other discourse prosody features changed very little. It was observed that the sizes of the IP and MIP were affected by speaking rate, whereas the sizes of PW changed very little with the speaking rate [45]. It was shown that speaking rate affected the characteristics of pauses significantly [46, 47]. These studies showed that the number of pauses increased when the speaking rate decreased. It was also observed that changes in the durations of the pauses were speaker-dependent. It was shown that there was no significant change in the average duration of pauses among different speakers [47]. The effect of speaking rate on the number of prosodic breaks and on the characteristics of pitch accents (number and type) was studied in [46]. It was observed that fast speech had fewer prosodic breaks than normal and slow speech. Observation of the characteristics of pitch accents showed that the number of pitch accents was low for fast speech and high for slow speech. Also, fast speech showed a more monotonal than bitonal characteristic, whereas slow speech showed a more bitonal characteristic. This may be due to the simplicity of monotonal speech compared to bitonal speech.
2.2.2 Segmental level
Segmental features observed for studying the effects of speaking rate include voice onset
time (VOT) [48, 49, 50, 51, 52], durations of different sound units [53, 34], and spectral
features [39]. Studies showed a systematic increase/decrease in syllable durations when
the speaking rate was decreased/increased [53]. Changes in the syllable duration are
due to changes in the duration of the vowel and that of the VOT of the syllable. This
observation is consistent among many studies. Some studies observed that the amount
of increase/decrease in VOT was speaker-specific [52]. Variation of durations of vowels,
consonants, and transition regions was studied in [39]. Of the three types, the durations
of vowels showed the most change. It was observed that the variation of duration of
transition regions with the speaking rate in a consonant-vowel (CV) context depended
on the type of the consonant, and that it was directly proportional to the variation in the
duration of the syllable [49]. Variation of transition durations of two syllables/ba/ and
/wa/ was analyzed with speaking rate changes [49]. Change in transition duration of the
syllable/wa/ was greater than that in the syllable/ba/. Increase in the syllable duration
of /ba/ was almost entirely due to increase in the post-transition region, whereas for/wa/
the increase in the syllable duration was also due to an increase in the transition duration
[49]. Variation of durations of vowels, pauses, consonants, and transition regions for
Hindi language was studied in [54]. It was observed that the durations of vowels and
pauses varied significantly, whereas the durations of consonants and transition regions
varied very little. Formant frequencies have also been observed for studying the effects
of speaking rate. Studies showed that formant frequencies in stable regions were not
affected much by the speaking rate. But the formant frequencies in the transition regions
(i.e., onset frequency of the formant transition) varied significantly with speaking rate
[39]. It was shown that the F2 onset frequency (in CV context) was closer to the vowel
midpoint frequency in fast speech than in slow speech [39]. The rate of change of F2,
computed as (F2mid − F2onset)/T, where T is the duration of the transition region, was
observed to change significantly with speaking rate, although the amount of change was
observed to be speaker-dependent.
Speaking rate changes the acoustic features of production nonuniformly at various
levels. These changes occur at the subsegmental level (less than a pitch period) as well,
which is mostly due to the excitation of the vocal tract system. For producing natural-sounding
synthetic speech at different speaking rates, the overall duration and the segmental
level variations need to be captured and incorporated during synthesis [55, 54]. Like-
wise, variations of the fundamental frequency and changes in the formants of sound units
also need to be incorporated during the synthesis. In addition, changes in the excitation
characteristics at the subsegmental level may also influence the quality of the synthesized
speech.
Chapter 3
Effect of speaking rate on excitation
source features
In this chapter, we present a study of the effect of speaking rate on features related to the
excitation source. Two acoustic features related to the excitation source are estimated from
the speech signal. This is achieved by removing the effect of the vocal tract system on the
speech signal. Intra-speaker variations in the excitation source features, caused by changes
in speaking rate, are analyzed. In Section 3.1, we discuss the method of extracting the
excitation source features from the speech signal. In Section 3.2, the speech utterances
recorded at various speaking rates are described. Sections 3.3 and 3.4 discuss the variations
of the two excitation source features.
3.1 Extraction of excitation source features
Features of excitation source can be extracted from the speech signal by removing the
influence of the vocal tract system on the acoustic signal. Three features of excitation
source are used in this study. Two of them are obtained from the zero-frequency filtering
of the speech signal [56]. The third feature is derived from the Hilbert envelope of the lin-
ear prediction (LP) residual. The zero-frequency filtered signal and the Hilbert envelope
of the LP residual are obtained by processing the speech signal to reduce the influence of
Fig. 3.1: (a) Segment of a speech signal from the VOQUAL'03 database [57], (b) zero-frequency filtered signal, (c) differenced EGG signal and epoch locations marked by arrows, (d) strength of excitation calculated from the zero-frequency filtered signal, and (e) contour of instantaneous F0.
the vocal tract system. Extraction of the features related to zero-frequency filtering is
explained in this section, and extraction of the feature related to the Hilbert envelope of
the LP residual is discussed in Sec. 4.2.
During the production of voiced speech, the excitation to the vocal tract system can be
approximated by a sequence of impulses of varying amplitudes. The effect of discontinu-
ity due to the impulse-like excitation is reflected across all the frequencies, including 0 Hz
or the zero frequency [56, 58]. The effect of discontinuity due to the impulse-like exci-
tation is clearly visible in the output of narrowband filtering (at any frequency) of speech
signal. The advantage of choosing a zero-frequency filter is that the output is not affected
by the characteristics of the vocal tract system which has resonances at much higher fre-
quencies. Therefore the zero-frequency filtering helps in emphasizing the characteristics
of the excitation [56]. A zero-frequency resonator is an infinite impulse response filter
with a pair of poles located on the unit circle. A cascade of two such resonators is used
to provide sharper cut-off to reduce the effect of resonances of the vocal tract system.
The following steps are involved in processing speech signal to derive the zero-frequency
filtered signal [56, 58].
1. The speech signal s[n] is differenced to remove any slowly varying component
introduced by the recording device:

x[n] = s[n] − s[n − 1].   (3.1)
2. The differenced speech signal x[n] is passed through a cascade of two ideal zero-frequency
(digital) resonators. That is,

y0[n] = −∑_{k=1}^{4} a_k y0[n − k] + x[n],   (3.2)

where a1 = −4, a2 = 6, a3 = −4, and a4 = 1. The resulting signal y0[n] grows
approximately as a polynomial function of time.
3. The average pitch period is computed using the autocorrelation function of 30 ms
segments of x[n].
4. The trend in y0[n] is removed by subtracting the local mean, computed over the
average pitch period, at each sample. The resulting signal

y[n] = y0[n] − (1/(2N + 1)) ∑_{m=−N}^{N} y0[n + m]   (3.3)

is the zero-frequency filtered signal. Here 2N + 1 is the number of samples in the window
used for trend removal. The choice of the window size is not critical as long as it is in the
range of one to two pitch periods. Fig. 3.1(b) shows the filtered signal
of the speech segment shown in Fig. 3.1(a). It was shown that the instants of positive-
to-negative zero crossings (PNZCs) correspond to the instants of significant excitation in
voiced speech, called epochs [56]. The locations of the PNZCs of the filtered signal are shown
in Fig. 3.1(c). There is close agreement between the locations of the strong positive peaks
of the differenced electroglottograph (DEGG) signal and the instants of PNZCs derived
from the filtered signal. The instantaneous fundamental frequency (which is referred to
as instantaneous F0 in the present work) at each epoch is derived by computing the
reciprocal of the time interval between the current epoch and the next epoch [59]. Fig. 3.1(e)
shows the instantaneous F0. Since the effect due to an impulse is spread uniformly across
the frequency range, the strength of impulses can be derived from a narrow band around
the zero frequency. Hence the information about the strength of excitation can also be
derived from the zero-frequency filtered signal. It was observed that the slope of the
zero-frequency filtered signal around PNZCs gives a measure of the strength of excitation
[60]. The slope is measured by computing the difference between the negative sample
value and the positive sample value on either side of the epoch, and is denoted as the
strength of excitation (ǫ) [60]. Fig. 3.1(d) shows the plot of ǫ, derived from the filtered
signal in Fig. 3.1(b). The plot of ǫ shows a trend similar to the DEGG signal (Fig. 3.1(c)).
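The processing chain of this section can be sketched in Python as follows. This is a minimal NumPy illustration, not the exact implementation of [56, 58]: the fixed 8 ms average pitch period stands in for the autocorrelation-based estimate of step 3, and the trend removal is applied a few times (the text states a single subtraction; repeating it fully suppresses the polynomial growth of y0[n] in practice). All function and parameter names are mine.

```python
import numpy as np

def zero_frequency_filter(s, fs, avg_pitch_s=0.008, passes=3):
    # Step 1: difference the signal to remove slowly varying components.
    x = np.diff(s, prepend=s[:1]).astype(float)
    # Step 2: cascade of two zero-frequency resonators, i.e.
    # y0[n] = 4*y0[n-1] - 6*y0[n-2] + 4*y0[n-3] - y0[n-4] + x[n],
    # which is Eq. (3.2) with a1 = -4, a2 = 6, a3 = -4, a4 = 1.
    y0 = np.zeros_like(x)
    for n in range(len(x)):
        y0[n] = x[n]
        for k, a in ((1, -4.0), (2, 6.0), (3, -4.0), (4, 1.0)):
            if n >= k:
                y0[n] -= a * y0[n - k]
    # Steps 3-4: subtract the local mean over a window of about one
    # pitch period (2N+1 samples) to remove the polynomial trend.
    N = int(round(avg_pitch_s * fs))
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    y = y0
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y

def epochs_f0_strength(y, fs):
    # Epochs: positive-to-negative zero crossings (PNZCs) of y.
    pnzc = np.where((y[:-1] > 0) & (y[1:] <= 0))[0]
    # Instantaneous F0: reciprocal of the inter-epoch interval.
    f0 = fs / np.diff(pnzc)
    # Strength of excitation: slope of y across each epoch, measured as
    # the positive sample minus the following non-positive sample.
    strength = y[pnzc] - y[pnzc + 1]
    return pnzc, f0, strength
```

For a synthetic impulse train the filtered signal is close to a sinusoid at the fundamental, so the PNZCs fall roughly once per pitch period and the strengths are positive by construction.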
3.2 Speech material
The speech database used in the present study consists of 10 English sentences which are
chosen from the TIMIT dataset [23]. Each sentence was uttered by 25 male speakers at three
different speaking rates, namely, fast, normal, and slow. The speakers were undergraduate
and graduate students, aged between 20 and 30 years. All the speakers spoke Indian
English, and the native language of each speaker was one among Telugu, Hindi, Kannada,
and Tamil. The speakers were guided to listen to samples of fast and slow utterances
before they produced utterances at different speaking rates. The objective was to help the
speakers to produce speech at different speaking rates while maintaining naturalness of the
speech. Without the help of a reference, some speakers were unable to produce speech
at different speaking rates. For example, a speaker with a naturally slow speaking rate
produced fast speech that was similar to the normal speech of other speakers. Similar behavior was
observed with speakers who had naturally fast speaking rates. A total of 750 utterances
(250 utterances for each speaking rate) were collected. The speech signals were sampled
at 8 kHz.
Analysis of the recorded utterances was done to determine whether the speakers were
able to produce speech at the three different speaking rates. Syllable rate, which is defined
Fig. 3.2: Distribution of the measure of speaking rate for fast, normal, and slow utterances.
as the number of syllables per second, is used to measure the speaking rate. The syllable
rate of an utterance is estimated as the ratio of the number of syllables in the utterance
to its duration. Since the text corresponding to the utterances is available, the number of
syllables in an utterance is obtained from the corresponding text.
Fig. 3.2 shows the distributions of syllable rates of fast, normal, and slow utterances. The
distributions show a significant change between the three speaking rates.
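The measure itself is a simple ratio; as a concrete illustration (the syllable count and durations below are made-up numbers, not values from the database):

```python
def syllable_rate(n_syllables, duration_s):
    # Speaking rate = number of syllables / utterance duration,
    # in syllables per second.
    return n_syllables / duration_s

# A hypothetical 12-syllable sentence read at three rates:
fast = syllable_rate(12, 1.5)    # 8.0 syllables/s
normal = syllable_rate(12, 2.4)  # 5.0 syllables/s
slow = syllable_rate(12, 4.0)    # 3.0 syllables/s
```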
3.3 Variation of instantaneous F0
Fig. 3.3 shows the distributions of instantaneous F0 for four male speakers, chosen at
random from the set of 25 speakers. The distributions for these four speakers are examined
to illustrate differences among individuals, indicating the speaker-specific nature of the
variations. It is observed that the distribution of instantaneous F0 does discriminate
between fast, normal, and slow utterances of a speaker, although the amount of
discrimination is speaker-dependent. For the speakers in Figs. 3.3(a) and (b), there
is good discrimination between the distributions of instantaneous F0 for fast, normal, and
slow utterances. The discrimination can be observed from the means of the distributions,
and from their spreads. For the speaker in Fig. 3.3(c) there is very little difference between
Fig. 3.3: Distributions of instantaneous F0 for 4 male speakers. In each case, the solid ('—'), dashed ('- - -'), and dotted ('· · ·') lines correspond to normal, fast, and slow utterances, respectively.
the distributions of fast and normal utterances, but some discrimination between the
distributions of slow and normal utterances. The distributions shown in Fig. 3.3(d) are very
close to each other. Some speaker-specific characteristics can be inferred from the
distributions shown in Fig. 3.3. For a speaker with a naturally fast speaking rate, the
distinction between the instantaneous F0 of his/her fast and normal speech will be small.
Similarly, for a speaker with a naturally slow speaking rate, the instantaneous F0 of slow
and normal speech are similar. Some speakers are able to produce speech at three different
speaking rates while maintaining intelligibility and naturalness. In most cases, speech
uttered at at least one of the nonnormal (fast or slow) speaking rates showed a significant
difference from the speech uttered at the normal and the other nonnormal (slow or fast)
speaking rate.
In order to evaluate the variation of instantaneous F0 with speaking rate for all the
25 speakers in the dataset, the Kullback-Leibler (KL) divergence [61] is used. When two
distributions are described by univariate Gaussian probability density functions, the
(symmetric) KL divergence between the two distributions is given by [62]
dKL(A, B) = (1/2)(σA²/σB² + σB²/σA²) − 1 + (1/2)(µA − µB)²(1/σA² + 1/σB²),   (3.4)
where µA and σA denote the mean and the standard deviation, respectively, of the samples
in set A, while µB and σB denote the corresponding quantities for the samples in set B.
Also computed is µA − µB, the difference of the mean values µA and µB. In
this study, the samples in sets A and B are the values of instantaneous F0. Let us first
consider the case of fast and normal utterances. Consider the values of instantaneous F0
obtained from normal and fast utterances as two separate classes. When the values of
the instantaneous F0 in both A and B are from either normal or fast utterances, then it
is a case of intra-class comparison. Likewise, inter-class comparisons are those where
the values of the instantaneous F0 in A and B are derived from the fast (normal) and
normal (fast) utterances, respectively, of a speaker. Both dKL(A, B) and µA − µB should
be smaller for intra-class comparisons than for inter-class comparisons. The ordered
pair (µA − µB, dKL(A, B)) is used to distinguish between the normal and fast utterances of a
speaker, as described below.
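Eq. (3.4) translates directly into code from the two sample sets; a sketch (function and variable names are mine):

```python
import numpy as np

def d_kl(a, b):
    """Symmetric KL divergence of Eq. (3.4) between two sample sets,
    each modelled as a univariate Gaussian."""
    mu_a, mu_b = np.mean(a), np.mean(b)
    var_a, var_b = np.var(a), np.var(b)  # population variances (sigma^2)
    return (0.5 * (var_a / var_b + var_b / var_a) - 1.0
            + 0.5 * (mu_a - mu_b) ** 2 * (1.0 / var_a + 1.0 / var_b))
```

An intra-class comparison of two samples drawn from the same distribution yields a point near the origin of the (µA − µB, dKL) plane, while a shifted or rescaled set moves the point away from it; the expression is symmetric in A and B, which is why dKL(A, B) = dKL(B, A) below.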
Let N denote the set of values of instantaneous F0 of a given speaker, derived from the
10 utterances collected at the normal speaking rate. Let N1 and N2 denote two distinct
subsets of N, such that the values of instantaneous F0 in each subset are derived from 5
utterances at the normal speaking rate. For the same speaker, let F (S), F1 (S1), and F2
(S2) denote the corresponding sets derived from the utterances at the fast (slow) speaking
rate. For each speaker, the following ordered pairs are computed: (a) (µFi − µNj, dKL(Fi, Nj)),
for i = 1, 2 and j = 1, 2; (b) (µF − µN, dKL(F, N)); (c) (µFi − µFj, dKL(Fi, Fj)) for i = 1, 2,
j = 1, 2, and i ≠ j; and (d) (µNi − µNj, dKL(Ni, Nj)) for i = 1, 2, j = 1, 2, and i ≠ j.
The ordered pairs in (a) and (b) denote the inter-class comparisons within a speaker, while
those in (c) and (d) denote the intra-class comparisons within the speaker. Each ordered
pair can be plotted as a point in a two-dimensional plane. For each speaker 5 points are
computed due to
Fig. 3.4: Variation of the instantaneous F0 with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case.
inter-class comparisons and 2 points are computed due to intra-class comparisons (since
dKL(A, B) = dKL(B, A)). Thus, for the 25 speakers, 125 points are computed due to
inter-class comparisons and 50 points due to intra-class comparisons. Figs. 3.4(a) and (b)
show the points corresponding to speaker-specific intra-class and inter-class comparisons,
respectively, in the case of fast vs normal utterances for the 25 male speakers. For comparison
of utterances recorded at slow and normal speaking rates, the following ordered pairs are
computed for each speaker: (a) (µSi − µNj, dKL(Si, Nj)), for i = 1, 2, j = 1, 2; (b) (µS − µN,
dKL(S, N)); (c) (µSi − µSj, dKL(Si, Sj)) for i = 1, 2, j = 1, 2, and i ≠ j; and (d) (µNi − µNj,
dKL(Ni, Nj)) for i = 1, 2, j = 1, 2, and i ≠ j. The ordered pairs in (a) and (b) denote the
inter-class comparison points within a speaker, while those in (c) and (d) denote the intra-
class comparison points within the speaker. The slow vs normal intra-class and inter-class
comparison points are plotted in Figs. 3.4(c) and (d), respectively.
It is observed from Fig. 3.4 that the points due to intra-class comparisons (Figs. 3.4(a)
and (c)) are closer to the origin than the points due to inter-class comparisons (Figs. 3.4(b)
and (d)). The inter-class comparison points also have more spread than the intra-class
comparison points. Most of the fast vs normal inter-class comparison points (Fig. 3.4(b)) lie
in the first quadrant (positive abscissa and positive ordinate), which implies that µF > µN
in most cases. The spread of the points shows that there is variation in the distributions of
instantaneous F0 of fast and normal speech. Unlike the fast vs normal inter-class comparison
points, the slow vs normal inter-class comparison points (Fig. 3.4(d)) are distributed
in both the first quadrant and the second quadrant (negative abscissa and positive
ordinate). Also, the slow vs normal inter-class comparison points have a larger spread than
the slow vs normal intra-class comparison points (Fig. 3.4(c)). These observations imply
that almost all the speakers increase their instantaneous F0 when speaking fast, but when
the speaking rate is decreased, the instantaneous F0 increases for some speakers and
decreases for others.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which instantaneous F0 increases (or decreases) when the speaking rate is increased
(or decreased). Let f(x) be the CPDF computed from a set X. The percentage of samples
in X which are less than a number k is given by f(k) × 100. The difference between the
mean instantaneous F0 of a fast utterance and the mean instantaneous F0 of the corresponding
normal utterance is computed, and is denoted as ∆µFN. A total of 250 such values
are computed. A Gaussian probability density function is fitted to the values of
∆µFN. The CPDF computed from this Gaussian probability density function is shown
in Fig. 3.5(a). It can be observed from Fig. 3.5(a) that the CPDF is equal to 0.25 when
∆µFN = 0. This implies that for 25 % of the values ∆µFN ≤ 0 and for 75 % of the values
∆µFN > 0. This observation shows that in 75 % of the cases instantaneous F0 increases
when the speaking rate is increased and in 25 % of the cases instantaneous F0 decreases
when the speaking rate is increased. Similarly, a CPDF is constructed from the values of
the difference between the mean instantaneous F0 of a normal utterance and that of the
corresponding slow utterance (denoted by ∆µNS), which is shown in Fig. 3.5(b). It can be
observed from Fig. 3.5(b) that the CPDF is equal to 0.5 when ∆µNS = 0. This implies that
for 50 % of the values ∆µNS ≤ 0 and for 50 % of the values ∆µNS > 0. This observation
shows that in 50 % of the cases instantaneous F0 increases when the speaking rate is
decreased and in
Fig. 3.5: Cumulative probability density function of the difference between mean values of instantaneous F0 of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances.
the other 50 % of the cases instantaneous F0 decreases when the speaking rate is decreased.
This clearly shows the speaker-specific nature of the change in instantaneous F0 when the
speaking rate is decreased.
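The CPDF reading used above amounts to fitting a Gaussian to the Δµ values and evaluating its cumulative function at zero. A stdlib-only sketch (the function name and the sample values are mine):

```python
import math

def fraction_below_zero(values):
    """Fit a Gaussian to the values and return its CPDF at 0, i.e. the
    estimated fraction of cases with a negative difference."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    z = (0.0 - mu) / sigma
    # Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

With Δµ values centred above zero the function returns a value below 0.5, which is how the 0.25 for ∆µFN is read off Fig. 3.5(a).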
This observation has implications for the synthesis of speech at different speaking rates:
unlike duration, where an increase (decrease) in the speaking rate results in a corresponding
decrease (increase) in the duration, the instantaneous F0 does not follow a specific
trend. For synthesizing speech at different speaking rates, the instantaneous F0 therefore
has to be modified suitably.
3.4 Variation of strength of excitation
Fig. 3.6 shows the distributions of ǫ for the four male speakers (the same as in Fig. 3.3).
From Fig. 3.6 it is observed that ǫ does vary with speaking rate. The degree of variation is
speaker-dependent. The general trend observed across all the speakers is that the mean
value of ǫ (denoted ǭ) of fast speech is less than that of normal and slow speech. For
some speakers, ǭ of normal speech is less than that of slow speech (Figs. 3.6(b) and (d)),
whereas for some others ǭ of slow speech is less than that of normal speech (Figs. 3.6(a)
and (c)). Note that this is a speaker-specific property. The spread of the distribution of the
Fig. 3.6: Distributions of strength of excitation (ǫ) for 4 male speakers. In each case, the solid ('—'), dashed ('- - -'), and dotted ('· · ·') lines correspond to normal, fast, and slow utterances, respectively.
values of ǫ for fast speech is less than that for normal and slow speech. This implies that
the variation of ǫ is smaller for fast speech than for slow and normal speech. The similarity
between the distributions of ǫ of slow and normal speech is greater than the similarity
between the distributions of ǫ of fast and normal speech. For the speakers in Figs. 3.6(a), (b),
and (c) there is good discrimination between the distributions of ǫ of fast, normal, and
slow speech, but for the speaker in Fig. 3.6(d) the discrimination is very small. In order to
evaluate the variation of ǫ with speaking rate, the KL divergence is used. The evaluation
procedure is similar to the one used in the case of instantaneous F0, discussed in Sec. 3.3.
Figs. 3.7(a) and (b) show the points corresponding to speaker-specific intra-class and
inter-class comparisons, respectively, for the case of fast vs normal utterances of all the 25
male speakers. In a similar fashion, the speaker-specific intra-class and inter-class com-
parison points are computed for slow vs normal utterances, and are plotted in Figs. 3.7(c)
and (d). It is observed from Fig. 3.7 that the intra-class comparison points (Figs. 3.7(a)
and (c)) are closer to the origin than the inter-class comparison points (Figs. 3.7(b) and
Fig. 3.7: Variation of the strength of excitation (ǫ) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case.
(d)). The inter-class comparison points have more spread than the intra-class comparison
points. Most fast vs normal inter-class comparison points (Fig. 3.7(b)) lie in the second
quadrant, which implies that the average value of ǫ is lower for fast speech than for
normal speech, for most of the speakers. Unlike the fast vs normal inter-class comparison
points, the slow vs normal inter-class comparison points (Fig. 3.7(d)) lie in both the first
and the second quadrants. This implies that the trend observed for ǫ in the slow vs normal
case is different from that observed in the fast vs normal case.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which ǫ decreases (or increases) when the speaking rate is increased (or decreased).
Similar to the case of instantaneous F0 (Sec. 3.3), two CPDFs are computed: one from the
differences between the values of ǭ of a fast utterance and the corresponding normal
utterance (denoted by ∆µFN), and one from the differences between the values of ǭ of a
normal utterance and the corresponding slow utterance (denoted by ∆µNS). Figs. 3.8(a)
and (b) show the CPDFs constructed from the values of
Fig. 3.8: Cumulative probability density function of the difference between mean values of strength of excitation at epoch (ǫ) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances.
∆µFN and ∆µNS. It can be observed from Fig. 3.8(a) that the CPDF is equal to 0.75 when
∆µFN = 0. This implies that for 75 % of the values ∆µFN ≤ 0 and for 25 % of the values
∆µFN > 0. This observation shows that in 75 % of the cases ǫ decreases when the speaking
rate is increased and in 25 % of the cases ǫ increases when the speaking rate is increased.
Similarly, from the CPDF constructed from the values of ∆µNS (shown in Fig. 3.8(b)), it
can be observed that the CPDF is equal to 0.5 when ∆µNS = 0. This implies that for 50 %
of the values ∆µNS ≤ 0 and for 50 % of the values ∆µNS > 0. This observation shows that
in 50 % of the cases ǫ increases when the speaking rate is decreased and in the other 50 %
of the cases ǫ decreases when the speaking rate is decreased. This observation matches the
observations made in the case of the instantaneous F0.
3.5 Summary
In this chapter, we presented a study of the effect of speaking rate on two excitation
source features, namely (i) the instantaneous fundamental frequency and (ii) the strength
of excitation at the epoch. Both the instantaneous F0 and the strength of excitation at the
epoch (ǫ) are estimated from the speech signal by passing it through a zero-frequency
resonator. Observations of the normalized frequency distributions of instantaneous F0 for
various speaking rates showed that, when the speaking rate is increased, almost all the
speakers increase their instantaneous F0, whereas the change in instantaneous F0 when the
speaking rate is decreased is speaker-dependent. Observations of the normalized frequency
distributions of ǫ for various speaking rates showed that when the speaking rate is
increased ǫ decreases for most of the speakers, but when the speaking rate is decreased
the change in ǫ is speaker-dependent. In order to generalize these observations over a
large number of speakers, the KL divergence was used. The utterances of each speaker
were grouped into fast, normal, and slow classes. Intra-class and inter-class comparison
points between a nonnormal (fast or slow) speaking rate and the normal speaking rate
were computed. The variations of the intra-class and inter-class comparison points
correlated well with these observations. A cumulative probability density function was used
to quantify the number of instances in which instantaneous F0 increases or decreases when
the speaking rate is changed. When the speaking rate is increased, in 75 % of the cases
instantaneous F0 was observed to increase and in 25 % of the cases to decrease, whereas
when the speaking rate is decreased, in 50 % of the cases instantaneous F0 was observed to
increase and in 50 % of the cases to decrease. Similar results were observed in the case of
the strength of excitation at the epoch.
Chapter 4
Effect of speaking rate on loudness
Loudness is an important voice-quality feature of human speech. Variation in the degree
of loudness conveys nonlinguistic information such as the emotional state of the speaker
and emphasis on particular regions of speech utterances. In general, we perceive fast
speech as louder than slow speech. In this chapter, we study the effect of speaking rate
on the perception of loudness. This is done by conducting perceptual loudness tests and
by analyzing the variations of an objective loudness measure extracted from the speech
signal.
In Section 4.1, we describe the perceptual loudness test. Sections 4.2 and 4.3 describe
the extraction method and present the study of the variation of the loudness measure
extracted from the speech signal, respectively. In Section 4.4, a modified loudness measure
is proposed and its variation with speaking rate is analyzed.
4.1 Perceptual evaluation of loudness of speech at different speaking rates
Perceptual evaluation of loudness was carried out through subjective tests with 6
listeners in the age group of 20-23 years. The tests were conducted in a laboratory
environment by playing the speech signals through headphones. For perceptual evaluation,
a subset of the database (described in Sec. 3.2) was chosen. The subset contains speech
utterances spoken by 6 male speakers (i.e., 60 utterances at each speaking rate). Two
speech files, one at a normal speaking rate and the other at a fast speaking rate, were
played in succession. Both utterances contained the same sentence, and were spoken by
the same speaker. The listeners were asked to mark 'F' (or 'N') if they perceived the
utterance spoken at the fast (or normal) speaking rate as the louder of the two. If the
listeners were unable to distinguish between the loudness of the two utterances in a pair,
they were asked to mark 'X'. Sixty pairs of utterances containing fast and normal
utterances were used for listening. The same procedure was used to compare the loudness
of utterances spoken at normal and slow speaking rates. In this case, the listeners were
asked to mark the louder one as 'S' or 'N', corresponding to the slow or normal speaking
rate, respectively. Figs. 4.1(a) and (b) show the results of the perceptual evaluation.
From Figs. 4.1(a) and (b), it is observed that the listeners were able to distinguish
between the loudness of fast and slow speech when compared with normal speech (the
percentages of 'X' in Figs. 4.1(a) and (b) are 21 and 19, respectively). In the case of fast
vs normal speech, a significant number of the fast utterances were marked as louder (73 %
of the cases, as shown in Fig. 4.1(a)). By contrast, the loudness scores of normal and slow
utterances in the case of normal vs slow speech are close to each other (43 % and 33 %,
respectively, in Fig. 4.1(b)). This study shows that fast speech sounds louder than normal
speech in most cases, whereas the evidence is insufficient to show that normal speech is
louder than slow speech. It also shows that a change in the perception of loudness takes
place when the speaking rate is changed (because of the low percentage of 'X' in
Figs. 4.1(a) and (b)). The high loudness scores for fast speech can be due either to speech
production characteristics or to speech perception characteristics. An analysis of the cause
of this behavior is presented in the next section.
Fig. 4.1: Evaluation of the effect of speaking rate on the perception of loudness. Results of comparison of (a) fast speech and normal speech, and (b) normal speech and slow speech.
4.2 Extraction of loudness measure from the Hilbert envelope of the LP residual
The strength of the impulse-like excitation (also called the strength of excitation in [63])
is expressed in terms of the loudness measure defined by η = σ/µ. Here µ denotes the mean
of the samples of the Hilbert envelope (HE) of the LP residual in a short interval around
the instants of significant excitation, and σ denotes the standard deviation of the samples
of the HE [63]. Note that this loudness measure (η) is different from the strength of
excitation (ǫ) at an epoch defined in Sec. 3.4. The Hilbert envelope r[n] of the LP residual
e[n] is given by
r[n] = √(e[n]² + eH[n]²),   (4.1)
where eH[n] denotes the Hilbert transform of e[n]. The Hilbert transform eH[n] is given by

eH[n] = IFT(EH(ω)),   (4.2)

where IFT denotes the inverse Fourier transform, and EH(ω) is given by [64]

EH(ω) = +jE(ω) for ω ≤ 0, and −jE(ω) for ω > 0.   (4.3)
Fig. 4.2: (a) Segment of a speech signal from the VOQUAL’03 speech database [57], (b) 12th order LP residual, (c) Hilbert envelope of the LP residual, and (d) contour of η extracted from the Hilbert envelope. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
Here E(ω) denotes the Fourier transform of the signale[n]. Fig. 4.2(c) shows the
Hilbert envelope of the LP residual (Fig. 4.2(b)) of the speech segment shown in Fig. 4.2(a).
A 12th order LP analysis is performed on each frame of 20 ms with a frame shift of 5 ms
to compute the LP residual. The sampling frequency of the signal is 8 kHz. The impulse-
like feature of excitation can be observed clearly from the Hilbert envelope of the LP
residual. Comparative analysis of soft, normal, and loud speech was made and reported
in [63]. The Hilbert envelope of the LP residual around the epoch locations is sharper in
the case of loud speech, compared to soft and normal speech. This behavior is illustrated
in Fig. 4.3, where the plots are obtained by overlapping the short segments of the Hilbert
envelope of the LP residual around each epoch location. The sharpness of the Hilbert
envelope of the LP residual can be captured by the parameterη, which is computed using
a 3 ms segment of the Hilbert envelope around each epoch. The mean value of η derived
from the utterances of loud speech was observed to be greater than that of normal and soft
speech. Fig. 4.2(d) illustrates the contour of the values of η for the segment of the speech signal
shown in Fig. 4.2(a).
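The computation just described can be sketched as follows. This is a simplified illustration, not the thesis code: the LP analysis is done over the whole signal rather than frame-wise for brevity, and the function names `lp_residual` and `loudness_eta` are of our own choosing.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=12):
    """LP residual via the autocorrelation method (whole-signal, for brevity;
    the thesis uses 20 ms frames with a 5 ms shift)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Solve the normal equations R a = r for the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Inverse filtering: e[n] = x[n] - sum_k a_k x[n-k].
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def loudness_eta(x, epochs, fs=8000, win_ms=3.0):
    """eta = sigma/mu of the Hilbert envelope of the LP residual in a
    short (3 ms) window around each epoch, per Eq. (4.1) and the text."""
    e = lp_residual(x)
    he = np.abs(hilbert(e))            # r[n] = sqrt(e^2[n] + eH^2[n])
    half = int(win_ms * 1e-3 * fs / 2)
    etas = []
    for n in epochs:
        seg = he[max(0, n - half):n + half + 1]
        etas.append(seg.std() / (seg.mean() + 1e-12))
    return np.array(etas)
```

A sharper (more impulse-like) envelope around an epoch yields a larger standard deviation relative to the mean, hence a larger η, which is the behavior Fig. 4.3 illustrates.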
Fig. 4.3: Overlapping segments of the Hilbert envelope of the LP residual in the vicinity of epoch locations for (a) soft, (b) normal, and (c) loud utterances [63]. [Each panel plots HE amplitude over a 0.5–2.5 ms window; figure image not reproduced.]
4.3 Variation of loudness measure
Figs. 4.4(a) and (c) show the distributions of η for two male speakers. It is observed from
Figs. 4.4(a) and (c) that η varies very little with speaking rate, even though the perceptual
loudness scores show that fast speech is perceived to be louder than normal speech. In
order to evaluate the variations of η with speaking rate, we follow an analysis procedure
similar to the one used in the cases of instantaneous F0 and ǫ. Figs. 4.5(a) and (b) show the
points corresponding to speaker-specific intra-class and inter-class comparisons, respec-
tively, for the case of fast vs normal utterances for 25 male speakers. Likewise, Figs. 4.5(c)
and (d) show the speaker-specific intra-class and inter-class comparisons, respectively, in
the case of slow vs normal utterances. The intra-class and inter-class comparison points
(Figs. 4.5(a) and (b), respectively) in the case of fast vs normal utterances have almost
equal spread. The same behavior is observed in the case of slow vs normal utterances.
But the perception studies in Sec. 4.1 indicated that fast speech is perceived to be louder
than normal speech in most cases. This implies that the loudness measure (η) doesn’t
change significantly when the speaking rate is changed, and that the change in the per-
ception of loudness may be caused by other factors.
Fig. 4.4: Distributions of loudness measure (η) and proposed measure (ηp) of perceptual loudness for two male speakers, shown in (a), (c) and (b), (d), respectively. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively. [Normalized-frequency histograms over η and ηp; figure image not reproduced.]
4.4 Proposed measure of perceptual loudness
A parameter which measures the perception of loudness in the case of speech at different
speaking rates is proposed in this section. Speech at higher speaking rates is observed to
have higher instantaneous F0 than speech at normal and slow speaking rates. The instan-
taneous F0 at an epoch is defined as the reciprocal of the duration between two successive
epochs. Perception of loudness of the output signal depends, along with other factors, on
the energy emitted per unit time. The energy per unit time can be increased by increasing
the sharpness of excitation, or by increasing the number of excitations with equal sharp-
ness per unit time. Speech at higher speaking rates will have more excitations
per unit time than speech spoken at normal and slow speaking rates. More excitations
in a unit time result in a signal with greater energy. This may be the reason for
perceiving fast speech as louder than normal and slow speech. A parameter to measure
Fig. 4.5: Variation of loudness measure (η) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the intra-class and inter-class comparisons, respectively, for the slow vs normal case. [Scatter plots of dKL(A, B) vs µA − µB; figure image not reproduced.]
the perceptual loudness in the case of speech at different speaking rates is obtained by
normalizing the loudness measure (η) by the pitch period. The measure is defined as
ηp = η/t0, where η is considered as a measure of the loudness at an instant of significant
excitation, and t0 is the pitch period (measured in seconds) at that instant. Figs. 4.4(b) and (d)
show the distributions of ηp for the two male speakers of Figs. 4.4(a) and (c). From these
figures it is observed that the discrimination between the distributions of ηp is better than that
between the distributions of η. But the distributions of ηp are closer to each other when
compared to the case of instantaneous F0 and ǫ. Note that this discrimination is sufficient
to model the difference in loudness between speech at different speaking rates. In the case
of speech at different loudness levels, there is a significant increase in loudness, which is
captured by the loudness measure (η) [63]. On the other hand, in the case of speech at
different speaking rates, the change in the loudness measure is not significant. This small
change in loudness due to the increased number of pitch periods per unit time is captured by
the proposed perceptual loudness measure ηp.
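The normalization described above amounts to a one-line operation once the η values and epoch locations are available. The sketch below is illustrative (the pitch period at an epoch is estimated as the interval to the next epoch, following the definition of instantaneous F0 in the text; the function name is ours):

```python
import numpy as np

def perceptual_loudness(eta, epochs, fs=8000):
    """eta_p = eta / t0: normalize the loudness measure at each epoch by the
    local pitch period t0 (in seconds), taken as the interval to the next epoch.
    Returns eta_p at all but the last epoch."""
    epochs = np.asarray(epochs)
    t0 = np.diff(epochs) / fs          # pitch periods in seconds
    return np.asarray(eta)[:-1] / t0
```

Since ηp = η/t0 is effectively η scaled by the instantaneous F0, faster speech (shorter t0) raises ηp even when η itself barely changes, which is exactly the behavior the measure is designed to capture.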
Fig. 4.6: Variation of proposed measure (ηp) of perceptual loudness with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case. [Scatter plots of dKL(A, B) vs µA − µB; figure image not reproduced.]
The KL divergence measure is used to evaluate the variations of ηp with speaking rate.
The evaluation procedure is similar to the one used in the cases of instantaneous F0, ǫ,
and η. Figs. 4.6(a) and (b) show the fast vs normal intra-class and inter-class comparison
points, respectively. Figs. 4.6(c) and (d) show the slow vs normal intra-class and inter-
class comparison points, respectively. The spread of the inter-class comparison points
(Figs. 4.6(b) and (d)) is greater than that of the intra-class comparison points (Figs. 4.6(a)
and (c)) in both the fast vs normal and the slow vs normal cases. The difference between the
spreads of the inter-class and intra-class comparison points in the case of ηp is greater
than that in the case of η (Fig. 4.5). Also, the slow vs normal inter-class comparison points
(Fig. 4.6(d)) are spread in both the first and second quadrants. This implies that for some
speakers normal speech sounds louder than slow speech, while for other speakers it is
the opposite. The fast vs normal inter-class comparison points (Fig. 4.5(b)) for the case
of η are spread across the first and second quadrants, contrary to the perceptual loudness
scores. But in the case of ηp they are spread only in the first quadrant (Fig. 4.6(b)), which
implies that fast speech sounds louder than normal speech in most cases, an observation
which correlates with the perceptual loudness scores. Therefore the proposed measure (ηp)
correlates well with the perceptual loudness scores in both the fast vs normal and the slow
vs normal cases, which implies that ηp is a better measure of perceptual loudness than η
for speech at different speaking rates.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which ηp increases (or decreases) when the speaking rate is decreased (or increased).
Similar to the cases of instantaneous F0 (Sec. 3.3) and strength of excitation
at an epoch (Sec. 3.4), two CPDFs are computed: one from the difference between the
mean values of ηp of a fast utterance and the corresponding normal utterance (denoted by
∆µFN), and one from the difference between the mean values of ηp of a normal utterance and the corresponding slow utterance (denoted by ∆µNS).
Figs. 4.7(a) and (b) show the CPDFs constructed from the values of ∆µFN and ∆µNS, respectively. It can
be observed from Fig. 4.7(a) that the CPDF is equal to 0.25 when ∆µFN = 0. This implies that
for 25 % of the values ∆µFN ≤ 0, and for 75 % of the values ∆µFN > 0. This observation
shows that in 75 % of the cases ηp increases when the speaking rate is increased, and in 25
% of the cases ηp decreases when the speaking rate is increased. Similarly, from the CPDF
constructed from the values of ∆µNS (shown in Fig. 4.7(b)), it can be observed that the CPDF
is equal to 0.5 when ∆µNS = 0. This implies that for 50 % of the values ∆µNS ≤ 0 and
for 50 % of the values ∆µNS > 0. This observation shows that in 50 % of the cases ηp
increases when the speaking rate is decreased, and in the other 50 % of the cases ηp decreases
when the speaking rate is decreased.
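Reading off percentages such as the 25 % and 50 % above corresponds to evaluating the empirical cumulative distribution of the ∆µ values at zero. This can be sketched as follows (an illustrative helper, not from the thesis):

```python
import numpy as np

def cpdf_at_zero(deltas):
    """Fraction of utterance pairs with delta_mu <= 0, i.e. the value of
    the empirical cumulative distribution at 0 (as read off Fig. 4.7)."""
    deltas = np.sort(np.asarray(deltas, dtype=float))
    # searchsorted with side="right" counts entries <= 0.
    return np.searchsorted(deltas, 0.0, side="right") / len(deltas)
```

A value of 0.25 at zero means that in 75 % of the pairs the measure increased with the speaking-rate change, as the text interprets for ∆µFN.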
4.5 Summary
In this chapter, a study on the effect of speaking rate on the perception of loudness was
presented. A perceptual loudness test and an analysis of the variations of an objective loud-
ness measure extracted from speech signals were performed. The perceptual
loudness test results showed that for most speakers fast speech was perceived to be louder
than normal speech, whereas the loudness difference between normal speech and slow
Fig. 4.7: Cumulative probability density function of the difference between mean values of the proposed loudness measure (ηp) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances. [CPDF curves over ∆µFN and ∆µNS; figure image not reproduced.]
speech is speaker specific. Analysis of the variations of the loudness measure (η) did not show
significant changes with speaking rate. A modified measure of loudness (ηp) for
speech at different speaking rates is proposed, and its variations seem to correlate well
with subjective loudness variations.
Chapter 5
Incorporation of excitation source
variations in synthesis of speech at
different speaking rates
When humans modify the speaking rate, the most obvious change occurs in the duration. There are
several approaches in the literature for duration modification of a given speech signal. Some
of these approaches use the sinusoidal model, pitch synchronous overlap and add (PSOLA),
and phase vocoders to modify the duration [65, 66]. These methods modify the speech
signal directly to achieve the desired duration modification, which may produce some
spectral and phase distortions. Modification of the linear prediction (LP) residual to
achieve the desired duration modification was proposed in [67]. Modification of dura-
tion in the residual domain reduces the spectral and phase distortions [67]. All the
above mentioned methods modify the duration of the speech signals uniformly. In gen-
eral, it is observed that not all speech regions are uniformly modified with changes in the
speaking rate. A few methods have been suggested in the literature to perform nonuniform
duration modification [66, 55, 54]. The assumption in most nonuniform duration mod-
ification methods is that compression and expansion do not occur during sounds which
are not voiced, and that they occur during voicing due to changes in the speaking rate. To
perform nonuniform duration modification, voicing probability derived from a sinusoidal
pitch estimate was used in [66], while information from voicing onset time was used in [54].
It was shown in previous chapters (chapters 2, 3 and 4), and in several studies in the liter-
ature, that along with duration, various other acoustic features also vary when the speaking
rate is changed [39]. These changes occur at the subsegmental (less than a pitch period), seg-
mental, and suprasegmental levels. For producing natural sounding synthetic speech at
different speaking rates, not only the overall duration, but also the subsegmental, segmental,
and suprasegmental level variations need to be captured and incorporated in synthe-
sis. This chapter attempts to incorporate some of the variations at the subsegmental level into
a nonuniform duration modification algorithm. The nonuniform duration modification al-
gorithm is based on the epoch-based uniform duration modification approach proposed in
[54].

In section 5.1, an analysis of the variations in duration of different sound units is presented. A
nonuniform duration modification method, which incorporates the duration variations shown
in section 5.1 and the variations of the excitation source features described in chapter 3, is
described in section 5.2. Evaluation of the proposed method is given in section 5.3. Sec-
tion 5.4 summarizes the chapter.
5.1 Variation of durations of voiced, unvoiced and silence regions
The speech signal can be considered as the output of a linear system (the vocal tract) excited by
the vibrations of the vocal folds. Based on the type of vibration of the vocal folds, the speech
signal can be divided into three regions, namely voiced, unvoiced and silence regions.
The vibration of the vocal folds in voiced speech is periodic, whereas the excitation in the
unvoiced regions is random and can be modelled as noise-like. Vibration of the vocal folds
ceases in the silence regions. In this section, an analysis of the variations of the durations
of these regions with speaking rate is presented, to understand the effect of speaking rate
variation on the gross level characteristics of the excitation source.
5.1.1 Identification of voiced, unvoiced and silence regions
During the production of voiced speech, vibration of the vocal folds is prominent, with high
strength of excitation. In the absence of vocal fold vibration, the vocal-tract system can
be considered to be excited by random noise, as in the case of fricatives. The energy of
the random noise excitation is distributed in both the time and frequency domains, whereas
the energy of an impulse is distributed uniformly in the frequency domain and is highly
concentrated in the time domain. As a result, the zero frequency filtered signal exhibits
significantly lower amplitudes for random noise excitation compared to the impulse-like
excitation. It was shown that the energy of the zero-frequency filtered signal described in
section 3.1 can be used to detect the regions of significant vocal fold vibration (voiced
regions). Fig. 5.1(b) shows the zero frequency filtered signal of the speech signal shown
in Fig. 5.1(a). The amplitude of the zero frequency filtered signal is significantly higher
in the voiced regions compared to nonvoiced regions (for example, the region around 0.3
seconds in Fig. 5.1). The energy of the filtered signal (vzfr) over an interval of 10 ms is
used as the feature for discriminating between voiced and nonvoiced speech. The binary
voiced-nonvoiced signal is computed as

dvnv[n] = 1, if yzfr[n] > 0.5, and dvnv[n] = 0 otherwise, (5.1)

where yzfr[n] = 1 − e^(−10 vzfr[n]). Fig. 5.1(d) shows the voiced-nonvoiced decision for
the speech signal shown in Fig. 5.1(a).
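Equation (5.1) can be sketched directly. Note that yzfr > 0.5 is equivalent to the energy vzfr exceeding ln(2)/10 ≈ 0.069 (an illustrative helper; the function name is ours):

```python
import numpy as np

def voiced_nonvoiced(vzfr):
    """Binary voiced-nonvoiced decision of Eq. (5.1): the 10 ms energy of the
    zero-frequency filtered signal is squashed to (0, 1) and thresholded."""
    yzfr = 1.0 - np.exp(-10.0 * np.asarray(vzfr, dtype=float))
    return (yzfr > 0.5).astype(int)
```

The exponential squashing makes the threshold insensitive to the exact scale of the energy once it is clearly above the noise floor.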
Nonvoiced regions consist of silence and unvoiced regions of speech. Unvoiced re-
gions typically have a higher spectral energy than silence regions at higher frequencies
(around 3000 Hz). This information is used to separate silence regions and unvoiced re-
gions in the nonvoiced regions. A resonator located at 3000 Hz and having a bandwidth
of 100 Hz is used to filter the speech signals. The output signal will have frequencies only
in the regions around 3000 Hz. The system function of a two pole resonator having center
Fig. 5.1: Illustration of voiced-nonvoiced discrimination using the zero frequency filtered signal. (a) Segment of a speech signal, (b) zero frequency filtered signal, (c) energy of the zero frequency filtered signal, and (d) binary voiced-nonvoiced signal. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
frequency at 3000 Hz and bandwidth of 100 Hz is given by

H(z) = 1 / (1 − 2e^(−π/8) cos(5π/4) z^(−1) + e^(−π/4) z^(−2)). (5.2)
A high frequency signal (xuv[n]) is obtained by filtering the speech signal through H(z).
The energy of xuv[n] over a 10 ms window, represented by vuv[n], is used to discriminate between
unvoiced and silence regions. A binary unvoiced-silence decision is computed as

duv[n] = 1, if vuv[n] > vt, and duv[n] = 0 otherwise, (5.3)

where vt is the threshold. The value of the threshold is the maximum of the average energies
of the first 200 ms and the final 200 ms of vuv[n]. This is based on the assumption that there is a
silence region at the beginning and end of a sentence. Fig. 5.2(d) shows the unvoiced
decision plot obtained using eqn. (5.3). The remaining regions between the start and end
Fig. 5.2: Illustration of unvoiced-silence discrimination using the output of the resonator H(z) (equation 5.3). (a) Segment of a speech signal, (b) output of the resonator H(z), (c) energy of the filtered signal, and (d) binary unvoiced-silence signal. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
points of an utterance are chosen as silence regions. From Fig. 5.2 it is clear that the
proposed boundary identification method is fairly reliable, and these boundaries are used
to compute the durations of voiced, unvoiced and silence regions in a given sentence.
5.1.2 Duration variations
For each utterance in the database, the durations of the voiced, unvoiced and silence regions
are obtained. The measures used to analyze the duration variations of the segments with
speaking rate are the percentage deviation of the duration of a segment, and the duration modification
factor of a segment. The term segment refers to any one of the {voiced, unvoiced, silence}
regions. The percentage deviation of the duration of a segment is determined by

D = ((y − x) / x) × 100,
Table 5.1: Percentage deviation in the durations of speech segments for different speaking rates

Speech segment | Normal to fast (µ, σ) | Normal to slow (µ, σ)
voiced         | -22.71, 8.99          | 37.67, 20.88
unvoiced       | -26.90, 25.68         | 51.64, 50.24
silence        | -16.49, 10.75         | 40.79, 40.66
where x and y are the durations of the reference and test segments, respectively. Segments at the normal
speaking rate are chosen as reference segments, whereas segments at fast or slow
speaking rates are chosen as test segments.
For each of the voiced, unvoiced and silence regions, the percentage deviation of du-
ration is computed when the speaking rate is changed from normal to fast, or from normal to slow.
The details of the percentage deviation of duration are given in Table 5.1. The numbers shown
in Table 5.1 are the mean and standard deviation of the percentage deviation of the duration of
segments, computed using the 250 utterances (25 speakers × 10 utterances). A negative sign
of the mean indicates a decrease in duration, and a positive sign indicates an increase in duration.
Table 5.1 shows that the percentage deviation of duration of unvoiced regions is greater
than that of voiced regions in both the normal to fast and normal to slow cases. Unvoiced
regions consist of unvoiced consonants (fricatives and unvoiced stops), whereas voiced
regions consist of vowels and voiced consonants. Previous studies have reported that the du-
rations of vowels and pauses change significantly when the speaking rate is changed,
while the durations of consonants change very little [54]. This study shows that not only the
durations of voiced regions, but also the durations of unvoiced regions undergo significant
changes when the speaking rate is changed. Also, the percentage change of duration is
less in the normal to fast change compared to the normal to slow change, which correlates with
previous studies [54]. The standard deviation (σ) values shown in Table 5.1 are large and comparable
with the mean (µ) values. In order to capture the variation in the durations of these segments,
the relation between the duration modification factor (α) of these units and the α of an utterance is
analyzed.
The duration modification factor of a unit is defined as α = t_f / t_i, where t_i and t_f are the
initial and final durations, when the speaking rate is changed from normal to a non-normal
Fig. 5.3: Scatter plots of duration modification factors of a region vs the corresponding utterance. (a), (b) and (c) show the scatter plots of αv vs αS, αu vs αS, and αp vs αS, respectively, in normal to fast conversion. Similarly, (d), (e) and (f) show the scatter plots in normal to slow conversion. In each plot, the straight line that fits the cluster is shown by a solid line. [Figure image not reproduced.]
(fast or slow) speaking rate. The duration modification factors of the voiced, unvoiced, silence
regions and the whole sentence are represented as αv, αu, αp and αS, respectively. If αr represents
any one of the voiced, unvoiced and silence regions, it is analyzed with reference to αS to
identify the relationship between the α of a region and that of the utterance. Fig. 5.3 shows the scatter
plots of αr vs αS in both the normal to fast and normal to slow cases. In order to identify the
relationship between the α of a region and that of the utterance, a polynomial curve fitting algorithm
was used to find the straight line that best fits the clusters shown in Fig. 5.3. The slopes of
the lines that fit the clusters and the root mean square error (RMSE) of the fits are given in
Table 5.2.

Table 5.2 shows that the error of the fit is highest for silence regions and lowest for
voiced regions in both the normal to fast and normal to slow cases. This implies that the vari-
ability in the way in which voiced duration is modified is less than that of unvoiced and
silence regions. Also, the error of the fit is greater for the normal to slow case than for the normal to
Table 5.2: Slopes (m) of the straight lines that fit the clusters shown in Fig. 5.3, and root mean squared error (RMSE) of the fits, illustrating the variation in duration modification of different units.

Speech segment | Normal to fast (m, RMSE) | Normal to slow (m, RMSE)
voiced         | 0.68, 0.057              | 0.44, 0.145
unvoiced       | 0.69, 0.148              | 0.63, 0.4902
silence        | 1.434, 0.175             | 2.39, 0.6482
fast case. This implies that the variability in duration modification of regions during normal
to fast conversion is less than for normal to slow conversion. If the slope of the line that
fits the cluster is approximately equal to 1, then αr is equal to αS. Likewise, if the slope
of the line is less than 1, then αr is less than αS, and vice-versa. Observing the slopes of the
lines that fit the clusters, the slopes of the lines corresponding to voiced and unvoiced regions
are less than 1, and that of the silence region is more than 1. This implies that in both normal
to fast and normal to slow conversions, the duration modification factor of voiced and un-
voiced regions is less than the duration modification factor of the utterance, and the duration
modification factor of silence regions is higher than the duration modification factor of
the utterance. This analysis shows that humans rely more on variations of silence re-
gions than on variations of voiced and unvoiced regions during the production of speech
at different speaking rates.
5.2 Synthesis of speech at different speaking rates
It was shown in section 5.1.2, by analyzing the variation in the durations of voiced,
unvoiced and silence regions, that nonuniform duration modification occurs when humans
produce speech at different speaking rates. When the task is to modify the duration of
an utterance by a required factor, prior information about the duration modification factors
of the voiced, unvoiced and silence regions is required. The factor
by which the duration of a segment has to be modified, when the duration modification
factor of the utterance is available, is obtained from the lines that fit the clusters shown in
Fig. 5.3. The line is of the form y = mx + c, where ‘m’ and ‘c’ are available, the abscissa
denotes the duration modification factor of the utterance, and the ordinate denotes the duration
modification factor of the unit. A nonuniform duration modification method, which uses
the information about the duration modification factor of a segment when the duration modification
factor of the utterance is available, is proposed in this section.
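A sketch of this lookup using the slopes from Table 5.2 follows. The intercepts ‘c’ are not listed in the text, so a zero intercept is assumed here purely for illustration, and the names are ours:

```python
# Slopes m from Table 5.2; the intercept c is assumed 0 for illustration only.
SLOPES_FAST = {"voiced": 0.68, "unvoiced": 0.69, "silence": 1.434}
SLOPES_SLOW = {"voiced": 0.44, "unvoiced": 0.63, "silence": 2.39}

def segment_alpha(alpha_s, direction="fast", c=0.0):
    """Map the utterance-level duration modification factor alpha_S to
    per-segment factors via the fitted lines y = m*x + c (Fig. 5.3)."""
    slopes = SLOPES_FAST if direction == "fast" else SLOPES_SLOW
    return {seg: m * alpha_s + c for seg, m in slopes.items()}
```

With slopes below 1 for voiced and unvoiced regions and above 1 for silence, the lookup reproduces the observation that silence is compressed or expanded more aggressively than speech regions.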
The nonuniform duration modification method is based on the epoch-based duration mod-
ification method proposed in [67]. There are four main steps involved in the epoch-based
time scale modification method (uniform duration modification) [67]:

1. Deriving the instants of significant excitation (epochs) from the LP residual signal.

2. Deriving a modified (new) epoch sequence according to the desired duration modification factor.

3. Deriving a modified LP residual signal from the modified epoch sequence.

4. Synthesizing speech using the modified LP residual and the LPCs.
It involves deriving a new excitation (LP residual) signal by incorporating the desired
modification in the duration of the utterance. This is done by first creating a new sequence
of epochs from the original sequence of epochs. Each epoch is associated with a time,
pitch period, linear prediction (LP) residual and linear prediction coefficients (LPCs). The
new epoch sequence involves either the insertion of new epochs for time scale expansion,
or the deletion of some epochs for time scale compression. The residual is accessed from the
original epochs, and it is modified according to the new epoch sequence. To increase the
duration, some portions of the residual are replicated at specific locations. Similarly, to
reduce the duration, some portions of the residual are omitted at specific locations.
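Step 2, the derivation of a new epoch sequence, can be sketched as follows. This is an illustrative reading of the method, not the code of [67]: local pitch periods are preserved while the overall span is scaled by the factor α, so epochs are effectively inserted (expansion) or deleted (compression).

```python
import numpy as np

def modify_epoch_sequence(epochs, alpha):
    """Build a new epoch sequence whose total span is alpha times the
    original, by walking the stretched time axis and placing each new
    epoch one local pitch period after the previous one."""
    epochs = np.asarray(epochs, dtype=float)
    periods = np.diff(epochs)               # local pitch periods (assumed > 0)
    new_epochs = [epochs[0]]
    target_end = epochs[0] + alpha * (epochs[-1] - epochs[0])
    while new_epochs[-1] < target_end:
        # Map the current new-epoch time back to the original time axis
        # and read off the pitch period there.
        t_orig = epochs[0] + (new_epochs[-1] - epochs[0]) / alpha
        k = min(np.searchsorted(epochs, t_orig, side="right") - 1,
                len(periods) - 1)
        new_epochs.append(new_epochs[-1] + periods[k])
    return np.array(new_epochs)
```

Because each new epoch is spaced by the original local pitch period, the pitch contour is left intact and only the number of excitation instants changes with the duration.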
The proposed nonuniform duration modification method is as follows. The linear predic-
tion residual of the speech signal is computed by performing a 12th order linear prediction
(LP) analysis using a 20 ms frame size and a 5 ms frame shift. Identification of the bound-
aries of the voiced, unvoiced and silence regions is performed using the method described
in Sec. 5.1.1. Modification of the duration of the LP residual is performed with the dura-
tion modification factor equal to the voiced duration modification factor, using the epoch-based
method [67]. This step takes care of the duration modification of the voiced regions. The resid-
ual corresponding to the silence and unvoiced regions is resampled to match the required
duration, and the silence and unvoiced regions in the scaled residual signal are replaced by the
resampled residual signal. This step takes care of the required duration modification of the
silence and unvoiced regions. The filter coefficients (LPCs) are updated depending on the
length of the modified LP residual. Speech with the desired duration modification can be
synthesized by exciting the all-pole filter using the modified LP residual.
It was shown in chapters 3 and 4 that when the speaking rate is changed, along with the mod-
ification of duration, the features related to the excitation source also change. The variations of
the instantaneous F0 are incorporated in the proposed nonuniform duration modification
method using the epoch-based pitch modification algorithm proposed in [67].
5.3 Evaluation of synthetic speech at different speaking
rates
The performance of the proposed method for synthesis of speech at different speaking
rates is compared with the epoch-based duration modification method using perceptual eval-
uation. The perceptual evaluation was carried out by conducting subjective tests with 10 re-
search scholars in the age group of 21-35 years. Two sentences were chosen for the
test. Speech signals were derived for duration modification factors from 0.5 to 1.5 in
steps of 0.2. For each modification factor, three types of speech signals were derived:
speech signals using (i) uniform duration modification (U), (ii) nonuniform du-
ration modification (NU), and (iii) nonuniform duration modification with instantaneous
F0 variations incorporated (NU + F0). In the NU + F0 method, the instantaneous F0 is
modified by a constant factor, 1.2 and 0.8, for the cases of normal to fast and normal to
slow conversion, respectively. The tests were conducted in a laboratory environment
by playing the speech signals through headphones. Two types of tests were conducted: (i) a mean opinion
score (MOS) test, in which the listener assigns a score from 1 (worst) to 5 (best)
based on the quality and perceptual distortion, and (ii) an AB ranking test, where the listener
has to choose an utterance from the presented utterances.
The mean opinion scores (MOS) for each duration modification factor are shown in Ta-
Table 5.3: Mean opinion scores and AB ranking test (in %).

Method  | α = 0.5 | 0.7 | 0.9 | 1.1 | 1.3 | 1.5 | AB-test (%)
U       | 2.0     | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 12.5
NU      | 2.6     | 3.3 | 3.6 | 3.0 | 3.3 | 3.3 | 50.0
NU + F0 | 2.6     | 3.0 | 4.0 | 3.3 | 3.0 | 3.0 | 37.5
ble 5.3. It can be observed from the perceptual evaluation that speech synthesized using the NU
method has higher scores than U for almost all duration modification factors (except
for α = 1.1). Similarly, NU + F0 has higher scores than U for some values of α and equal
scores for the rest. In the AB ranking test (shown in the last column of Table 5.3), a significant number of
listeners chose speech signals synthesized using the NU and NU + F0 methods (50 % +
37.5 %). It was shown in Sec. 3.3 that the instantaneous F0 varies significantly with speaking
rate, but the difference between the scores of NU and NU + F0 is not as large as expected. The qual-
ity of NU + F0 may be improved by modifying the instantaneous F0 in a speaker-specific manner.
Also, a study of the temporal variations of the instantaneous F0, and the incorporation of these
variations in synthesis, may improve the quality.
5.4 Conclusions
In this chapter, the effect of speaking rate on the durations of voiced, unvoiced and silence
regions was studied. The study showed that the variability is less in voiced duration modifica-
tion than in unvoiced and silence duration modification. Also, the voiced duration modification
factor is less than those of the unvoiced and silence segments. A nonuniform duration modification
method is proposed which is based on the epoch-based uniform duration modification
method, and which obtains the needed information about the duration modification factors of the
voiced, unvoiced and silence regions from the scatter plots of αr vs αS. The variations of
the instantaneous F0 have been incorporated in the proposed nonuniform duration modifica-
tion method. Perceptual evaluation results showed that speech signals synthesized using
the proposed methods have higher scores than the epoch-based duration modification method.
The quality of synthesis of speech at different speaking rates can be improved further by
incorporating variations in the perceived loudness and temporal variations of the instan-
taneous F0.
Chapter 6
Summary and Conclusions
6.1 Summary of the work
Speaking rate and its effects have been studied in the literature using features of articulatory movements, and features at the segmental and suprasegmental levels derived from the acoustic speech signal. This thesis presents an analysis of the effect of speaking rate, and of changes in speaking rate, on the features of the excitation source of the vocal tract system. Three features related to the excitation source are derived, namely, (a) the instantaneous fundamental frequency (F0), (b) the strength of excitation (ε), and (c) a measure of perceived loudness (η). Of these, the instantaneous F0 and the strength of excitation at the epoch are derived from the zero-frequency filtered signal. The zero-frequency filtered signal carries information about the sequence of impulse-like excitations involved in the production of voiced speech. The feature η is derived from the Hilbert envelope of the linear prediction residual of the speech signal, where the residual is an estimate of the excitation source.
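The zero-frequency filtering idea can be illustrated with a minimal sketch, following the general structure of the method of Murty and Yegnanarayana (cascaded zero-frequency resonators followed by trend removal). The window length and number of trend-removal passes below are illustrative choices, not the exact settings used in the thesis:

```python
import numpy as np
from scipy.signal import lfilter

def zff_epochs(x, fs, f0_guess=100.0, trend_passes=3):
    """Zero-frequency filtering: a simplified sketch.

    The differenced signal is passed twice through a zero-frequency
    resonator (a pair of integrators, H(z) = 1 / (1 - z^-1)^2); the slowly
    varying trend is then removed by repeatedly subtracting a local mean
    computed over roughly one pitch period. Negative-to-positive zero
    crossings of the result approximate the epochs (instants of
    significant excitation).
    """
    d = np.diff(x, prepend=x[0])              # difference to remove DC bias
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)   # zero-frequency resonator, pass 1
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)   # zero-frequency resonator, pass 2
    w = int(fs / f0_guess) | 1                # odd window, about one pitch period
    kernel = np.ones(w) / w
    for _ in range(trend_passes):             # iterative local-mean subtraction
        y = y - np.convolve(y, kernel, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed signal
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
    return y, epochs
```

For a periodic input, the positive-going zero crossings of the filtered signal fall once per fundamental period, which is what makes the instantaneous F0 and the slope at those crossings (the strength of excitation) directly measurable.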
The variations of the instantaneous F0 and ε with speaking rate showed a systematic trend. Speech at high speaking rates (fast speech) generally has a higher instantaneous F0 and a lower ε than speech spoken at low speaking rates (slow speech). Most of the speakers increased their instantaneous F0 when speaking fast, but when the speaking rate was decreased, the instantaneous F0 increased for some speakers and decreased for others. For most of the speakers, ε was observed to decrease when speaking fast, but when the speaking rate was decreased, ε decreased for some speakers and increased for others.
In order to generalize the observations, the distribution of each feature for a given speaker
and a given speaking rate was approximated by a univariate Gaussian probability density
function. Kullback-Leibler divergence was then employed to estimate the discrimination
between the distributions of a feature for two different speaking rates.
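For univariate Gaussians the KL divergence has a closed form, so the discrimination between two speaking-rate distributions can be computed directly from the fitted means and standard deviations. A minimal sketch (a symmetrized version is also shown, since the directed divergence is asymmetric):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Directed KL divergence D( N(mu1, s1^2) || N(mu2, s2^2) ) in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

def kl_symmetric(mu1, s1, mu2, s2):
    """Symmetrized divergence, since D(p||q) != D(q||p) in general."""
    return 0.5 * (kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1))
```

For identical distributions the divergence is zero; shifting the mean of a unit-variance Gaussian by 1 gives a directed divergence of 0.5 nats.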
The effect of speaking rate on the perception of loudness was also investigated. Subjective listening tests were conducted to determine whether fast or slow speech sounded louder compared to normal speech. From the subjective tests, the following conclusions are drawn: (i) The subjects perceived a difference in loudness between speech at nonnormal (fast or slow) and normal speaking rates for a significant number (about 80 %) of the utterances used for listening. (ii) Fast speech sounds louder than normal speech in a majority of utterances (73 %). (iii) Slow speech sounds louder than normal for some utterances (33 %), while the reverse is true for some other utterances (43 %). The acoustic feature η, which measures perceived loudness, was employed to measure the loudness differences between speech at different speaking rates. Unlike the scores obtained from the perceptual studies, η did not show a significant difference between utterances at different speaking rates. A modified measure of perceptual loudness, denoted by ηp, is therefore defined on the basis of the number of impulse-like excitations produced per unit time. The variation of ηp with speaking rate was analyzed, and was found to correlate well with the results of the subjective studies. Therefore, ηp is a better measure of perceptual loudness than η for speech at different speaking rates.
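The idea behind ηp can be illustrated with a sketch under an assumed form; the exact definition is derived in the earlier chapters and is not reproduced here. The sketch simply weights a per-epoch loudness measure by the number of impulse-like excitations produced per unit time:

```python
def eta_p(eta_values, epoch_times, duration):
    """Illustrative sketch only (assumed form, not the thesis's exact
    definition): average the per-epoch loudness measure eta and weight it
    by the epoch rate, i.e. the number of impulse-like excitations
    produced per unit time.
    """
    if not eta_values or duration <= 0:
        raise ValueError("need at least one epoch and a positive duration")
    mean_eta = sum(eta_values) / len(eta_values)
    epoch_rate = len(epoch_times) / duration   # excitations per second
    return mean_eta * epoch_rate
```

Under this form, fast speech (more epochs per second) yields a larger ηp than slow speech with the same per-epoch loudness, which matches the direction of the subjective results reported above.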
In order to understand the significance of excitation source variations to the perception of speaking rate, the observed excitation source variations were incorporated into the uniform duration modification method. Perceptual evaluation results showed that speech signals synthesized by the proposed methods have higher scores than those from the epoch-based duration modification method.
6.2 Major contributions of the work
The central contribution of the research work reported in this thesis is the analysis of speech at different speaking rates using excitation source features. The excitation source features used are the instantaneous fundamental frequency, the strength of excitation at the epoch, and a measure of perceived loudness. The major contributions of this thesis are:
• Studied the significance of speaking rate in human speech communication.
• Analyzed the variations of the excitation source features with speaking rate.
• Analyzed the effect of speaking rate on perceived loudness, and proposed a loudness measure which can capture the loudness variations in speech at different speaking rates.
• Analyzed the duration variations of speech segments with respect to the duration variation of the utterance.
• Proposed a nonuniform duration modification method which incorporates the excitation source variations to synthesize speech at different speaking rates.
6.3 Directions for future work
• The research work in this thesis studied the variation of three excitation source features with speaking rate. Similar studies can be conducted on other features related to the excitation source, such as the normalized error, the open quotient ratio, the closed quotient ratio, and voice-quality-related features.
• When humans modify their speaking rate, the changes in the features occur in a nonuniform fashion. Identifying the regions in which significant changes in the excitation source occur is an interesting problem.
• Variations of other excitation source features can be incorporated in the synthesis of speech at different speaking rates.
• The variations in features related to the vocal tract system and to the excitation source can be combined to synthesize speech at different speaking rates.
List of Publications
Journals
1. Sri Harish Reddy M, Guruprasad S, and B. Yegnanarayana, “Analysis of speech at different speaking rates,” to be submitted to Computer Speech & Language.
Conferences
1. Sri Harish Reddy M and B. Yegnanarayana, “Incorporation of Excitation Source and Duration Variations in Speech Synthesized at Different Speaking Rates,” accepted in Proc. Speech Prosody 2010, Illinois, USA.
2. Sri Harish Reddy M, Sudheer Kumar K, Guruprasad S, and B. Yegnanarayana, “Subsegmental Features for Analysis of Speech at Different Speaking Rates,” in Proc. ICON 2009, Hyderabad, India.