ANALYSIS OF SPEECH AT DIFFERENT SPEAKING RATES USING EXCITATION SOURCE INFORMATION
by
SRI HARISH REDDY MALLIDI
200431008
Master of Science (by Research)
in
Electronics and Communication Engineering
Speech and Vision Lab.
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad, India
To my parents, friends and guide
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Analysis of Speech at Different
Speaking Rates Using Excitation Source Information” by Sri Harish Reddy Mallidi
(200431008), has been carried out under my supervision and is not submitted elsewhere
for a degree.
Date Adviser: Prof. B. Yegnanarayana
Acknowledgements
I would like to express my deepest respect and most sincere gratitude to Prof. B. Yegnanarayana for his constant guidance and encouragement at all stages of my work. I am fortunate to have had numerous technical discussions with him, from which I have benefited enormously. I thank him for the excellent research environment he has created for all of us to learn.
I thank my thesis committee members, Prof. P. R. K. Rao and Dr. Sudhir Madhav Rao, for sparing their valuable time to evaluate the progress of my research work. I am thankful to Dr. Suryakanth, Mr. Kishore Prahallad and Dr. Rajendran for their immense support and help throughout my research work. I am thankful to them for all the invaluable advice on both technical and nontechnical matters. I thank my senior laboratory members for all the cooperation, understanding and help I received from them. I will forever remember the wonderful time I had with my friends.
Needless to mention the love and moral support of my family. This work would not
have been possible but for their support. Finally, I would like to dedicate this thesis to my
family, and my guide, Prof. B. Yegnanarayana.
Sri Harish Reddy Mallidi
Abstract
When humans modify speaking rate, they do not perform a simple expansion or compression of the speech signal. In order to maintain the intelligibility and naturalness of the speech, they modify some of the characteristics of the speech production mechanism in a complex way. This causes the acoustic features extracted from the speech signal to change in a complex way. These changes affect the performance of speech systems such as speech recognition and speaker recognition. Most of the studies on the effect of speaking rate on acoustic features focus on features at the segmental and suprasegmental levels. The present work focuses on analysis of the effects of speaking rate on features at the subsegmental level. Three subsegmental features, namely, instantaneous fundamental frequency, strength of excitation at epoch and perceived loudness, are chosen, and their variation with speaking rate is studied.
It was observed that the instantaneous fundamental frequency increases with an increase in speaking rate, but when the speaking rate is decreased, the change in the instantaneous fundamental frequency is speaker-specific. Similar observations were made for the strength of excitation at epoch: the strength of excitation decreases with an increase in speaking rate, and its change is speaker-specific when the speaking rate is decreased. The effect of speaking rate on the perception of loudness is also studied through perceptual loudness tests. It was observed that fast speech was perceived as louder than normal speech for the majority of speakers, whereas the difference between the perceived loudness of normal and slow speech is speaker-specific. It was also observed that speaking rate does not have a significant effect on an objective loudness measure. A modified measure of loudness for speech at different speaking rates is proposed, and its variations correlate with the results of the perceptual loudness tests.
The variations of subsegmental level features with speaking rate are incorporated in a non-uniform duration modification method. Subjective studies on the synthesized speech showed that incorporating the subsegmental variations improved the quality of speech at different speaking rates.
Keywords: Speaking rate, spontaneous speech, segmental features, instantaneous funda-
mental frequency, strength of excitation, perceived loudness and duration modification.
Contents

Abstract
List of Tables
List of Figures
1 Introduction
  1.1 Sources of speaking rate variation
  1.2 Spontaneous speech - a case study
    1.2.1 Speaking rate variation in spontaneous speech
    1.2.2 Effect on the performance of speech systems
  1.3 Objective and scope of the work
  1.4 Organization of the thesis
2 Speaking rate - A Review
  2.1 Studies based on articulatory dynamics
  2.2 Studies based on acoustic features
    2.2.1 Suprasegmental level
    2.2.2 Segmental level
3 Effect of speaking rate on excitation source features
  3.1 Extraction of excitation source features
  3.2 Speech material
  3.3 Variation of instantaneous F0
  3.4 Variation of strength of excitation
  3.5 Summary
4 Effect of speaking rate on loudness
  4.1 Perceptual evaluation of loudness of speech at different speaking rates
  4.2 Extraction of loudness measure from the Hilbert envelope of the LP residual
  4.3 Variation of loudness measure
  4.4 Proposed measure of perceptual loudness
  4.5 Summary
5 Incorporation of excitation source variations in synthesis of speech at different speaking rates
  5.1 Variation of durations of voiced, unvoiced and silence
    5.1.1 Identification of voiced, unvoiced and silence regions
    5.1.2 Duration variations
  5.2 Synthesis of speech at different speaking rates
  5.3 Evaluation of synthetic speech at different speaking rates
  5.4 Conclusions
6 Summary and Conclusions
  6.1 Summary of the work
  6.2 Major contributions of the work
  6.3 Directions for future work
List of Publications
References
List of Tables

5.1 Percentage deviation in the durations of speech segments for different speaking rates
5.2 Slopes (m) of the straight lines that fit the clusters shown in Fig. 5.3, and root mean squared error (RMSE) of the fits, to illustrate the variation in duration modification of different units
5.3 Mean opinion scores and AB ranking test (in %)
List of Figures

1.1 Sources of speaking rate variation [13]
1.2 Mean subtracted histograms of syllable rate for spontaneous speech
1.3 Mean subtracted histograms of syllable rate for read speech
3.1 (a) Segment of a speech signal from the VOQUAL'03 database [57], (b) zero-frequency filtered signal, (c) differenced EGG signal with epoch locations marked by arrows, (d) strength of excitation calculated from the zero-frequency filtered signal, and (e) contour of instantaneous F0
3.2 Distribution of the measure of speaking rate for fast, normal, and slow utterances
3.3 Distributions of instantaneous F0 for 4 male speakers. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
3.4 Variation of instantaneous F0 with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
3.5 Cumulative probability density function of the difference between mean values of instantaneous F0 of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
3.6 Distributions of strength of excitation (ε) for 4 male speakers. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
3.7 Variation of strength of excitation (ε) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
3.8 Cumulative probability density function of the difference between mean values of strength of excitation at epoch (ε) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
4.1 Evaluation of the effect of speaking rate on the perception of loudness. Results of comparison of (a) fast speech and normal speech, and (b) normal speech and slow speech
4.2 (a) Segment of a speech signal from the VOQUAL'03 speech database [57], (b) 12th order LP residual, (c) Hilbert envelope of the LP residual, and (d) contour of η extracted from the Hilbert envelope
4.3 Overlapping segments of the Hilbert envelope of the LP residual in the vicinity of epoch locations for (a) soft, (b) normal, and (c) loud utterances [63]
4.4 Distributions of loudness measure (η) and proposed measure (ηp) of perceptual loudness for two male speakers, shown in (a), (c) and (b), (d), respectively. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively
4.5 Variation of loudness measure (η) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the intra-class and inter-class comparisons, respectively, for the slow vs normal case
4.6 Variation of proposed measure (ηp) of perceptual loudness with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case
4.7 Cumulative probability density function of the difference between mean values of the proposed loudness measure (ηp) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances
5.1 Illustration of voiced-nonvoiced discrimination using the zero frequency filtered signal. (a) Segment of a speech signal, (b) zero frequency filtered signal, (c) energy of the zero frequency filtered signal, and (d) binary voiced-nonvoiced signal
5.2 Illustration of unvoiced-silence discrimination using the output of the resonator H(z) (equation 5.3). (a) Segment of a speech signal, (b) output of the resonator H(z), (c) energy of the filtered signal, and (d) binary unvoiced-silence signal
5.3 Scatter plots of duration modification factors of a region vs the corresponding utterance. (a), (b) and (c) show the scatter plots of αv vs αS, αu vs αS, and αp vs αS, respectively, in normal to fast conversion. Similarly, (d), (e) and (f) show the scatter plots in normal to slow conversion. In each plot, the straight line that fits the cluster is shown by a solid line
Chapter 1
Introduction
Information conveyed through a speech signal can be classified as linguistic, paralinguistic, and extralinguistic [1]. Linguistic information is related to the language associated with the speech. Paralinguistic information in the speech signal is nonlinguistic and nonverbal. It refers to information about the speaker's current attitudinal and emotional state. It is similar to linguistic information in the sense that ‘both the speaker and the listener are aware of the intended message (communicative behavior)’. It is dissimilar to linguistic information in the sense that ‘it is not necessarily obvious to all human perceivers on some universal basis. It is particular to the culture of the speaker, and its conventional interpretation must be learned’. Extralinguistic information refers to information present in the speech signal, ‘of which the speaker is not aware but the listener perceives it (informative behavior)’ [1]. The identity of the speaker and habitual factors such as voice quality are some of the ingredients that constitute extralinguistic information. Advances in the fields of automatic speech recognition and speech synthesis have helped in estimating and reproducing the linguistic information present in natural speech. However, extraction of the paralinguistic and extralinguistic information from the acoustic speech signal is a challenging task.
Speaking rate is one such paralinguistic feature, which has an important role in human speech communication. It is a quantity proportional to the speed at which the speaker produces speech. Speaking rate is a characteristic of both the speaker and the language, and can be changed consciously or subconsciously by the speaker. Deviation from the normal speaking rate can easily be perceived by listeners. The objective of this work is to analyze the variations of acoustic features with changes in speaking rate, and to incorporate these variations in the synthesis of speech at different speaking rates.
In order to emphasize the significance of speaking rate in human-to-human speech communication, situations that cause the speaking rate to change are briefly described in section 1.1. In section 1.2, a study of the variations of speaking rate in human-to-human conversation and a review of the effect of speaking rate variations on speech systems are presented.
1.1 Sources of speaking rate variation
Several studies have shown that the emotional state of the speaker influences the speaking rate [2, 3]. It was observed that emotions like anger, fear, rage and happiness are associated with high speaking rates, and emotions like boredom, sadness, sorrow and grief are associated with low speaking rates. Speaking rate was observed to correlate strongly with the activeness of the speaker, i.e., the more active the speaker, the higher his/her speaking rate [4, 5]. Similar to emotion, stress was observed to influence the speaking rate [6]. It was observed that high cognitive workload leads to a faster speaking rate [6]. Read speech can be seen as a speech mode in which the ideas to be expressed are completely prepared and formulated before speech production starts. In contrast, in spontaneous speech the formulation process and speech production are simultaneous. It was observed that speaking rate is higher in read speech than in spontaneous speech [7]. The difference in the speaking rates of prose and poetry was observed in [8]. Readings of prose were observed to have higher speaking rates than readings of poetry. The effect of speaking rate on the perception of the competence and benevolence of the speaker was analyzed in [9]. It was observed that nonnormal (fast or slow) speaking rates are associated with competence and a normal speaking rate is associated with benevolence [5].
Frequently speakers unconsciously adapt their speaking rate to their dialogue partner.
It was observed that the information processing ability of the dialogue partner influences
the speaking rate [10]. For example, in a conversation involving infants, elderly people and persons with hearing difficulties, the speakers tend to reduce the speaking rate.

[Fig. 1.1: Sources of speaking rate variation [13]. The diagram relates speaking rate to age, gender, attitude/competence, cultural background, discourse structure, word frequency, emotion, stress, spoken text type, habitual speech rate, language proficiency, dialogue partner, speech planning, information structure, and speech and hearing impairments.]
Speech in noisy conditions is less intelligible than in normal conditions. It was observed that listeners prefer slow speech to fast speech in noisy conditions [11]. The language proficiency of the listener was also observed to affect the perception of speaking rate: learners of a foreign language feel that native speakers use an exceptionally rapid speaking rate [12]. Along with the above mentioned causes, the speaking rate was observed to be influenced by various other factors such as age, gender, and speech and hearing impairments. A comprehensive depiction of the sources of speaking rate variation is shown in Fig. 1.1. The examples presented in this section illustrate the great range of situations and conditions in which a change in speaking rate can occur.
1.2 Spontaneous speech - a case study
Speech produced during human-to-human conversation is rich in paralinguistic information. Even a small deviation from an expected pronunciation is an information-bearing element. Most speech systems assume that the users articulate clear, grammatically correct utterances with orthodox pronunciation. When such a system is exposed to human-to-human speech (spontaneous speech in the present context), the performance degrades greatly. Also, development of systems that can handle human-to-human speech is desirable because of the availability of large amounts of spontaneous speech (in forms such as lecture videos, telephone conversations and news stories). It was observed that speaking rate variation is an important factor responsible for the degradation in the performance of automatic speech recognition (ASR) systems [14]. In this section, the variation of speaking rate in spontaneous speech is studied, and a review of studies on the effect of speaking rate on the performance of speech systems (automatic speech recognition systems in particular) is presented.
1.2.1 Speaking rate variation in spontaneous speech
In order to study the variations of speaking rate in spontaneous speech, speech extracted from classroom lectures was chosen. The speech signals were collected from four speakers using a close-speaking microphone and were sampled at 16000 Hz. The total duration of the speech signals is six hours, with all speakers contributing equally.
General measurement methods
There have been two major approaches to measuring speaking rate, each with its advantages and limitations. The first uses a discrete categorization, such as fast, normal and slow, to describe the speaking rate [15]. Such perceptually chosen classes have been used in applications such as acoustic model selection [16, 17] and HMM normalization [18] in ASR. Even though it matches human intuition, the boundaries between the three categories are fuzzy. Most of the time, human knowledge is required to set the boundaries, and hence it is difficult to devise a completely automated engineering solution. In the second approach, speaking rate is measured quantitatively by counting the number of phonetic elements per second. Words, syllables [16], stressed syllables, and phonemes [19] are all possible candidates. Syllables are a popular choice because of the robustness of syllable boundary estimation and the close resemblance to the
[Fig. 1.2: Mean subtracted histograms of syllable rate for spontaneous speech. Panels (a)-(d) show normalized frequency vs τ − µτ for the four speakers, with σ = 0.73, 0.66, 0.82 and 0.70, respectively.]
speech production characteristics [16, 20, 21]. In the present study, the number of syllables per second was chosen as the quantitative metric for measuring speaking rate. Identification of syllable boundaries was performed using the syllable detection algorithm proposed by Mermelstein [22].
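The syllable-rate computation described above can be sketched in code. Mermelstein's algorithm recursively splits a convex hull fitted over the loudness contour; the function below is a deliberately simplified stand-in that counts short-time energy peaks as syllable nuclei. The function name, frame sizes and threshold are illustrative assumptions, not the implementation used in the thesis.

```python
import numpy as np

def syllable_rate(signal, fs, frame_ms=20, hop_ms=10, rel_thresh=0.3):
    """Crude syllable-rate estimate (syllables/s): count peaks of the
    short-time energy contour that rise above a relative threshold.
    A simplification of convex-hull based syllable nucleus detection."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # short-time energy contour
    energy = np.array([np.sum(signal[i:i + frame] ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    if energy.max() == 0:
        return 0.0
    energy = energy / energy.max()
    # count local maxima above the threshold as syllable nuclei
    peaks = 0
    for i in range(1, len(energy) - 1):
        if (energy[i] > rel_thresh
                and energy[i] >= energy[i - 1]
                and energy[i] > energy[i + 1]):
            peaks += 1
    return peaks / (len(signal) / fs)
```

On a clean signal with well separated energy bursts this count approximates the number of syllable nuclei; on real lecture speech a loudness contour and hull-splitting step, as in [22], would be needed.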
Speaking rate variation
The variations of syllable rate (τ) were analyzed to illustrate the variations of speaking rate in spontaneous speech. Histograms of syllable rate were computed for two types of speech, namely, (i) spontaneous speech and (ii) read speech. Fig. 1.2 shows the histograms
[Fig. 1.3: Mean subtracted histograms of syllable rate for read speech. Panels (a)-(d) show normalized frequency vs τ − µτ for the four speakers, with σ = 0.51, 0.21, 0.44 and 0.33, respectively.]
corresponding to spontaneous speech of four different speakers. Each plot in Fig. 1.2 corresponds to one speaker. Similarly, Fig. 1.3 shows the histograms corresponding to read speech of the four speakers. All the histograms are mean subtracted in order to emphasize the variations rather than the absolute value of the syllable rate. From Figs. 1.2 and 1.3, it can be observed that the histograms corresponding to spontaneous speech are broader than those corresponding to read speech. This can also be seen from the values of standard deviation (σ) of the histograms: the values of σ for spontaneous speech are greater than those for read speech for all the speakers. From these observations it can be inferred that the variations of speaking rate are higher in spontaneous speech than in read speech.
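The comparison above can be reproduced in a few lines. The sketch below mean-subtracts per-utterance syllable rates and compares the standard deviations of the two speech types; the rate values are invented for illustration and are not the thesis's data.

```python
import numpy as np

def spread_of_syllable_rate(rates):
    """Mean-subtract per-utterance syllable rates and return
    (centered rates, standard deviation). The standard deviation
    quantifies how much the speaking rate varies around the
    speaker's own mean, independent of its absolute value."""
    rates = np.asarray(rates, dtype=float)
    centered = rates - rates.mean()
    return centered, float(rates.std())

# invented per-utterance syllable rates (syllables/s), for illustration only
spontaneous = [3.1, 4.8, 2.5, 5.2, 3.9, 4.4]
read = [4.0, 4.2, 3.9, 4.1, 4.0, 4.3]
_, sigma_spontaneous = spread_of_syllable_rate(spontaneous)
_, sigma_read = spread_of_syllable_rate(read)
# spontaneous speech yields the broader (higher-sigma) distribution
```

Plotting histograms of the centered rates, as in Figs. 1.2 and 1.3, makes the difference in spread visible directly.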
1.2.2 Effect on the performance of speech systems
In most speech systems, the models developed during the training phase are built using speech data which is well articulated, at an almost constant speaking rate, without any mispronunciations. Examples of such databases, which are widely used in building speech systems like ASR and speaker recognition, are TIMIT [23], CMU-ARCTIC [24], etc. When such a system is exposed to spontaneous speech, the performance degrades significantly [14]. This section reviews the studies on the effect of speaking rate on the performance of speech systems. Siegler et al. [21] studied the relation between the word error rate of an ASR system and speaking rate variations. Three different speaking rate metrics were used:
(i) Word rate, defined as number of words per minute.
(ii) Phone rate, defined as number of phones per second.
(iii) Phone rate percentile, defined as the cumulative distribution function of the observed phone duration [21].
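Assuming a time-aligned transcription is available (word count, phone count, per-phone durations, and a reference set of phone durations from training data), the three metrics can be sketched as follows. The function names and the mean-percentile interpretation of metric (iii) are illustrative assumptions, not taken from [21].

```python
import numpy as np

def word_rate(n_words, duration_s):
    """Metric (i): words per minute."""
    return 60.0 * n_words / duration_s

def phone_rate(n_phones, duration_s):
    """Metric (ii): phones per second."""
    return n_phones / duration_s

def phone_rate_percentile(phone_durations, reference_durations):
    """Metric (iii), one plausible reading: mean percentile of the
    observed phone durations under the reference duration CDF
    (longer phones -> higher percentile -> slower speech)."""
    ref = np.sort(np.asarray(reference_durations, dtype=float))
    pct = [np.searchsorted(ref, d, side='right') / len(ref)
           for d in phone_durations]
    return float(np.mean(pct))
```

For example, an utterance with 150 words in one minute gives a word rate of 150 words/min, and 30 phones in 10 s gives a phone rate of 3 phones/s.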
The database used was the Wall Street Journal corpus [25]. It was observed that the word error rate increased significantly when the word rate deviated from the mean word rate by more than two standard deviations. Similarly, it was observed that when the phone rate deviated from the mean phone rate by more than one standard deviation, the word error rate increased significantly. Word recognition error of a large vocabulary continuous speech recognition (LVCSR) system for fast speakers was analyzed in [26]. It was observed that the error rate increased by a factor of two to three for the speaker with the highest speaking rate. The possible causes are inherent spectral differences, phone omission and duration reduction [26]. These studies showed that changes in speaking rate impact the performance of speech systems. Therefore, careful adaptation to changes in speaking rate is essential for the performance of speech systems like speech recognition and speaker recognition. Also, adapting the speaking rate to the listener's choice or convenience improves the naturalness of speech synthesis systems.
1.3 Objective and scope of the work
The objective of this work is to analyze the variations of acoustic features with speaking rate. Acoustic features related to the excitation source are used. In order to analyze the variations, speech utterances are collected from various speakers at different speaking rates, namely, fast, normal and slow. The variations of the acoustic features of speech at nonnormal (fast or slow) speaking rates are analyzed relative to those of speech at the normal speaking rate. The effect of speaking rate on the perception of loudness is analyzed by conducting a perceptual loudness test and by analyzing the variations of an acoustic loudness measure. The observed variations are incorporated in a non-uniform duration modification method to synthesize speech at different speaking rates.
1.4 Organization of the thesis
The contents of this thesis are organized as follows:
In chapter 2, we review the studies on the effect of speaking rate variations on acoustic
features.
In chapter 3, we analyze the variations of two features related to excitation source at
different speaking rates. We observe and quantify the difference between the distributions
of the two features.
In chapter 4, we analyze the effect of speaking rate on the perception of loudness. A
perceptual loudness test on speech at different speaking rates was conducted. The results
from the perceptual tests are compared with variations of a loudness measure extracted
from speech signals.
In chapter 5, a method to synthesize speech at different speaking rates is proposed. Variations in the acoustic features and in the durations of some of the sound units are incorporated in a duration modification method.
In chapter 6, we summarize the contributions of the present work, and highlight some
issues arising out of the study.
Chapter 2
Speaking rate - A Review
Depending on the type of data used for analysis, studies on the effects of speaking rate on the speech production mechanism can be broadly classified into two categories: (1) studies based on articulatory dynamics, and (2) studies based on features derived from the acoustic speech signal.
2.1 Studies based on articulatory dynamics
In order to produce a particular type of speech sound, the articulators must follow a particular sequence of movements. Changing the speaking rate involves producing an acoustic output with the same linguistic information in a shorter/longer duration. When the speaking rate is changed, the articulators follow an almost similar path, but in a shorter/longer duration. In order to accommodate both the durational changes and the linguistic information, as mentioned in [27], the speaker ‘may vary the spatial magnitude of articulatory movements [28, 29, 30, 31], or may adjust the speed of transition between successive targets [32, 33], or may modify the overlap between successive articulatory gestures by modifying their phrasing [34, 35, 36]’. ‘These changes may not be mutually exclusive, and can interact with each other’ [37, 33, 38]. The effect of speaking rate variations on the dynamics of the articulators of the vocal tract has been studied extensively in the literature [39, 40, 41, 42]. The effect of speaking rate on tongue movements was studied
using electromyography (EMG). It was observed that the EMG activity associated with tongue body movements during vowel production decreased during fast speech, while the activity associated with the production of both labial and alveolar stop consonants increased with an increase in the speaking rate [40, 43]. The decrease in activity implies either a decrease in the articulatory displacement, or a decrease in the speed of the articulatory movements, or both [40, 44]. The dynamics of six articulators (jaw, both lips, tongue tip, blade, and dorsum) were studied in [41] for different speaking rates. The articulatory data was acquired using an electromagnetic midsagittal articulograph (EMA). Close examination of the EMA data showed that the shapes of the articulatory trajectories became more complex in the case of slow speech [41]. This implies that articulatory trajectories are partially influenced by the speaking rate. It was shown that the tongue diverged less from the ‘centroid’ or ‘rest’ position at fast speaking rates than at normal speaking rates [42]. Variation in the dynamics of the articulators with speaking rate was observed to depend on the type of articulator [42]. It was observed that the variation in tongue movement was the greatest of all the articulatory movements [42].

The above mentioned studies showed that speaking rate affects the dynamics of the articulators. However, in practice, acquisition of data related to articulatory movements is difficult. Hence, analysis of speaking rate using the acoustic speech signal is preferred.
2.2 Studies based on acoustic features
2.2.1 Suprasegmental level
Suprasegmental features used for studying the effect of speaking rate are related to discourse prosody, pauses, and pitch accents. The features related to discourse prosody are the sizes (measured in terms of number of syllables) and durations of syllables (SYLL), prosodic words (PW), minor phrases (MIP), intonation phrases (IP), prosodic groups (PG), turns (TN), and discourse (DI) [45]. Studies showed that the durations of PW were significantly different for the three speaking rates (slow, normal, and fast), while the durations of the other discourse prosody features changed very little. It was observed that the sizes of the IP and MIP were affected by speaking rate, whereas the sizes of PW changed very little with the speaking rate [45]. It was shown that speaking rate affected the characteristics of pauses significantly [46, 47]. These studies showed that the number of pauses increased when the speaking rate decreased. It was also observed that changes in the durations of the pauses were speaker-dependent. It was shown that there was no significant change in the average duration of pauses among different speakers [47]. The effect of speaking rate on the number of prosodic breaks and on the characteristics of pitch accents (number and type) was studied in [46]. It was observed that fast speech had fewer prosodic breaks than normal and slow speech. Observation of the characteristics of pitch accents showed that the number of pitch accents was low for fast speech and high for slow speech. Also, fast speech showed a more monotonal than bitonal characteristic, whereas slow speech showed a more bitonal characteristic. This may be due to the simplicity of monotonal speech compared to bitonal speech.
2.2.2 Segmental level
Segmental features observed for studying the effects of speaking rate include voice onset
time (VOT) [48, 49, 50, 51, 52], durations of different sound units [53, 34], and spectral
features [39]. Studies showed a systematic increase/decrease in syllable durations when
the speaking rate was decreased/increased [53]. Changes in the syllable duration are
due to changes in the duration of the vowel and that of the VOT of the syllable. This
observation is consistent among many studies. Some studies observed that the amount
of increase/decrease in VOT was speaker-specific [52]. Variation of durations of vowels,
consonants, and transition regions was studied in [39]. Of the three types, the durations
of vowels showed the most change. It was observed that the variation of duration of
transition regions with the speaking rate in a consonant-vowel (CV) context depended
on the type of the consonant, and that it was directly proportional to the variation in the
duration of the syllable [49]. Variation of transition durations of two syllables/ba/ and
/wa/ was analyzed with speaking rate changes [49]. Change in transition duration of the
syllable/wa/ was greater than that in the syllable/ba/. Increase in the syllable duration
of /ba/ was almost entirely due to increase in the post-transition region, whereas for/wa/
the increase in the syllable duration was also due to an increase in the transition duration
[49]. Variation of durations of vowels, pauses, consonants, and transition regions for
Hindi language was studied in [54]. It was observed that the durations of vowels and
pauses varied significantly, whereas the durations of consonants and transition regions
varied very little. Formant frequencies have also been observed for studying the effects
of speaking rate. Studies showed that formant frequencies in stable regions were not
affected much by the speaking rate. But the formant frequencies in the transition regions
(i.e., onset frequency of the formant transition) varied significantly with speaking rate
[39]. It was shown that the F2 onset frequency (in CV context) was closer to the vowel
midpoint frequency in fast speech than in slow speech [39]. The rate of change of F2,
computed as (F2mid − F2onset)/T, where T is the duration of the transition region, was
observed to change significantly with speaking rate, although the amount of change was
observed to be speaker-dependent.
Speaking rate changes the acoustic features of production nonuniformly at various
levels. These changes occur at the subsegmental level (less than a pitch period) as well,
which is mostly due to the excitation of the vocal tract system. For producing natural-sounding
synthetic speech at different speaking rates, the overall duration and the segmental
level variations need to be captured and incorporated during synthesis [55, 54]. Like-
wise, variations of the fundamental frequency and changes in the formants of sound units
also need to be incorporated during the synthesis. In addition, changes in the excitation
characteristics at the subsegmental level may also influence the quality of the synthesized
speech.
Chapter 3
Effect of speaking rate on excitation
source features
In this chapter, we present a study of the effect of speaking rate on features related to the
excitation source. Two acoustic features related to the excitation source are estimated from
the speech signal. This is achieved by removing the effect of the vocal tract system on the
speech signal. Intra-speaker variations in the excitation source features, caused by changes
in speaking rate, are analyzed. In Section 3.1, we discuss the method of extracting the
excitation source features from the speech signal. In Section 3.2, the speech utterances
recorded at various speaking rates are described. Sections 3.3 and 3.4 discuss the variations
of the two excitation source features.
3.1 Extraction of excitation source features
Features of excitation source can be extracted from the speech signal by removing the
influence of the vocal tract system on the acoustic signal. Three features of excitation
source are used in this study. Two of them are obtained from the zero-frequency filtering
of the speech signal [56]. The third feature is derived from the Hilbert envelope of the lin-
ear prediction (LP) residual. The zero-frequency filtered signal and the Hilbert envelope
of the LP residual are obtained by processing the speech signal to reduce the influence of
Fig. 3.1: (a) Segment of a speech signal from the VOQUAL'03 database [57], (b) zero-frequency filtered signal, (c) differenced EGG signal and epoch locations marked by arrows, (d) strength of excitation calculated from the zero-frequency filtered signal, and (e) contour of instantaneous F0.
the vocal tract system. Extraction of the features related to zero-frequency filtering is
explained in this section, and extraction of the feature related to the Hilbert envelope of
the LP residual is discussed in Sec. 4.2.
During the production of voiced speech, the excitation to the vocal tract system can be
approximated by a sequence of impulses of varying amplitudes. The effect of discontinu-
ity due to the impulse-like excitation is reflected across all the frequencies, including 0 Hz
or the zero frequency [56, 58]. The effect of discontinuity due to the impulse-like exci-
tation is clearly visible in the output of narrowband filtering (at any frequency) of speech
signal. The advantage of choosing a zero-frequency filter is that the output is not affected
by the characteristics of the vocal tract system which has resonances at much higher fre-
quencies. Therefore the zero-frequency filtering helps in emphasizing the characteristics
of the excitation [56]. A zero-frequency resonator is an infinite impulse response filter
with a pair of poles located on the unit circle. A cascade of two such resonators is used
to provide sharper cut-off to reduce the effect of resonances of the vocal tract system.
The following steps are involved in processing speech signal to derive the zero-frequency
filtered signal [56, 58].
1. The speech signal s[n] is differenced to remove any slowly varying component
introduced by the recording device:

x[n] = s[n] − s[n − 1].   (3.1)
2. The differenced speech signal x[n] is passed through a cascade of two ideal zero-frequency
(digital) resonators. That is,

y0[n] = −∑_{k=1}^{4} a_k y0[n − k] + x[n],   (3.2)

where a1 = −4, a2 = 6, a3 = −4, and a4 = 1. The resulting signal y0[n] grows
approximately as a polynomial function of time.
3. The average pitch period is computed using the autocorrelation function of 30 ms
segments of x[n].
4. The trend in y0[n] is removed by subtracting the local mean, computed over the
average pitch period, at each sample. The resulting signal

y[n] = y0[n] − (1/(2N + 1)) ∑_{m=−N}^{N} y0[n + m]   (3.3)

is the zero-frequency filtered signal. Here 2N + 1 is the number of samples in the window
used for trend removal. The choice of the window size is not critical as long as it is in the
range of one to two pitch periods. Fig. 3.1(b) shows the filtered signal
of the speech segment shown in Fig. 3.1(a). It was shown that the instants of positive-
to-negative zero crossings (PNZCs) correspond to the instants of significant excitation in
voiced speech, called epochs [56]. The locations of the PNZCs of the filtered signal are shown
in Fig. 3.1(c). There is close agreement between the locations of the strong positive peaks
of the differenced electroglottograph (DEGG) signal and the instants of PNZCs derived
from the filtered signal. The instantaneous fundamental frequency (which is referred to
as instantaneous F0 in the present work) at each epoch is derived by computing the
reciprocal of the time interval between the current epoch and the next epoch [59]. Fig. 3.1(e)
shows the instantaneous F0. Since the effect due to an impulse is spread uniformly across
the frequency range, the strength of impulses can be derived from a narrow band around
the zero frequency. Hence the information about the strength of excitation can also be
derived from the zero-frequency filtered signal. It was observed that the slope of the
zero-frequency filtered signal around PNZCs gives a measure of the strength of excitation
[60]. The slope is measured by computing the difference between the negative sample
value and the positive sample value on either side of the epoch, and is denoted as the
strength of excitation (ǫ) [60]. Fig. 3.1(d) shows the plot of ǫ, derived from the filtered
signal in Fig. 3.1(b). The plot of ǫ shows a trend similar to the DEGG signal (Fig. 3.1(c)).
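The processing chain of this section can be sketched in Python as follows. This is a minimal NumPy illustration, not the exact implementation of [56, 58]: the fixed 8 ms average pitch period stands in for the autocorrelation-based estimate of step 3, and the trend removal is applied a few times (the text states a single subtraction; repeating it fully suppresses the polynomial growth of y0[n] in practice). All function and parameter names are mine.

```python
import numpy as np

def zero_frequency_filter(s, fs, avg_pitch_s=0.008, passes=3):
    # Step 1: difference the signal to remove slowly varying components.
    x = np.diff(s, prepend=s[:1]).astype(float)
    # Step 2: cascade of two zero-frequency resonators, i.e.
    # y0[n] = 4*y0[n-1] - 6*y0[n-2] + 4*y0[n-3] - y0[n-4] + x[n],
    # which is Eq. (3.2) with a1 = -4, a2 = 6, a3 = -4, a4 = 1.
    y0 = np.zeros_like(x)
    for n in range(len(x)):
        y0[n] = x[n]
        for k, a in ((1, -4.0), (2, 6.0), (3, -4.0), (4, 1.0)):
            if n >= k:
                y0[n] -= a * y0[n - k]
    # Steps 3-4: subtract the local mean over a window of about one
    # pitch period (2N+1 samples) to remove the polynomial trend.
    N = int(round(avg_pitch_s * fs))
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    y = y0
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y

def epochs_f0_strength(y, fs):
    # Epochs: positive-to-negative zero crossings (PNZCs) of y.
    pnzc = np.where((y[:-1] > 0) & (y[1:] <= 0))[0]
    # Instantaneous F0: reciprocal of the inter-epoch interval.
    f0 = fs / np.diff(pnzc)
    # Strength of excitation: slope of y across each epoch, measured as
    # the positive sample minus the following non-positive sample.
    strength = y[pnzc] - y[pnzc + 1]
    return pnzc, f0, strength
```

For a synthetic impulse train the filtered signal is close to a sinusoid at the fundamental, so the PNZCs fall roughly once per pitch period and the strengths are positive by construction.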
3.2 Speech material
The speech database used in the present study consists of 10 English sentences which are
chosen from the TIMIT dataset [23]. Each sentence was uttered by 25 male speakers at three
different speaking rates, namely, fast, normal, and slow. The speakers were undergraduate
and graduate students, aged between 20 and 30 years. All the speakers spoke Indian
English, and the native language of each speaker was one among Telugu, Hindi, Kannada,
and Tamil. The speakers were guided to listen to samples of fast and slow utterances
before they produced utterances at different speaking rates. The objective was to help the
speakers to produce speech at different speaking rates while maintaining naturalness of the
speech. Without the help of a reference, some speakers were unable to produce speech
at different speaking rates. For example, a speaker with a naturally slow speaking rate
produced fast speech that was similar to the normal speech of other speakers. Similar behavior was
observed with speakers who had naturally fast speaking rates. A total of 750 utterances
(250 utterances for each speaking rate) were collected. The speech signals were sampled
at 8 kHz.
Analysis of the recorded utterances was done to determine whether the speakers were
able to produce speech at the three different speaking rates. Syllable rate, which is defined
Fig. 3.2: Distribution of the measure of speaking rate for fast, normal, and slow utterances.
as the number of syllables per second, is used to measure the speaking rate. The syllable
rate of an utterance is estimated as the ratio of the number of syllables in the utterance
to its duration. Since the text corresponding to the utterances is available, the number of
syllables in an utterance is obtained from the corresponding text.
Fig. 3.2 shows the distributions of syllable rates of fast, normal, and slow utterances. The
distributions show a significant change between the three speaking rates.
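The measure itself is a simple ratio; as a concrete illustration (the syllable count and durations below are made-up numbers, not values from the database):

```python
def syllable_rate(n_syllables, duration_s):
    # Speaking rate = number of syllables / utterance duration,
    # in syllables per second.
    return n_syllables / duration_s

# A hypothetical 12-syllable sentence read at three rates:
fast = syllable_rate(12, 1.5)    # 8.0 syllables/s
normal = syllable_rate(12, 2.4)  # 5.0 syllables/s
slow = syllable_rate(12, 4.0)    # 3.0 syllables/s
```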
3.3 Variation of instantaneous F0
Fig. 3.3 shows the distributions of instantaneous F0 for four male speakers, chosen at
random from the set of 25 speakers. The distributions for these four speakers are examined
to illustrate differences among individuals, indicating the speaker-specific nature of the
variations. It is observed that the distribution of instantaneous F0 does discriminate
between fast, normal, and slow utterances of a speaker, although the amount of
discrimination is speaker-dependent. For the speakers in Figs. 3.3(a) and (b), there
is good discrimination between the distributions of instantaneous F0 for fast, normal, and
slow utterances. The discrimination can be observed from the means of the distributions,
and from their spreads. For the speaker in Fig. 3.3(c) there is very little difference between
Fig. 3.3: Distributions of instantaneous F0 for 4 male speakers. In each case, the solid ('—'), dashed ('- - -'), and dotted ('· · ·') lines correspond to normal, fast, and slow utterances, respectively.
the distributions of fast and normal utterances, but some discrimination between the
distributions of slow and normal utterances. The distributions shown in Fig. 3.3(d) are very
close to each other. Some speaker-specific characteristics can be inferred from the
distributions shown in Fig. 3.3. For a speaker with a naturally fast speaking rate, the
distinction between the instantaneous F0 of his/her fast and normal speech will be small.
Similarly, for a speaker with a naturally slow speaking rate, the instantaneous F0 of slow
and normal speech are similar. Some speakers are able to produce speech at three different
speaking rates while maintaining intelligibility and naturalness. In most cases, speech
uttered at at least one of the nonnormal (fast or slow) speaking rates showed a significant
difference from the speech uttered at the normal and the other nonnormal (slow or fast)
speaking rate.
In order to evaluate the variation of instantaneous F0 with speaking rate for all the
25 speakers in the dataset, the Kullback-Leibler (KL) divergence [61] is used. When two
distributions are described by univariate Gaussian probability density functions, the
(symmetric) KL divergence between the two distributions is given by [62]
dKL(A, B) = (1/2)(σA²/σB² + σB²/σA²) − 1 + (1/2)(µA − µB)²(1/σA² + 1/σB²),   (3.4)
where µA and σA denote the mean and the standard deviation, respectively, of the samples
in set A, while µB and σB denote the corresponding quantities for the samples in set B.
Also computed is µA − µB, the difference of the mean values µA and µB. In
this study, the samples in sets A and B are the values of instantaneous F0. Let us first
consider the case of fast and normal utterances. Consider the values of instantaneous F0
obtained from normal and fast utterances as two separate classes. When the values of
the instantaneous F0 in both A and B are from either normal or fast utterances, then it
is a case of intra-class comparison. Likewise, inter-class comparisons are those where
the values of the instantaneous F0 in A and B are derived from the fast (normal) and
normal (fast) utterances, respectively, of a speaker. Both dKL(A, B) and µA − µB should
be smaller for intra-class comparisons than for inter-class comparisons. The ordered
pair (µA − µB, dKL(A, B)) is used to distinguish between the normal and fast utterances of a
speaker, as described below.
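Eq. (3.4) translates directly into code from the two sample sets; a sketch (function and variable names are mine):

```python
import numpy as np

def d_kl(a, b):
    """Symmetric KL divergence of Eq. (3.4) between two sample sets,
    each modelled as a univariate Gaussian."""
    mu_a, mu_b = np.mean(a), np.mean(b)
    var_a, var_b = np.var(a), np.var(b)  # population variances (sigma^2)
    return (0.5 * (var_a / var_b + var_b / var_a) - 1.0
            + 0.5 * (mu_a - mu_b) ** 2 * (1.0 / var_a + 1.0 / var_b))
```

An intra-class comparison of two samples drawn from the same distribution yields a point near the origin of the (µA − µB, dKL) plane, while a shifted or rescaled set moves the point away from it; the expression is symmetric in A and B, which is why dKL(A, B) = dKL(B, A) below.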
Let N denote the set of values of instantaneous F0 of a given speaker, derived from the
10 utterances collected at the normal speaking rate. Let N1 and N2 denote two distinct
subsets of N, such that the values of instantaneous F0 in each subset are derived from 5
utterances at the normal speaking rate. For the same speaker, let F (S), F1 (S1), and F2
(S2) denote the corresponding sets derived from the utterances at the fast (slow) speaking
rate. For each speaker, the following ordered pairs are computed: (a) (µFi − µNj, dKL(Fi, Nj)),
for i = 1, 2 and j = 1, 2; (b) (µF − µN, dKL(F, N)); (c) (µFi − µFj, dKL(Fi, Fj)) for i = 1, 2,
j = 1, 2, and i ≠ j; and (d) (µNi − µNj, dKL(Ni, Nj)) for i = 1, 2, j = 1, 2, and i ≠ j.
The ordered pairs in (a) and (b) denote the inter-class comparisons within a speaker, while
those in (c) and (d) denote the intra-class comparisons within the speaker. Each ordered
pair can be plotted as a point in a two-dimensional plane. For each speaker 5 points are
computed due to
Fig. 3.4: Variation of the instantaneous F0 with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case.
inter-class comparisons and 2 points are computed due to intra-class comparisons (since
dKL(A, B) = dKL(B, A)). Thus, for the 25 speakers, 125 points are computed due to
inter-class comparisons and 50 points due to intra-class comparisons. Figs. 3.4(a) and (b)
show the points corresponding to speaker-specific intra-class and inter-class comparisons,
respectively, in the case of fast vs normal utterances for the 25 male speakers. For comparison
of utterances recorded at slow and normal speaking rates, the following ordered pairs are
computed for each speaker: (a) (µSi − µNj, dKL(Si, Nj)), for i = 1, 2, j = 1, 2; (b) (µS − µN,
dKL(S, N)); (c) (µSi − µSj, dKL(Si, Sj)) for i = 1, 2, j = 1, 2, and i ≠ j; and (d) (µNi − µNj,
dKL(Ni, Nj)) for i = 1, 2, j = 1, 2, and i ≠ j. The ordered pairs in (a) and (b) denote the
inter-class comparison points within a speaker, while those in (c) and (d) denote the intra-
class comparison points within the speaker. The slow vs normal intra-class and inter-class
comparison points are plotted in Figs. 3.4(c) and (d), respectively.
It is observed from Fig. 3.4 that the points due to intra-class comparisons (Figs. 3.4(a)
and (c)) are closer to the origin than the points due to inter-class comparisons (Figs. 3.4(b)
and (d)). The inter-class comparison points also have more spread than the intra-class
comparison points. Most of the fast vs normal inter-class comparison points (Fig. 3.4(b)) lie
in the first quadrant (positive abscissa and positive ordinate), which implies that µF > µN
in most cases. The spread of the points shows that there is variation in the distributions of
instantaneous F0 of fast and normal speech. Unlike the fast vs normal inter-class comparison
points, the slow vs normal inter-class comparison points (Fig. 3.4(d)) are distributed
in both the first quadrant and the second quadrant (negative abscissa and positive
ordinate). Also, the slow vs normal inter-class comparison points have a larger spread than
the slow vs normal intra-class comparison points (Fig. 3.4(c)). These observations imply
that almost all the speakers increase their instantaneous F0 when speaking fast, but when
the speaking rate is decreased, the instantaneous F0 increases for some speakers and
decreases for others.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which instantaneous F0 increases (or decreases) when the speaking rate is increased
(or decreased). Let f(x) be the CPDF computed from a set X. The percentage of samples
in X which are less than a number k is given by f(k) × 100. The difference between the
mean instantaneous F0 of a fast utterance and the mean instantaneous F0 of the corresponding
normal utterance is computed, and is denoted as ∆µFN. A total of 250 such values
are computed. A Gaussian probability density function is fitted to the values of
∆µFN. The CPDF computed from this Gaussian probability density function is shown
in Fig. 3.5(a). It can be observed from Fig. 3.5(a) that the CPDF is equal to 0.25 when
∆µFN = 0. This implies that for 25 % of the values ∆µFN ≤ 0 and for 75 % of the values
∆µFN > 0. This observation shows that in 75 % of the cases instantaneous F0 increases
when the speaking rate is increased and in 25 % of the cases instantaneous F0 decreases
when the speaking rate is increased. Similarly, a CPDF is constructed from the values of
the difference between the mean instantaneous F0 of a normal utterance and that of the
corresponding slow utterance (denoted by ∆µNS), which is shown in Fig. 3.5(b). It can be
observed from Fig. 3.5(b) that the CPDF is equal to 0.5 when ∆µNS = 0. This implies that
for 50 % of the values ∆µNS ≤ 0 and for 50 % of the values ∆µNS > 0. This observation
shows that in 50 % of the cases instantaneous F0 increases when the speaking rate is
decreased and in
Fig. 3.5: Cumulative probability density function of the difference between mean values of instantaneous F0 of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances.
the other 50 % of the cases instantaneous F0 decreases when the speaking rate is decreased.
This clearly shows the speaker-specific nature of the change in instantaneous F0 when the
speaking rate is decreased.
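The CPDF reading used above amounts to fitting a Gaussian to the Δµ values and evaluating its cumulative function at zero. A stdlib-only sketch (the function name and the sample values are mine):

```python
import math

def fraction_below_zero(values):
    """Fit a Gaussian to the values and return its CPDF at 0, i.e. the
    estimated fraction of cases with a negative difference."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    z = (0.0 - mu) / sigma
    # Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

With Δµ values centred above zero the function returns a value below 0.5, which is how the 0.25 for ∆µFN is read off Fig. 3.5(a).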
This observation has implications for the synthesis of speech at different speaking rates:
unlike duration, where an increase (decrease) in the speaking rate results in a corresponding
decrease (increase) in the duration, the instantaneous F0 does not follow a specific
trend. For synthesizing speech at different speaking rates, the instantaneous F0 therefore
has to be modified suitably.
3.4 Variation of strength of excitation
Fig. 3.6 shows the distributions of ǫ for the four male speakers (the same as in Fig. 3.3).
From Fig. 3.6 it is observed that ǫ does vary with speaking rate. The degree of variation is
speaker-dependent. The general trend observed across all the speakers is that the mean
value of ǫ (denoted ǭ) of fast speech is less than that of normal and slow speech. For
some speakers, ǭ of normal speech is less than that of slow speech (Figs. 3.6(b) and (d)),
whereas for some others ǭ of slow speech is less than that of normal speech (Figs. 3.6(a)
and (c)). Note that this is a speaker-specific property. The spread of the distribution of the
Fig. 3.6: Distributions of strength of excitation (ǫ) for 4 male speakers. In each case, the solid ('—'), dashed ('- - -'), and dotted ('· · ·') lines correspond to normal, fast, and slow utterances, respectively.
values of ǫ for fast speech is less than that for normal and slow speech. This implies that
the variation of ǫ is smaller for fast speech than for slow and normal speech. The similarity
between the distributions of ǫ of slow and normal speech is greater than the similarity
between the distributions of ǫ of fast and normal speech. For the speakers in Figs. 3.6(a), (b),
and (c) there is good discrimination between the distributions of ǫ of fast, normal, and
slow speech, but for the speaker in Fig. 3.6(d) the discrimination is very small. In order to
evaluate the variation of ǫ with speaking rate, the KL divergence is used. The evaluation
procedure is similar to the one used in the case of instantaneous F0, discussed in Sec. 3.3.
Figs. 3.7(a) and (b) show the points corresponding to speaker-specific intra-class and
inter-class comparisons, respectively, for the case of fast vs normal utterances of all the 25
male speakers. In a similar fashion, the speaker-specific intra-class and inter-class com-
parison points are computed for slow vs normal utterances, and are plotted in Figs. 3.7(c)
and (d). It is observed from Fig. 3.7 that the intra-class comparison points (Figs. 3.7(a)
and (c)) are closer to the origin than the inter-class comparison points (Figs. 3.7(b) and
Fig. 3.7: Variation of the strength of excitation (ǫ) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case.
(d)). The inter-class comparison points have more spread than the intra-class comparison
points. Most fast vs normal inter-class comparison points (Fig. 3.7(b)) lie in the second
quadrant, which implies that the average value of ǫ is lower for fast speech than for
normal speech, for most of the speakers. Unlike the fast vs normal inter-class comparison
points, the slow vs normal inter-class comparison points (Fig. 3.7(d)) lie in both the first
and the second quadrants. This implies that the trend observed for ǫ in the slow vs normal
case is different from that observed in the fast vs normal case.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which ǫ decreases (or increases) when the speaking rate is increased (or decreased).
Similar to the case of instantaneous F0 (Sec. 3.3), two CPDFs are computed: one from the
differences between the values of ǭ of a fast utterance and the corresponding normal
utterance (denoted by ∆µFN), and one from the differences between the values of ǭ of a
normal utterance and the corresponding slow utterance (denoted by ∆µNS). Figs. 3.8(a)
and (b) show the CPDFs constructed from the values of
Fig. 3.8: Cumulative probability density function of the difference between mean values of strength of excitation at epoch (ǫ) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances.
∆µFN and ∆µNS. It can be observed from Fig. 3.8(a) that the CPDF is equal to 0.75 when
∆µFN = 0. This implies that for 75 % of the values ∆µFN ≤ 0 and for 25 % of the values
∆µFN > 0. This observation shows that in 75 % of the cases ǫ decreases when the speaking
rate is increased and in 25 % of the cases ǫ increases when the speaking rate is increased.
Similarly, from the CPDF constructed from the values of ∆µNS (shown in Fig. 3.8(b)), it
can be observed that the CPDF is equal to 0.5 when ∆µNS = 0. This implies that for 50 %
of the values ∆µNS ≤ 0 and for 50 % of the values ∆µNS > 0. This observation shows that
in 50 % of the cases ǫ increases when the speaking rate is decreased and in the other 50 %
of the cases ǫ decreases when the speaking rate is decreased. This observation matches the
observations made in the case of the instantaneous F0.
3.5 Summary
In this chapter, we presented a study of the effect of speaking rate on two excitation
source features, namely (i) the instantaneous fundamental frequency and (ii) the strength
of excitation at the epoch. Both the instantaneous F0 and the strength of excitation at the
epoch (ǫ) are estimated from the speech signal by passing it through a zero-frequency
resonator. Observations of the normalized frequency distributions of instantaneous F0 for
various speaking rates showed that, when the speaking rate is increased, almost all the
speakers increase their instantaneous F0, whereas the change in instantaneous F0 when the
speaking rate is decreased is speaker-dependent. Observations of the normalized frequency
distributions of ǫ for various speaking rates showed that when the speaking rate is
increased ǫ decreases for most of the speakers, but when the speaking rate is decreased
the change in ǫ is speaker-dependent. In order to generalize these observations over a
large number of speakers, the KL divergence was used. The utterances of each speaker
were grouped into fast, normal, and slow classes. Intra-class and inter-class comparison
points between a nonnormal (fast or slow) speaking rate and the normal speaking rate
were computed. The variations of the intra-class and inter-class comparison points
correlated well with these observations. A cumulative probability density function was used
to quantify the number of instances in which instantaneous F0 increases or decreases when
the speaking rate is changed. When the speaking rate is increased, in 75 % of the cases
instantaneous F0 was observed to increase and in 25 % of the cases to decrease, whereas
when the speaking rate is decreased, in 50 % of the cases instantaneous F0 was observed to
increase and in 50 % of the cases to decrease. Similar results were observed in the case of
the strength of excitation at the epoch.
Chapter 4
Effect of speaking rate on loudness
Loudness is an important voice-quality feature of human speech. Variation in the degree
of loudness conveys nonlinguistic information such as the emotional state of the speaker
and emphasis on particular regions of speech utterances. In general, we perceive fast
speech as louder than slow speech. In this chapter, we study the effect of speaking rate
on the perception of loudness. This is done by conducting perceptual loudness tests and
by analyzing the variations of an objective loudness measure extracted from the speech
signal.
In Section 4.1, we describe the perceptual loudness test. Sections 4.2 and 4.3 describe
the extraction method and present the study of the variation of the loudness measure
extracted from the speech signal, respectively. In Section 4.4, a modified loudness measure
is proposed and its variation with speaking rate is analyzed.
4.1 Perceptual evaluation of loudness of speech at different speaking rates
Perceptual evaluation of loudness was carried out through subjective tests with 6
listeners in the age group of 20-23 years. The tests were conducted in a laboratory
environment by playing the speech signals through headphones. For perceptual evaluation,
a subset of the database (described in Sec. 3.2) was chosen. The subset contains speech
utterances spoken by 6 male speakers (i.e., 60 utterances at each speaking rate). Two
speech files, one at a normal speaking rate and the other at a fast speaking rate, were
played in succession. Both utterances contained the same sentence, and were spoken by
the same speaker. The listeners were asked to mark 'F' (or 'N') if they perceived the
utterance spoken at the fast (or normal) speaking rate as the louder of the two. If the
listeners were unable to distinguish between the loudness of the two utterances in a pair,
they were asked to mark 'X'. Sixty pairs of utterances containing fast and normal
utterances were used for listening. The same procedure was used to compare the loudness
of utterances spoken at normal and slow speaking rates. In this case, the listeners were
asked to mark the louder one as 'S' or 'N', corresponding to the slow or normal speaking
rate, respectively. Figs. 4.1(a) and (b) show the results of the perceptual evaluation.
From Figs. 4.1(a) and (b), it is observed that the listeners were able to distinguish
between the loudness of fast and slow speech when compared with normal speech (the
percentages of 'X' in Figs. 4.1(a) and (b) are 21 and 19, respectively). In the case of fast
vs normal speech, a significant number of the fast utterances were marked as louder (73 %
of the cases, as shown in Fig. 4.1(a)). By contrast, the loudness scores of normal and slow
utterances in the case of normal vs slow speech are close to each other (43 % and 33 %,
respectively, in Fig. 4.1(b)). This study shows that fast speech sounds louder than normal
speech in most cases, whereas the evidence is insufficient to show that normal speech is
louder than slow speech. It also shows that a change in the perception of loudness takes
place when the speaking rate is changed (because of the low percentage of 'X' in
Figs. 4.1(a) and (b)). The high loudness scores for fast speech can be due either to speech
production characteristics or to speech perception characteristics. An analysis of the cause
of this behavior is presented in the next section.
Fig. 4.1: Evaluation of the effect of speaking rate on the perception of loudness. Results of comparison of (a) fast speech and normal speech, and (b) normal speech and slow speech.
4.2 Extraction of loudness measure from the Hilbert envelope of the LP residual
The strength of the impulse-like excitation (also called the strength of excitation in [63])
is expressed in terms of the loudness measure defined by η = σ/µ. Here µ denotes the mean
of the samples of the Hilbert envelope (HE) of the LP residual in a short interval around
the instants of significant excitation, and σ denotes the standard deviation of the samples
of the HE [63]. Note that this loudness measure (η) is different from the strength of
excitation (ǫ) at an epoch defined in Sec. 3.4. The Hilbert envelope r[n] of the LP residual
e[n] is given by
r[n] = √(e[n]² + eH[n]²),   (4.1)
where eH[n] denotes the Hilbert transform of e[n]. The Hilbert transform eH[n] is given by

eH[n] = IFT(EH(ω)),   (4.2)

where IFT denotes the inverse Fourier transform, and EH(ω) is given by [64]

EH(ω) = +jE(ω) for ω ≤ 0, and −jE(ω) for ω > 0.   (4.3)
Fig. 4.2: (a) Segment of a speech signal from the VOQUAL’03 speech database [57], (b) 12th order LP residual, (c) Hilbert envelope of the LP residual, and (d) contour of η extracted from the Hilbert envelope. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
Here E(ω) denotes the Fourier transform of the signale[n]. Fig. 4.2(c) shows the
Hilbert envelope of the LP residual (Fig. 4.2(b)) of the speech segment shown in Fig. 4.2(a).
A 12th order LP analysis is performed on each frame of 20 ms with a frame shift of 5 ms
to compute the LP residual. The sampling frequency of the signal is 8 kHz. The impulse-
like feature of excitation can be observed clearly from the Hilbert envelope of the LP
residual. Comparative analysis of soft, normal, and loud speech was made and reported
in [63]. The Hilbert envelope of the LP residual around the epoch locations is sharper in
the case of loud speech, compared to soft and normal speech. This behavior is illustrated
in Fig. 4.3, where the plots are obtained by overlapping the short segments of the Hilbert
envelope of the LP residual around each epoch location. The sharpness of the Hilbert
envelope of the LP residual can be captured by the parameterη, which is computed using
a 3 ms segment of the Hilbert envelope around each epoch. The mean value of η derived
from the utterances of loud speech was observed to be greater than that of normal and soft
speech. Fig. 4.2(d) illustrates the contour of the values of η for the segment of the speech signal
shown in Fig. 4.2(a).
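The computation just described can be sketched as follows. This is a simplified illustration, not the thesis code: the LP analysis is done over the whole signal rather than frame-wise for brevity, and the function names `lp_residual` and `loudness_eta` are of our own choosing.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=12):
    """LP residual via the autocorrelation method (whole-signal, for brevity;
    the thesis uses 20 ms frames with a 5 ms shift)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Solve the normal equations R a = r for the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Inverse filtering: e[n] = x[n] - sum_k a_k x[n-k].
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def loudness_eta(x, epochs, fs=8000, win_ms=3.0):
    """eta = sigma/mu of the Hilbert envelope of the LP residual in a
    short (3 ms) window around each epoch, per Eq. (4.1) and the text."""
    e = lp_residual(x)
    he = np.abs(hilbert(e))            # r[n] = sqrt(e^2[n] + eH^2[n])
    half = int(win_ms * 1e-3 * fs / 2)
    etas = []
    for n in epochs:
        seg = he[max(0, n - half):n + half + 1]
        etas.append(seg.std() / (seg.mean() + 1e-12))
    return np.array(etas)
```

A sharper (more impulse-like) envelope around an epoch yields a larger standard deviation relative to the mean, hence a larger η, which is the behavior Fig. 4.3 illustrates.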
Fig. 4.3: Overlapping segments of the Hilbert envelope of the LP residual in the vicinity of epoch locations for (a) soft, (b) normal, and (c) loud utterances [63]. [Each panel plots HE amplitude over a 0.5–2.5 ms window; figure image not reproduced.]
4.3 Variation of loudness measure
Figs. 4.4(a) and (c) show the distributions of η for two male speakers. It is observed from
Figs. 4.4(a) and (c) that η varies very little with speaking rate, even though the perceptual
loudness scores show that fast speech is perceived to be louder than normal speech. In
order to evaluate the variations of η with speaking rate, we follow an analysis procedure
similar to the one used in the cases of instantaneous F0 and ǫ. Figs. 4.5(a) and (b) show the
points corresponding to speaker-specific intra-class and inter-class comparisons, respec-
tively, for the case of fast vs normal utterances for 25 male speakers. Likewise, Figs. 4.5(c)
and (d) show the speaker-specific intra-class and inter-class comparisons, respectively, in
the case of slow vs normal utterances. The intra-class and inter-class comparison points
(Figs. 4.5(a) and (b), respectively) in the case of fast vs normal utterances have almost
equal spread. The same behavior is observed in the case of slow vs normal utterances.
But the perception studies in Sec. 4.1 indicated that fast speech is perceived to be louder
than normal speech in most cases. This implies that the loudness measure (η) doesn’t
change significantly when the speaking rate is changed, and that the change in the per-
ception of loudness may be caused by other factors.
Fig. 4.4: Distributions of loudness measure (η) and proposed measure (ηp) of perceptual loudness for two male speakers, shown in (a), (c) and (b), (d), respectively. In each case, the solid (‘—’), dashed (‘- - -’), and dotted (‘· · ·’) lines correspond to normal, fast, and slow utterances, respectively. [Normalized-frequency histograms over η and ηp; figure image not reproduced.]
4.4 Proposed measure of perceptual loudness
A parameter which measures the perception of loudness in the case of speech at different
speaking rates is proposed in this section. Speech at higher speaking rates is observed to
have higher instantaneous F0 than speech at normal and slow speaking rates. The instan-
taneous F0 at an epoch is defined as the reciprocal of the duration between two successive
epochs. Perception of loudness of the output signal depends, along with other factors, on
the energy emitted per unit time. The energy per unit time can be increased by increasing
the sharpness of excitation, or by increasing the number of excitations with equal sharp-
ness per unit time. Speech at higher speaking rates will have more excitations
per unit time than speech spoken at normal and slow speaking rates. More excitations
in a unit time result in a signal with greater energy. This may be the reason for
perceiving fast speech as louder than normal and slow speech. A parameter to measure
Fig. 4.5: Variation of loudness measure (η) with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the intra-class and inter-class comparisons, respectively, for the slow vs normal case. [Scatter plots of dKL(A, B) vs µA − µB; figure image not reproduced.]
the perceptual loudness in the case of speech at different speaking rates is obtained by
normalizing the loudness measure (η) by the pitch period. The measure is defined as
ηp = η/t0, where η is considered as a measure of the loudness at an instant of significant
excitation, and t0 is the pitch period (measured in seconds) at that instant. Figs. 4.4(b) and (d)
show the distributions of ηp for the two male speakers of Figs. 4.4(a) and (c). From these
figures it is observed that the discrimination between the distributions of ηp is better than that
between the distributions of η. But the distributions of ηp are closer to each other when
compared to the case of instantaneous F0 and ǫ. Note that this discrimination is sufficient
to model the difference in loudness between speech at different speaking rates. In the case
of speech at different loudness levels, there is a significant increase in loudness, which is
captured by the loudness measure (η) [63]. On the other hand, in the case of speech at
different speaking rates, the change in the loudness measure is not significant. This small
change in loudness due to the increased number of pitch periods per unit time is captured by
the proposed perceptual loudness measure ηp.
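The normalization described above amounts to a one-line operation once the η values and epoch locations are available. The sketch below is illustrative (the pitch period at an epoch is estimated as the interval to the next epoch, following the definition of instantaneous F0 in the text; the function name is ours):

```python
import numpy as np

def perceptual_loudness(eta, epochs, fs=8000):
    """eta_p = eta / t0: normalize the loudness measure at each epoch by the
    local pitch period t0 (in seconds), taken as the interval to the next epoch.
    Returns eta_p at all but the last epoch."""
    epochs = np.asarray(epochs)
    t0 = np.diff(epochs) / fs          # pitch periods in seconds
    return np.asarray(eta)[:-1] / t0
```

Since ηp = η/t0 is effectively η scaled by the instantaneous F0, faster speech (shorter t0) raises ηp even when η itself barely changes, which is exactly the behavior the measure is designed to capture.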
Fig. 4.6: Variation of proposed measure (ηp) of perceptual loudness with speaking rate. (a) and (b) show the results of intra-class and inter-class comparisons, respectively, for the fast vs normal case. (c) and (d) show the results of intra-class and inter-class comparisons, respectively, for the slow vs normal case. [Scatter plots of dKL(A, B) vs µA − µB; figure image not reproduced.]
The KL divergence measure is used to evaluate the variations of ηp with speaking rate.
The evaluation procedure is similar to the one used in the cases of instantaneous F0, ǫ,
and η. Figs. 4.6(a) and (b) show the fast vs normal intra-class and inter-class comparison
points, respectively. Figs. 4.6(c) and (d) show the slow vs normal intra-class and inter-
class comparison points, respectively. The spread of the inter-class comparison points
(Figs. 4.6(b) and (d)) is greater than that of the intra-class comparison points (Figs. 4.6(a)
and (c)) in both the fast vs normal and the slow vs normal cases. The difference between the
spreads of the inter-class and intra-class comparison points in the case of ηp is greater
than that in the case of η (Fig. 4.5). Also, the slow vs normal inter-class comparison points
(Fig. 4.6(d)) are spread in both the first and second quadrants. This implies that for some
speakers normal speech sounds louder than slow speech, while for other speakers it is
the opposite. The fast vs normal inter-class comparison points (Fig. 4.5(b)) for the case
of η are spread across the first and second quadrants, contrary to the perceptual loudness
scores. But in the case of ηp they are spread only in the first quadrant (Fig. 4.6(b)), which
implies that fast speech sounds louder than normal speech in most cases, an observation
which correlates with the perceptual loudness scores. Therefore the proposed measure (ηp)
correlates well with the perceptual loudness scores in both the fast vs normal and the slow
vs normal cases, which implies that ηp is a better measure of perceptual loudness than η
for speech at different speaking rates.
A cumulative probability density function (CPDF) is used to identify the number of
cases in which ηp increases (or decreases) when the speaking rate is decreased (or increased).
Similar to the cases of instantaneous F0 (Sec. 3.3) and strength of excitation
at an epoch (Sec. 3.4), two CPDFs are computed: one from the difference between the
mean values of ηp of a fast utterance and the corresponding normal utterance (denoted by
∆µFN), and one from the difference between the mean values of ηp of a normal utterance and the corresponding slow utterance (denoted by ∆µNS).
Figs. 4.7(a) and (b) show the CPDFs constructed from the values of ∆µFN and ∆µNS, respectively. It can
be observed from Fig. 4.7(a) that the CPDF is equal to 0.25 when ∆µFN = 0. This implies that
for 25 % of the values ∆µFN ≤ 0, and for 75 % of the values ∆µFN > 0. This observation
shows that in 75 % of the cases ηp increases when the speaking rate is increased, and in 25
% of the cases ηp decreases when the speaking rate is increased. Similarly, from the CPDF
constructed from the values of ∆µNS (shown in Fig. 4.7(b)), it can be observed that the CPDF
is equal to 0.5 when ∆µNS = 0. This implies that for 50 % of the values ∆µNS ≤ 0 and
for 50 % of the values ∆µNS > 0. This observation shows that in 50 % of the cases ηp
increases when the speaking rate is decreased, and in the other 50 % of the cases ηp decreases
when the speaking rate is decreased.
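Reading off percentages such as the 25 % and 50 % above corresponds to evaluating the empirical cumulative distribution of the ∆µ values at zero. This can be sketched as follows (an illustrative helper, not from the thesis):

```python
import numpy as np

def cpdf_at_zero(deltas):
    """Fraction of utterance pairs with delta_mu <= 0, i.e. the value of
    the empirical cumulative distribution at 0 (as read off Fig. 4.7)."""
    deltas = np.sort(np.asarray(deltas, dtype=float))
    # searchsorted with side="right" counts entries <= 0.
    return np.searchsorted(deltas, 0.0, side="right") / len(deltas)
```

A value of 0.25 at zero means that in 75 % of the pairs the measure increased with the speaking-rate change, as the text interprets for ∆µFN.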
4.5 Summary
In this chapter, a study on the effect of speaking rate on the perception of loudness was
presented. A perceptual loudness test and an analysis of the variations of an objective loud-
ness measure extracted from speech signals were performed. The perceptual
loudness test results showed that for most speakers fast speech was perceived to be louder
than normal speech, whereas the loudness difference between normal speech and slow
Fig. 4.7: Cumulative probability density function of the difference between mean values of the proposed loudness measure (ηp) of (a) fast utterances and normal utterances, and (b) slow utterances and normal utterances. [CPDF curves over ∆µFN and ∆µNS; figure image not reproduced.]
speech is speaker specific. Analysis of the variations of the loudness measure (η) did not show
significant changes with speaking rate. A modified measure of loudness (ηp) for
speech at different speaking rates is proposed, and its variations seem to correlate well
with subjective loudness variations.
Chapter 5
Incorporation of excitation source
variations in synthesis of speech at
different speaking rates
When humans modify the speaking rate, the most obvious change occurs in the duration. There are
several approaches in the literature for duration modification of a given speech signal. Some
of these approaches use the sinusoidal model, pitch synchronous overlap and add (PSOLA),
and phase vocoders to modify the duration [65, 66]. These methods modify the speech
signal directly to achieve the desired duration modification, which may produce some
spectral and phase distortions. Modification of the linear prediction (LP) residual to
achieve the desired duration modification was proposed in [67]. Modification of dura-
tion in the residual domain reduces the spectral and phase distortions [67]. All the
above mentioned methods modify the duration of the speech signals uniformly. In gen-
eral, it is observed that not all speech regions are uniformly modified with changes in the
speaking rate. A few methods have been suggested in the literature to perform nonuniform
duration modification [66, 55, 54]. The assumption in most nonuniform duration mod-
ification methods is that compression and expansion do not occur during sounds which
are not voiced, and that they occur during voicing due to changes in the speaking rate. To
perform nonuniform duration modification, voicing probability derived from a sinusoidal
pitch estimate was used in [66], while information from voicing onset time was used in [54].
It was shown in previous chapters (chapters 2, 3 and 4), and in several studies in the liter-
ature, that along with duration, various other acoustic features also vary when the speaking
rate is changed [39]. These changes occur at the subsegmental (less than a pitch period), seg-
mental, and suprasegmental levels. For producing natural sounding synthetic speech at
different speaking rates, not only the overall duration, but also the subsegmental, segmental,
and suprasegmental level variations need to be captured and incorporated in synthe-
sis. This chapter attempts to incorporate some of the variations at the subsegmental level into
a nonuniform duration modification algorithm. The nonuniform duration modification al-
gorithm is based on the epoch-based uniform duration modification approach proposed in
[54].

In section 5.1, an analysis of the variations in duration of different sound units is presented. A
nonuniform duration modification method, which incorporates the duration variations shown
in section 5.1 and the variations of the excitation source features described in chapter 3, is
described in section 5.2. Evaluation of the proposed method is given in section 5.3. Sec-
tion 5.4 summarizes the chapter.
5.1 Variation of durations of voiced, unvoiced and silence regions
The speech signal can be considered as the output of a linear system (the vocal tract) excited by
the vibrations of the vocal folds. Based on the type of vibration of the vocal folds, the speech
signal can be divided into three regions, namely voiced, unvoiced and silence regions.
The vibration of the vocal folds in voiced speech is periodic, whereas the excitation in the
unvoiced regions is random and can be modelled as noise-like. Vibration of the vocal folds
ceases in the silence regions. In this section, an analysis of the variations of the durations
of these regions with speaking rate is presented, to understand the effect of speaking rate
variation on the gross level characteristics of the excitation source.
5.1.1 Identification of voiced, unvoiced and silence regions
During the production of voiced speech, vibration of the vocal folds is prominent, with high
strength of excitation. In the absence of vocal fold vibration, the vocal-tract system can
be considered to be excited by random noise, as in the case of fricatives. The energy of
the random noise excitation is distributed in both the time and frequency domains, whereas
the energy of an impulse is distributed uniformly in the frequency domain and is highly
concentrated in the time domain. As a result, the zero frequency filtered signal exhibits
significantly lower amplitudes for random noise excitation compared to the impulse-like
excitation. It was shown that the energy of the zero-frequency filtered signal described in
section 3.1 can be used to detect the regions of significant vocal fold vibration (voiced
regions). Fig. 5.1(b) shows the zero frequency filtered signal of the speech signal shown
in Fig. 5.1(a). The amplitude of the zero frequency filtered signal is significantly higher
in the voiced regions compared to nonvoiced regions (for example, the region around 0.3
seconds in Fig. 5.1). The energy of the filtered signal (vzfr) over an interval of 10 ms is
used as the feature for discriminating between voiced and nonvoiced speech. The binary
voiced-nonvoiced signal is computed as

dvnv[n] = 1, if yzfr[n] > 0.5, and dvnv[n] = 0 otherwise, (5.1)

where yzfr[n] = 1 − e^(−10 vzfr[n]). Fig. 5.1(d) shows the voiced-nonvoiced decision for
the speech signal shown in Fig. 5.1(a).
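Equation (5.1) can be sketched directly. Note that yzfr > 0.5 is equivalent to the energy vzfr exceeding ln(2)/10 ≈ 0.069 (an illustrative helper; the function name is ours):

```python
import numpy as np

def voiced_nonvoiced(vzfr):
    """Binary voiced-nonvoiced decision of Eq. (5.1): the 10 ms energy of the
    zero-frequency filtered signal is squashed to (0, 1) and thresholded."""
    yzfr = 1.0 - np.exp(-10.0 * np.asarray(vzfr, dtype=float))
    return (yzfr > 0.5).astype(int)
```

The exponential squashing makes the threshold insensitive to the exact scale of the energy once it is clearly above the noise floor.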
Nonvoiced regions consist of silence and unvoiced regions of speech. Unvoiced re-
gions typically have a higher spectral energy than silence regions at higher frequencies
(around 3000 Hz). This information is used to separate silence regions and unvoiced re-
gions in the nonvoiced regions. A resonator located at 3000 Hz and having a bandwidth
of 100 Hz is used to filter the speech signals. The output signal will have frequencies only
in the regions around 3000 Hz. The system function of a two pole resonator having center
Fig. 5.1: Illustration of voiced-nonvoiced discrimination using the zero frequency filtered signal. (a) Segment of a speech signal, (b) zero frequency filtered signal, (c) energy of the zero frequency filtered signal, and (d) binary voiced-nonvoiced signal. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
frequency at 3000 Hz and bandwidth of 100 Hz is given by

H(z) = 1 / (1 − 2e^(−π/8) cos(5π/4) z^(−1) + e^(−π/4) z^(−2)). (5.2)
A high frequency signal (xuv[n]) is obtained by filtering the speech signal through H(z).
The energy of xuv[n] over a 10 ms window, represented by vuv[n], is used to discriminate between
unvoiced and silence regions. A binary unvoiced-silence decision is computed as

duv[n] = 1, if vuv[n] > vt, and duv[n] = 0 otherwise, (5.3)

where vt is the threshold. The value of the threshold is the maximum of the average energies
of the first 200 ms and the final 200 ms of vuv[n]. This is based on the assumption that there is a
silence region at the beginning and end of a sentence. Fig. 5.2(d) shows the unvoiced
decision plot obtained using eqn. (5.3). The remaining regions between the start and end
Fig. 5.2: Illustration of unvoiced-silence discrimination using the output of the resonator H(z) (equation 5.3). (a) Segment of a speech signal, (b) output of the resonator H(z), (c) energy of the filtered signal, and (d) binary unvoiced-silence signal. [Four time-aligned panels over 0–0.5 s; figure image not reproduced.]
points of an utterance are chosen as silence regions. From Fig. 5.2 it is clear that the
proposed boundary identification method is fairly reliable, and these boundaries are used
to compute the durations of voiced, unvoiced and silence regions in a given sentence.
5.1.2 Duration variations
For each utterance in the database, the durations of the voiced, unvoiced and silence regions
are obtained. The measures used to analyze the duration variations of the segments with
speaking rate are the percentage deviation of the duration of a segment, and the duration modification
factor of a segment. The term segment refers to any one of the {voiced, unvoiced, silence}
regions. The percentage deviation of the duration of a segment is determined by

D = ((y − x) / x) × 100,
Table 5.1: Percentage deviation in the durations of speech segments for different speaking rates

Speech segment | Normal to fast (µ, σ) | Normal to slow (µ, σ)
voiced         | -22.71, 8.99          | 37.67, 20.88
unvoiced       | -26.90, 25.68         | 51.64, 50.24
silence        | -16.49, 10.75         | 40.79, 40.66
where x and y are the durations of the reference and test segments, respectively. Segments at the normal
speaking rate are chosen as reference segments, whereas segments at fast or slow
speaking rates are chosen as test segments.
For each of the voiced, unvoiced and silence regions, the percentage deviation of du-
ration is computed when the speaking rate is changed from normal to fast, or from normal to slow.
The details of the percentage deviation of duration are given in Table 5.1. The numbers shown
in Table 5.1 are the mean and standard deviation of the percentage deviation of the duration of
segments, computed using the 250 utterances (25 speakers × 10 utterances). A negative sign
of the mean indicates a decrease in duration, and a positive sign indicates an increase in duration.
Table 5.1 shows that the percentage deviation of duration of unvoiced regions is greater
than that of voiced regions in both the normal to fast and normal to slow cases. Unvoiced
regions consist of unvoiced consonants (fricatives and unvoiced stops), whereas voiced
regions consist of vowels and voiced consonants. Previous studies have reported that the du-
rations of vowels and pauses change significantly when the speaking rate is changed,
while the durations of consonants change very little [54]. This study shows that not only the
durations of voiced regions, but also the durations of unvoiced regions undergo significant
changes when the speaking rate is changed. Also, the percentage change of duration is
less in the normal to fast change compared to the normal to slow change, which correlates with
previous studies [54]. The standard deviation (σ) values shown in Table 5.1 are large and comparable
with the mean (µ) values. In order to capture the variation in the durations of these segments,
the relation between the duration modification factor (α) of these units and the α of an utterance is
analyzed.
The duration modification factor of a unit is defined as α = t_f / t_i, where t_i and t_f are the
initial and final durations, when the speaking rate is changed from normal to a non-normal
Fig. 5.3: Scatter plots of duration modification factors of a region vs the corresponding utterance. (a), (b) and (c) show the scatter plots of αv vs αS, αu vs αS, and αp vs αS, respectively, in normal to fast conversion. Similarly, (d), (e) and (f) show the scatter plots in normal to slow conversion. In each plot, the straight line that fits the cluster is shown by a solid line. [Figure image not reproduced.]
(fast or slow) speaking rate. The duration modification factors of the voiced, unvoiced, silence
regions and the whole sentence are represented as αv, αu, αp and αS, respectively. If αr represents
any one of the voiced, unvoiced and silence regions, it is analyzed with reference to αS to
identify the relationship between the α of a region and that of the utterance. Fig. 5.3 shows the scatter
plots of αr vs αS in both the normal to fast and normal to slow cases. In order to identify the
relationship between the α of a region and that of the utterance, a polynomial curve fitting algorithm
was used to find the straight line that best fits the clusters shown in Fig. 5.3. The slopes of
the lines that fit the clusters and the root mean square error (RMSE) of the fits are given in
Table 5.2.

Table 5.2 shows that the error of the fit is highest for silence regions and lowest for
voiced regions in both the normal to fast and normal to slow cases. This implies that the vari-
ability in the way in which voiced duration is modified is less than that of unvoiced and
silence regions. Also, the error of the fit is greater for the normal to slow case than for the normal to
Table 5.2: Slopes (m) of the straight lines that fit the clusters shown in Fig. 5.3, and root mean squared error (RMSE) of the fits, illustrating the variation in duration modification of different units.

Speech segment | Normal to fast (m, RMSE) | Normal to slow (m, RMSE)
voiced         | 0.68, 0.057              | 0.44, 0.145
unvoiced       | 0.69, 0.148              | 0.63, 0.4902
silence        | 1.434, 0.175             | 2.39, 0.6482
fast case. This implies that the variability in duration modification of regions during normal
to fast conversion is less than for normal to slow conversion. If the slope of the line that
fits the cluster is approximately equal to 1, then αr is equal to αS. Likewise, if the slope
of the line is less than 1, then αr is less than αS, and vice-versa. Observing the slopes of the
lines that fit the clusters, the slopes of the lines corresponding to voiced and unvoiced regions
are less than 1, and that of the silence region is more than 1. This implies that in both normal
to fast and normal to slow conversions, the duration modification factor of voiced and un-
voiced regions is less than the duration modification factor of the utterance, and the duration
modification factor of silence regions is higher than the duration modification factor of
the utterance. This analysis shows that humans rely more on variations of silence re-
gions than on variations of voiced and unvoiced regions during the production of speech
at different speaking rates.
5.2 Synthesis of speech at different speaking rates
It was shown in section 5.1.2, by analyzing the variation in the durations of voiced,
unvoiced and silence regions, that nonuniform duration modification occurs when humans
produce speech at different speaking rates. When the task is to modify the duration of
an utterance by a required factor, prior information about the duration modification factors
of the voiced, unvoiced and silence regions is required. The factor
by which the duration of a segment has to be modified, when the duration modification
factor of the utterance is available, is obtained from the lines that fit the clusters shown in
Fig. 5.3. The line is of the form y = mx + c, where ‘m’ and ‘c’ are available, the abscissa
denotes the duration modification factor of the utterance, and the ordinate denotes the duration
modification factor of the unit. A nonuniform duration modification method, which uses
the information about the duration modification factor of a segment when the duration modification
factor of the utterance is available, is proposed in this section.
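A sketch of this lookup using the slopes from Table 5.2 follows. The intercepts ‘c’ are not listed in the text, so a zero intercept is assumed here purely for illustration, and the names are ours:

```python
# Slopes m from Table 5.2; the intercept c is assumed 0 for illustration only.
SLOPES_FAST = {"voiced": 0.68, "unvoiced": 0.69, "silence": 1.434}
SLOPES_SLOW = {"voiced": 0.44, "unvoiced": 0.63, "silence": 2.39}

def segment_alpha(alpha_s, direction="fast", c=0.0):
    """Map the utterance-level duration modification factor alpha_S to
    per-segment factors via the fitted lines y = m*x + c (Fig. 5.3)."""
    slopes = SLOPES_FAST if direction == "fast" else SLOPES_SLOW
    return {seg: m * alpha_s + c for seg, m in slopes.items()}
```

With slopes below 1 for voiced and unvoiced regions and above 1 for silence, the lookup reproduces the observation that silence is compressed or expanded more aggressively than speech regions.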
The nonuniform duration modification method is based on the epoch-based duration mod-
ification method proposed in [67]. There are four main steps involved in the epoch-based
time scale modification method (uniform duration modification) [67]:

1. Deriving the instants of significant excitation (epochs) from the LP residual signal.

2. Deriving a modified (new) epoch sequence according to the desired duration modification factor.

3. Deriving a modified LP residual signal from the modified epoch sequence.

4. Synthesizing speech using the modified LP residual and the LPCs.
It involves deriving a new excitation (LP residual) signal by incorporating the desired
modification in the duration of the utterance. This is done by first creating a new sequence
of epochs from the original sequence of epochs. Each epoch is associated with a time,
pitch period, linear prediction (LP) residual and linear prediction coefficients (LPCs). The
new epoch sequence involves either the insertion of new epochs for time scale expansion,
or the deletion of some epochs for time scale compression. The residual is accessed from the
original epochs, and it is modified according to the new epoch sequence. To increase the
duration, some portions of the residual are replicated at specific locations. Similarly, to
reduce the duration, some portions of the residual are omitted at specific locations.
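Step 2, the derivation of a new epoch sequence, can be sketched as follows. This is an illustrative reading of the method, not the code of [67]: local pitch periods are preserved while the overall span is scaled by the factor α, so epochs are effectively inserted (expansion) or deleted (compression).

```python
import numpy as np

def modify_epoch_sequence(epochs, alpha):
    """Build a new epoch sequence whose total span is alpha times the
    original, by walking the stretched time axis and placing each new
    epoch one local pitch period after the previous one."""
    epochs = np.asarray(epochs, dtype=float)
    periods = np.diff(epochs)               # local pitch periods (assumed > 0)
    new_epochs = [epochs[0]]
    target_end = epochs[0] + alpha * (epochs[-1] - epochs[0])
    while new_epochs[-1] < target_end:
        # Map the current new-epoch time back to the original time axis
        # and read off the pitch period there.
        t_orig = epochs[0] + (new_epochs[-1] - epochs[0]) / alpha
        k = min(np.searchsorted(epochs, t_orig, side="right") - 1,
                len(periods) - 1)
        new_epochs.append(new_epochs[-1] + periods[k])
    return np.array(new_epochs)
```

Because each new epoch is spaced by the original local pitch period, the pitch contour is left intact and only the number of excitation instants changes with the duration.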
The proposed nonuniform duration modification method is as follows. The linear predic-
tion residual of the speech signal is computed by performing a 12th order linear prediction
(LP) analysis using a 20 ms frame size and a 5 ms frame shift. Identification of the bound-
aries of the voiced, unvoiced and silence regions is performed using the method described
in Sec. 5.1.1. Modification of the duration of the LP residual is performed with the dura-
tion modification factor equal to the voiced duration modification factor, using the epoch-based
method [67]. This step takes care of the duration modification of the voiced regions. The resid-
ual corresponding to the silence and unvoiced regions is resampled to match the required
duration, and the silence and unvoiced regions in the scaled residual signal are replaced by the
resampled residual signal. This step takes care of the required duration modification of the
silence and unvoiced regions. The filter coefficients (LPCs) are updated depending on the
length of the modified LP residual. Speech with the desired duration modification can be
synthesized by exciting the all-pole filter using the modified LP residual.
It was shown in chapters 3 and 4 that when the speaking rate is changed, along with the mod-
ification of duration, the features related to the excitation source also change. The variations of
the instantaneous F0 are incorporated in the proposed nonuniform duration modification
method using the epoch-based pitch modification algorithm proposed in [67].
5.3 Evaluation of synthetic speech at different speaking
rates
The performance of the proposed method for synthesis of speech at different speaking
rates is compared with the epoch-based duration modification method using perceptual eval-
uation. The perceptual evaluation was carried out by conducting subjective tests with 10 re-
search scholars in the age group of 21-35 years. Two sentences were chosen for the
test. Speech signals were derived for duration modification factors from 0.5 to 1.5 in
steps of 0.2. For each modification factor, three types of speech signals were derived:
speech signals using (i) uniform duration modification (U), (ii) nonuniform du-
ration modification (NU), and (iii) nonuniform duration modification with instantaneous
F0 variations incorporated (NU + F0). In the NU + F0 method, the instantaneous F0 is
modified by a constant factor, 1.2 and 0.8, for the cases of normal to fast and normal to
slow conversion, respectively. The tests were conducted in a laboratory environment
by playing the speech signals through headphones. Two types of tests were conducted: (i) a mean opinion
score (MOS) test, in which the listener assigns a score from 1 (worst) to 5 (best)
based on the quality and perceptual distortion, and (ii) an AB ranking test, where the listener
has to choose an utterance from the presented utterances.
The mean opinion scores (MOS) for each duration modification factor are shown in Ta-
Table 5.3: Mean opinion scores and AB ranking test (in %).

Method  | α = 0.5 | 0.7 | 0.9 | 1.1 | 1.3 | 1.5 | AB-test (%)
U       | 2.0     | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 12.5
NU      | 2.6     | 3.3 | 3.6 | 3.0 | 3.3 | 3.3 | 50.0
NU + F0 | 2.6     | 3.0 | 4.0 | 3.3 | 3.0 | 3.0 | 37.5
ble 5.3. It can be observed from the perceptual evaluation that speech synthesized using the NU
method has higher scores than U for almost all duration modification factors (except
for α = 1.1). Similarly, NU + F0 has higher scores than U for some values of α and equal
scores for the rest. In the AB ranking test (shown in the last column of Table 5.3), a significant number of
listeners chose speech signals synthesized using the NU and NU + F0 methods (50 % +
37.5 %). It was shown in Sec. 3.3 that the instantaneous F0 varies significantly with speaking
rate, but the difference between the scores of NU and NU + F0 is not as large as expected. The qual-
ity of NU + F0 may be improved by modifying the instantaneous F0 in a speaker-specific manner.
Also, a study of the temporal variations of the instantaneous F0, and the incorporation of these
variations in synthesis, may improve the quality.
5.4 Conclusions
In this chapter, the effect of speaking rate on the durations of voiced, unvoiced and silence
regions was studied. The study showed that the variability is less in voiced duration modifica-
tion than in unvoiced and silence duration modification. Also, the voiced duration modification
factor is less than those of the unvoiced and silence segments. A nonuniform duration modification
method is proposed which is based on the epoch-based uniform duration modification
method, and which obtains the needed information about the duration modification factors of the
voiced, unvoiced and silence regions from the scatter plots of αr vs αS. The variations of
the instantaneous F0 have been incorporated in the proposed nonuniform duration modifica-
tion method. Perceptual evaluation results showed that speech signals synthesized using
the proposed methods have higher scores than the epoch-based duration modification method.
The quality of synthesis of speech at different speaking rates can be improved further by
incorporating variations in the perceived loudness and temporal variations of the instan-
taneous F0.
Chapter 6
Summary and Conclusions
6.1 Summary of the work
Speaking rate and its effects have been studied in the literature using features of articulatory movements, and features at the segmental and suprasegmental levels derived from the acoustic speech signal. This thesis presents an analysis of the effect of speaking rate, and of changes in speaking rate, on the features of the excitation source of the vocal tract system. Three features related to the excitation source are derived, namely, (a) the instantaneous fundamental frequency (F0), (b) the strength of excitation (ε), and (c) a measure of perceived loudness (η). Of these, the instantaneous F0 and the strength of excitation at the epoch are derived from the zero-frequency filtered signal. The zero-frequency filtered signal carries information about the sequence of impulse-like excitations involved in the production of voiced speech. The feature η is derived from the Hilbert envelope of the linear prediction residual of the speech signal, where the residual is an estimate of the excitation source.
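The zero-frequency filtering idea can be illustrated with a minimal sketch, following the general structure of the method of Murty and Yegnanarayana (cascaded zero-frequency resonators followed by trend removal). The window length and number of trend-removal passes below are illustrative choices, not the exact settings used in the thesis:

```python
import numpy as np
from scipy.signal import lfilter

def zff_epochs(x, fs, f0_guess=100.0, trend_passes=3):
    """Zero-frequency filtering: a simplified sketch.

    The differenced signal is passed twice through a zero-frequency
    resonator (a pair of integrators, H(z) = 1 / (1 - z^-1)^2); the slowly
    varying trend is then removed by repeatedly subtracting a local mean
    computed over roughly one pitch period. Negative-to-positive zero
    crossings of the result approximate the epochs (instants of
    significant excitation).
    """
    d = np.diff(x, prepend=x[0])              # difference to remove DC bias
    y = lfilter([1.0], [1.0, -2.0, 1.0], d)   # zero-frequency resonator, pass 1
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)   # zero-frequency resonator, pass 2
    w = int(fs / f0_guess) | 1                # odd window, about one pitch period
    kernel = np.ones(w) / w
    for _ in range(trend_passes):             # iterative local-mean subtraction
        y = y - np.convolve(y, kernel, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed signal
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
    return y, epochs
```

For a periodic input, the positive-going zero crossings of the filtered signal fall once per fundamental period, which is what makes the instantaneous F0 and the slope at those crossings (the strength of excitation) directly measurable.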
The variations of the instantaneous F0 and ε with speaking rate showed a systematic trend. Speech at high speaking rates (fast speech) generally has a higher instantaneous F0 and a lower ε than speech spoken at low speaking rates (slow speech). Most of the speakers increased their instantaneous F0 when speaking fast, but when the speaking rate was decreased, the instantaneous F0 increased for some speakers and decreased for others. For most of the speakers, ε was observed to decrease when speaking fast, but when the speaking rate was decreased, ε decreased for some speakers and increased for others.
In order to generalize the observations, the distribution of each feature for a given speaker
and a given speaking rate was approximated by a univariate Gaussian probability density
function. Kullback-Leibler divergence was then employed to estimate the discrimination
between the distributions of a feature for two different speaking rates.
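For univariate Gaussians the KL divergence has a closed form, so the discrimination between two speaking-rate distributions can be computed directly from the fitted means and standard deviations. A minimal sketch (a symmetrized version is also shown, since the directed divergence is asymmetric):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Directed KL divergence D( N(mu1, s1^2) || N(mu2, s2^2) ) in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

def kl_symmetric(mu1, s1, mu2, s2):
    """Symmetrized divergence, since D(p||q) != D(q||p) in general."""
    return 0.5 * (kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1))
```

For identical distributions the divergence is zero; shifting the mean of a unit-variance Gaussian by 1 gives a directed divergence of 0.5 nats.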
The effect of speaking rate on the perception of loudness was also investigated. Subjective listening tests were conducted to determine whether fast or slow speech sounded louder compared to normal speech. From the subjective tests, the following conclusions are drawn: (i) The subjects perceived a difference in loudness between speech at nonnormal (fast or slow) and normal speaking rates for a significant number (about 80 %) of the utterances used for listening. (ii) Fast speech sounds louder than normal speech in a majority of utterances (73 %). (iii) Slow speech sounds louder than normal for some utterances (33 %), while the reverse is true for some other utterances (43 %). The acoustic feature η, which measures perceived loudness, was employed to measure the loudness differences between speech at different speaking rates. Unlike the scores obtained from the perceptual studies, η did not show a significant difference between utterances at different speaking rates. A modified measure of perceptual loudness, denoted by ηp, is therefore defined on the basis of the number of impulse-like excitations produced per unit time. The variation of ηp with speaking rate was analyzed, and was found to correlate well with the results of the subjective studies. Therefore, ηp is a better measure of perceptual loudness than η for speech at different speaking rates.
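The idea behind ηp can be illustrated with a sketch under an assumed form; the exact definition is derived in the earlier chapters and is not reproduced here. The sketch simply weights a per-epoch loudness measure by the number of impulse-like excitations produced per unit time:

```python
def eta_p(eta_values, epoch_times, duration):
    """Illustrative sketch only (assumed form, not the thesis's exact
    definition): average the per-epoch loudness measure eta and weight it
    by the epoch rate, i.e. the number of impulse-like excitations
    produced per unit time.
    """
    if not eta_values or duration <= 0:
        raise ValueError("need at least one epoch and a positive duration")
    mean_eta = sum(eta_values) / len(eta_values)
    epoch_rate = len(epoch_times) / duration   # excitations per second
    return mean_eta * epoch_rate
```

Under this form, fast speech (more epochs per second) yields a larger ηp than slow speech with the same per-epoch loudness, which matches the direction of the subjective results reported above.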
In order to understand the significance of excitation source variations to the perception of speaking rate, the observed excitation source variations were incorporated into the uniform duration modification method. Perceptual evaluation results showed that speech signals synthesized by the proposed methods have higher scores than those from the epoch-based duration modification method.
6.2 Major contributions of the work
The central contribution of the research work reported in this thesis is the analysis of speech at different speaking rates using excitation source features. The excitation source features used are the instantaneous fundamental frequency, the strength of excitation at the epoch, and a measure of perceived loudness. The major contributions of this thesis are:
• Studied the significance of speaking rate in human speech communication.
• Analyzed the variations of the excitation source features with speaking rate.
• Analyzed the effect of speaking rate on perceived loudness, and proposed a loudness measure which can capture the loudness variations in speech at different speaking rates.
• Analyzed the duration variations of speech segments with respect to the duration variation of the utterance.
• Proposed a nonuniform duration modification method which incorporates the excitation source variations to synthesize speech at different speaking rates.
6.3 Directions for future work
• The research work in this thesis studied the variation of three excitation source features with speaking rate. Similar studies can be conducted on other features related to the excitation source, such as the normalized error, the open quotient ratio, the closed quotient ratio, and voice-quality-related features.
• When humans modify their speaking rate, the changes in the features occur in a nonuniform fashion. Identifying the regions in which significant changes in the excitation source occur is an interesting problem.
• Variations of other excitation source features can be incorporated in the synthesis of speech at different speaking rates.
• The variations in features related to the vocal tract system and to the excitation source can be combined to synthesize speech at different speaking rates.
List of Publications
Journals
1. Sri Harish Reddy M, Guruprasad S, and B. Yegnanarayana, “Analysis of speech at different speaking rates,” to be submitted to Computer Speech & Language.
Conferences
1. Sri Harish Reddy M and B. Yegnanarayana, “Incorporation of Excitation Source and Duration Variations in Speech Synthesized at Different Speaking Rates,” accepted in Proc. Speech Prosody 2010, Illinois, USA.
2. Sri Harish Reddy M, Sudheer Kumar K, Guruprasad S, and B. Yegnanarayana, “Subsegmental Features for Analysis of Speech at Different Speaking Rates,” in Proc. ICON 2009, Hyderabad, India.