VOICE CONVERSION USING ARTICULATORY FEATURES
A THESIS
submitted by
BAJIBABU BOLLEPALLI
200731002
Master of Science (by Research)
in
Electronics and Communication Engineering
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad- 500 032, India
JUNE 2012
To my parents, friends and guide
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Voice conversion using artic-
ulatory features” by Bajibabu Bollepalli (200731002), has been carried out under my
supervision and is not submitted elsewhere for a degree.
Date Adviser: Dr. Kishore Prahallad
Acknowledgements
I would like to express my deepest respect and most sincere gratitude to Dr. Kishore Pra-
hallad, for his constant guidance, encouragement at all stages of my work. I am fortunate
to have numerous technical discussions with him from which I have benefited enormously.
I thank him for allowing me to explore the challenging world of speech technology and
for always finding the time to discuss the difficulties along the way.
I thank my thesis committee members, Dr. Garimella Rama Murthy and Dr. Anil Ku-
mar Vuppala for sparing their valuable time to evaluate the progress of my research work.
I am thankful to Prof. B. Yegnanarayana, Dr. Suryakanth, and Dr. Rajendran for their
immense support and help through my research work. I am thankful to them for all the
invaluable advice on both technical and nontechnical matters. I thank my senior laboratory
members for all the cooperation, understanding and help I received from them.
I am very grateful for having had the opportunity to study among my colleagues:
Ronanki srikanth, Sathya adithya thati, Ch. Nivedita, E. Naresh kumar, P. Gangamohan,
mel-frequency cepstral coefficients (MFCC) [29], etc.). Features such as pitch period,
residual, glottal closure instants, etc., are derived from the excitation signal.
1.3.2 Alignment of parallel data
Voice conversion systems are capable of learning transformation functions from the train-
ing data of the source and target speakers. In order to map the source speaker's acoustic
space to the target speaker's acoustic space, it is necessary to know the source-target
correspondence between different training units. The process in which this correspon-
dence is established is called alignment. In this case, the most preferred frame-alignment
technique is dynamic time-warping (DTW), almost a standard in voice conversion sys-
tems [12] [18] [30].
As the durations of the parallel utterances typically differ (as shown in Fig. 1.1), dy-
namic time warping is used to align the vectors of the source and target speakers. Fig. 1.1
is a plot of an utterance recorded by two speakers. The utterance consists of 18 phones,
the boundaries of which are indicated by the vertical lines. It is very clear from this figure
that the durations of the phones in both the recorded utterances are different even though
the spoken sentence is the same. Fig. 1.2 shows that the durations of the two utterances
match after applying DTW.
Fig. 1.1: Plot of an utterance recorded by two speakers showing that their durations differ even if the spoken sentence is the same. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adapted from [1].
Fig. 1.2: Plot of an utterance recorded by two speakers showing that their durations match after applying DTW. The spoken sentence is “Will we ever forget it” which has 18 phones “pau w ih l w iy eh v er f er g eh t ih t pau pau” according to the US English phoneset. Adapted from [1].
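As an illustration of this alignment step, the following is a minimal sketch of frame-level DTW in Python/numpy. It assumes hypothetical matrices src_mcep and tgt_mcep of spectral vectors (frames x dimensions) extracted from the same sentence spoken by the two speakers; it is not the exact implementation used in this work.

import numpy as np

def dtw_align(src, tgt):
    # Align two feature sequences (frames x dims) with dynamic time warping and
    # return (source_frame, target_frame) index pairs along the minimum-cost path.
    n, m = len(src), len(tgt)
    # Local Euclidean distance between every source/target frame pair.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    # Accumulated cost with the usual match/insertion/deletion step pattern.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical usage: src_mcep and tgt_mcep are (frames x 25) spectral matrices
# extracted from the same sentence spoken by the source and target speakers.
# aligned_pairs = dtw_align(src_mcep, tgt_mcep)

The returned index pairs give the frame-to-frame correspondence used to assemble aligned training vectors.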
1.3.3 Training/testing in voice conversion
The schematic diagrams of training and testing modules in parallel voice conversion are
shown in Fig. 1.3.(a) and Fig. 1.3.(b), respectively. The training module of a voice conver-
sion system to transform both the excitation and the spectral features (filter parameters)
from a source speaker's acoustic space to a target speaker's acoustic space, is shown in
Fig. 1.3.(a). Fig. 1.3.(b) shows the block diagram of various modules involved in a voice
conversion testing process. In testing or conversion, the transformed spectral features,
along with excitation features, can be used as input to a speech production model (source-
filter) to synthesize the transformed utterance.
Fig. 1.3: Block diagram of training and testing modules in the voice conversion framework.
1.3.4 Mapping function
Mapping of spectral features
After the alignment is done, to obtain a transformation function between the spectral
features of the source speaker’s acoustic space and the target speaker’s acoustic space,
machine learning techniques such as vector quantization (VQ) [12], hidden Markov models (HMMs), Gaussian mixture models (GMMs), artificial
neural networks (ANN) [20] [21] [36], dynamic frequency warping (DFW) [13] and unit
selection [37] are applied.
Mapping of excitation features
Though the residual signal is impulse-like for voiced frames and noise-like for unvoiced
frames, it contains the glottal characteristics of speech that are not modeled by spectral
features. The excitation signal also contains information that could help to achieve the
required conversion performance and quality.
A logarithmic Gaussian normalized transformation [38] is used to transform the fun-
damental frequency F0 of a source speaker to the F0 of a target speaker as indicated in the
equation 1.1 below. The assumption in this case is that the major cues of speaker identity
lie in the spectral features and hence just a linear transformation is sufficient to transform
the excitation characteristics.
log(F0c) = µt + (σt / σs) (log(F0s) − µs)            (1.1)

where µs and σs are the mean and standard deviation of the log fundamental frequency of the
source speaker, µt and σt are the corresponding statistics of the target speaker, F0s is the
pitch of the source speaker and F0c is the converted pitch frequency.
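A minimal sketch of this transformation is given below, assuming the four statistics have already been estimated from training data; the function name and interface are illustrative, not taken from the thesis.

import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    # Log-Gaussian normalized F0 transformation of Eq. 1.1; mu and sigma are the
    # mean and standard deviation of log(F0) for the source (s) and target (t)
    # speakers.  Unvoiced frames (F0 = 0) are passed through unchanged.
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp(mu_t + (sigma_t / sigma_s) * (log_f0 - mu_s))
    return f0_conv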
1.3.5 Evaluation metrics for voice conversion
A successful voice conversion system must perform well in terms of naturalness, intelligibil-
ity, and speaker identity. Naturalness is how human-like the produced speech sounds.
Intelligibility is how much it is possible to correctly understand the words that were said,
and identity is the recognizability of the individuality of the speech. Different methods
have been proposed to measure these qualities. Some are objective measures, which can
automatically be computed from the audio data. They are typically faster and cheaper
to compute as they do not involve human experiments. Others are subjective measures,
which are based on the opinions expressed by humans in listening evaluations, or on other
human behaviour.
Objective measures
Distance measures are used most commonly for providing objective scores. One among
them is spectral distortion (SD) which has been widely used to quantify spectral envelope
conversions. For example, Abe et al., 1988 [12] measured the ratio of the spectral distortion
between the transformed and target speech to that between the source and target speech as follows:

R = SD(trans, tgt) / SD(src, tgt)            (1.2)

where R is the normalized distance, SD(trans, tgt) is the spectral distortion between
the transformed and the target speaker utterances and SD(src, tgt) is the spectral distortion
between the source and the target speaker utterances.
A comparison of the performance of different types of conversion functions using a
warped root mean square (RMS) log-spectral distortion measure was reported in [16].
Similar spectral distortion measures have been reported by other researchers [33] [39].
In addition, excitation spectrum, RMS-energy, F0 and duration distances have also been
used to measure excitation, energy, fundamental frequency and duration conversions [23].
Mel Cepstral Distortion (MCD) is another objective error measure used, which seems
to have a correlation with the subjective test results [35]. Thus MCD is used to measure
the quality of voice transformation [34]. MCD is related to vocal characteristics and
hence, is an important measure to check the performance of the mapping obtained by
ANN/GMM network. MCD is essentially a weighted Euclidean distance defined as:
MCD = (10/ln 10) * sqrt( 2 * Σ_{k=1}^{25} (c_k^e − c_k^t)^2 )            (1.3)

where c_k^t and c_k^e denote the target and the estimated Mel-cepstral coefficients, respectively.
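As a hedged illustration, the MCD of Eq. 1.3 can be computed over time-aligned frames roughly as follows; the array shapes and the exclusion of the 0th cepstral coefficient are assumptions of this sketch.

import numpy as np

def mel_cepstral_distortion(c_target, c_estimated):
    # MCD of Eq. 1.3, averaged over time-aligned frames.  The inputs are assumed
    # to be (frames x 25) arrays of Mel-cepstral coefficients with the energy
    # coefficient c0 already excluded.
    diff = c_estimated - c_target
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))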
Subjective measures
Subjective measures are based on collecting human opinions and analyzing them. Their
advantage is that they are directly related to human perception, which is typically the
standard for judging the quality of transformed speech. Their disadvantages are that they
are time-consuming, expensive, and difficult to interpret.
Two popular subjective tests are:
1. Mean opinion score (MOS): This test is used to evaluate the naturalness and in-
telligibility of converted speech. In this test, the participants are asked to rate the
transformed speech in terms of its quality and/or intelligibility. This is similar to
the similarity test, but the major difference lies in the fact that we concentrate on
the speaker characteristics in the similarity test and intelligibility in the MOS score.
2. Similarity test: The MOS score does not determine how similar the transformed
speech and the target speech are. Hence, a similarity measure is used, where the
participants are asked to grade, on a scale of 1 to 5, how close the transformed
speech is to the target speaker's speech. A score of 5 means that the transformed and
the target speech sound as if spoken by the same speaker, and a score of 1 indicates
that the two utterances sound as if they are from totally different speakers.
1.4 Voice conversion using non-parallel data
Most of the voice conversion techniques use parallel corpora for training, i.e., the source
speaker and the target speaker record the same set of utterances. In a realistic voice
conversion application, only non-parallel corpora may be available during the training
phase. Since it is not always feasible to find parallel utterances for training, methods
were proposed with the goal of reducing the recordings from the source speaker. All
such methods use non-parallel training data, the goal of which is to find a one-to-one
correspondence between the frames of the source and target speaker. The different kinds
of methods that work with non-parallel data are explained below.
1. Class mapping: In this method, the source and target vectors are separately classi-
fied into clusters using vector quantization. It involves two levels of alignment:
(a) First level: Each source speaker acoustic class is aligned to one of the target
speaker acoustic classes by searching the closest frequency-warped centroid.
(b) Second level: The vectors inside each class are mean-normalized and frame-
level alignment is performed by finding the nearest neighbour of each source
vector in the corresponding target class.
This technique was evaluated using objective measures, and it was found that its
performance was not as good as that obtained with parallel
data [40]. However, this method was proposed as a starting point for further im-
provements that led to the development of the dynamic programming method.
2. Speech recognition: Typically, speech recognition systems use a set of speaker-
independent HMMs to model the parameters of speech signal. In this technique [41],
speaker-independent HMMs are used to label the source and target speaker utter-
ances at frame level, with state indices. Given the state sequence of one speaker, the
alignment procedure consists of finding the longest matching sub-sequences from
the other speaker, until all the frames are paired. The HMMs used for this task give
good results for intra-lingual alignment. However, the suitability of such models
for cross-lingual alignment tasks has not been tested yet.
3. Pseudo parallel corpora created for TTS: In some applications, like customiza-
tion of a text-to-speech synthesizer, a huge database of speech from the source
speaker is available. So, the TTS system can be used to generate the same sentences
that have been recorded from the target speaker. Given that a parallel training cor-
pus is now available, the parameter vectors can be aligned by DTW or HMM. The
main disadvantage of this method is that it can be applied only when there is enough
data from the source speaker to build a TTS system. This strategy is incompatible
with cross-lingual applications [32].
4. Dynamic programming: This method is based on the unit selection paradigm.
Given a set of N source vectors S , dynamic programming is used to find the se-
quence of N target vectors T that minimize the acoustic distance between two
speakers. The distance measure is computed by a cost function such as the one
used in TTS systems to concatenate two units. In a unit selection based TTS sys-
tem, there are two costs involved: target cost and concatenation cost. However, in
TTS systems the target cost considers the distance between the acoustic, prosodic
and phonetic characteristics of the target units and those predicted by the TTS it-
self, according to previously trained models. Whereas in this alignment system, the
target cost considers only the acoustic distance between the vectors of the source
speaker and those of the target speaker [37] [42] [43].
One important advantage of the alignment technique based on dynamic program-
ming is, that it establishes the correspondence between frames using only acous-
tic information. Its performance is satisfactory even for cross-lingual applications.
However, it has two drawbacks: (a) it is very time-consuming, and (b) increasing
the size of the training database implies worsening the conversion scores, since the
optimal sequence of the target speaker is closer to the source speaker when there
are more frames available for selection.
Therefore, a new method for estimating pseudo-parallel data was proposed in [9]. Each
source vector is mapped to its nearest neighbor in the target acoustic space, and each
target vector to its nearest neighbor in the source acoustic space, allowing one-to-many
and many-to-one alignments (a sketch of this nearest-neighbor alignment is given
after this list). When a voice conversion system using
GMM was trained on such aligned data it was observed that an intermediate con-
verted voice was obtained. That is, it was neither recognized as the source speaker's
voice nor as the target speaker's voice. When this proposed approach was applied
on the transformed data and the target speaker data, it resulted in an output closer to
the target speaker than the previously transformed sentences. If this procedure was
followed iteratively, the final voice was found to converge to the target speaker’s
voice.
5. Adaptation technique: This technique is based on building a transformation mod-
ule on the existing parallel data of an arbitrary source-target speaker pair and then
adapting this model to a particular pair of speakers for which no parallel data is
available [44]. Suppose A and B are the two speakers between whom we need to
build a transformation function, but the recorded utterances by these speakers are
not parallel. Suppose we also have parallel recorded utterances from speakers C
and D. We could then estimate a transformation function between speakers C and
D and use adaptation techniques to adapt the conversion model to speakers A and
B.
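The following is a rough sketch of one pass of the nearest-neighbor alignment referred to in item 4 above, using SciPy k-d trees; the iterative re-alignment against transformed data described in [9] is not shown, and the function names are illustrative only.

import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_pairs(src_vectors, tgt_vectors):
    # Pseudo-parallel alignment of non-parallel data: every source vector is paired
    # with its nearest target vector and every target vector with its nearest source
    # vector, allowing one-to-many and many-to-one alignments.
    src_tree = cKDTree(src_vectors)
    tgt_tree = cKDTree(tgt_vectors)
    _, s2t = tgt_tree.query(src_vectors)   # nearest target frame for each source frame
    _, t2s = src_tree.query(tgt_vectors)   # nearest source frame for each target frame
    src_pairs = np.vstack([src_vectors, src_vectors[t2s]])
    tgt_pairs = np.vstack([tgt_vectors[s2t], tgt_vectors])
    return src_pairs, tgt_pairs            # stacked (source, target) training pairs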
Cross-lingual voice conversion is the most extreme situation in terms of alignment.
Voice conversion systems dealing with different languages have some special require-
ments because the utterances available for training are characterized by different phoneme
sets. Obviously, the main difference between intra-lingual and cross-lingual alignment is
that it is not possible to obtain parallel corpora from utterances in different languages, so
the most popular alignment strategies are not valid anymore. On the other hand, it can be
remarked that training cross-lingual voice conversion functions would not be problematic
at all if the alignment problem was solved.
1.5 Limitations of the current systems
In the previous section, the state-of-the-art speech modelling and feature transformation
techniques employed in voice conversion framework have been discussed. The existing
methods have been shown to work reasonably well and are capable of achieving convinc-
ing identity transformations when a pair of speakers with similar characteristics is in-
volved. However, if the conventional conversion techniques are extended to more extreme
applications, such as cross-lingual voice conversion, emotion conversion and speech re-
pair, results are far from convincing.
1.5.1 Limitations using parallel data
One of the main limitations of current voice conversion systems is the requirement that both
the source and the target speakers record a matching set of utterances, referred to as parallel data. A
mapping function obtained on such parallel data can be used to transform spectral char-
acteristics from the source speaker to the target speaker [12] [33] [16] [20] [36] [45] [46].
However, the use of parallel data has many limitations:
1. If either of the speakers changes, then a new transformation function has to be
estimated which requires collection of parallel data from a new speaker.
2. If there are differences between the utterances of source and target speakers in terms
of recording conditions, duration, prosody, etc., then it introduces alignment errors,
which in turn leads to a poorer estimation of the transformation function.
3. Collection of parallel data is not always feasible. Collecting a parallel set of record-
ings from both the speakers in a naturally time aligned fashion [30] is a costly and
time consuming task.
4. When applying voice conversion in a speech-to-speech translation, we desire the
target voice that is synthesized by a text-to-speech system to be identical to the
source speaker's voice. Since the source and target languages are different, it is very
unlikely to have parallel utterances of both speakers. We can classify this problem
as the need to acquire training data for cross-lingual voice conversion.
1.5.2 Limitations using non-parallel data
Section 1.4 explains the methods which align non-parallel data for training a voice con-
version system. While these techniques avoid the need for parallel data, they still require
speech data (non-parallel data) from the source speaker a priori to build the conversion
models. This is a limitation to an application where an arbitrary user intends to transform
his/her speech to a pre-defined target speaker without recording anything a priori. Thus, it
is worthwhile to investigate conversion models which capture the speaker-specific char-
acteristics of a target speaker and avoid the need for speech data from source speaker for
training. Such conversion models not only allow an arbitrary speaker to transform his/her
voice to a pre-defined target speaker but also find applications in cross-lingual voice con-
version systems.
1.6 Objective and scope of the work
The main objective of this work is to alleviate the requirement of source speaker data in
intra-lingual voice conversion and reduce the complexity in obtaining training data for a
cross-lingual voice conversion system. We propose a method to capture speaker specific
characteristics of a target speaker. Such a method needs to be trained only on target
speaker data and hence any arbitrary source speaker's speech could be transformed to the
specified target speaker.
Desai et al., 2010 [1] and Prahallad, 2010 [2] proposed a method to capture the speaker-
specific characteristics of a target speaker. To our knowledge, this is the only previous work
which does not require source data a priori. They used an ANN model to
capture the speaker-specific characteristics. The core idea of this work is as follows.
Let L and S be two different representations of the target speaker’s speech signal. A
mapping function Ω(L) could be built to transform L to S . Such a function would be
specific to the target speaker and could be considered as capturing the essential speaker-
specific characteristics. The choice of representations L and S plays an important role
in building such mapping networks and their interpretation. In their work, they assume
that L represents speaker-independent (linguistic) information, and S represents linguistic
and speaker information. Then a mapping function from L to S should capture speaker-
specific information in the process. They used the first six formants, their bandwidths, and
delta features as the representation of L. The formants undergo a normalization technique
such as vocal tract length normalization (VTLN) to compensate for the speaker effect. S
is represented by traditional mel-cepstral features (MCEPs). They introduce a concept
of an error correction network which is essentially an additional ANN network, used to
map the predicted MCEPs to the target MCEPs so that the final output features represent
the target speaker in a better way. A schematic diagram of the training and conversion
modules is shown in Fig. 1.4. Notice that during training, only the target speakers data is
used. The limitations of this work are:
• The formants are used to represent the language information (L) in speech signal.
So, it is necessary to extract correct formants from speech signal. But it is very
difficult to find a method to extract exact formants from a given signal [47].
• Theoretically speaking, the number of formants varies from phone to phone. How-
ever, in this work, 6 formants are used for every phone. So, it is not the optimal
representation for a phone.
• VTLN is used to normalize speaker effect in formants. This method does not work
without VTLN.
• This work uses an error correction network to improve the performance of the sys-
tem. It is a separate ANN mapper which adds more computations and parameters
to the system.
In this work, we investigate alternatives such as articulatory features for speaker-independent
representation of speech signal.
Fig. 1.4: Flowchart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics. Notice that during training, only the target speaker's data is used. Adapted from [2].
1.7 Contributions of this thesis
In this thesis, we propose articulatory features (AFs) as the canonical, speaker-
independent representation of the speech signal. The AFs used in this work represent characteristics of the speech production process
like manner of articulation, place of articulation, lip rounding, etc. These features are
motivated by the human speech production mechanism. Chapter 2 briefly explains AFs
and how they can be extracted from a given speech signal. These features have been used
for automatic speech recognition (ASR) with the aim of better pronunciation modeling,
better co-articulation modeling, robustness to cross speaker variation and noises, multi-
lingual and cross-lingual portability of systems, language identification and expressive
speech synthesis. In these studies, often the articulatory features derived from the acous-
tics are treated as generic or speaker-independent representation of the speech signal. But
we show that AFs contain significant amount of speaker information in their trajectories.
Thus, we propose suitable techniques to normalize the speaker-specific information in AF
trajectories and the resultant AFs are used for voice conversion.
1.8 Organization of the thesis
The contents of this thesis are organized as follows: In chapter 2, we briefly explain
articulatory features and the features we use in this work. The methods to extract these
features from a given speech signal are also discussed. We summarize previous research
on the use of articulatory features for various speech systems.
In chapter 3, we analyze the speaker-specific information in articulatory features by
conducting speaker identification experiments with Gaussian mixture models. We show
that AFs contain significant amounts of speaker information in their contours. We pro-
pose a technique to normalize the speaker-specific information in the AFs. Finally, we
conclude that the speaker-specific information in AFs has to be normalized before they are used
in voice conversion as the canonical form of a speech signal.
Chapter 4 proposes a new method that captures speaker-specific characteristics and
hence resolves the issue of requiring source speaker data for voice conversion training.
Finally, we conclude this chapter with experiments and results of this method when tested
in a cross-lingual voice conversion scenario.
In chapter 5, we summarize the contributions of the present work, and highlight some
issues arising out of the study.
Chapter 2
Articulatory features
This chapter introduces the concept of articulatory features that are used in this work and
methods to extract these features from a given speech signal. Section 2.1 gives a brief in-
troduction about the human speech production process, and the role articulatory features
play in describing it. In Section 2.2, we describe different types of articulatory features
(AFs) and the type of articulatory features that are modeled in this work. Section 2.3 ex-
plains the extraction of AFs from speech signal using ANNs and discusses some objective
measures used to evaluate the accuracy of extracted AFs. The summary of this chapter is
presented in Section 2.4.
2.1 Human speech production
The production of human speech is mainly based on the modification of an egressive
air stream by the articulators in the human vocal tract [3]. The activity of the vocal
organs in making a speech sound is called articulation. It involves three major processes:
1)The air stream process, 2)The phonation process, and 3)The configuration of the vocal-
tract (oro-nasal process). The Air stream process describes how sounds are produced and
manipulated by the source of air. The pulmonic egressive mechanism is based on the air
being exhaled from the lungs while the pulmonic ingressive mechanism produces sounds
while inhaling air. Ingressive sounds, however, are rather rare. The Phonation process
Artificial Neural Network (ANN) models consist of interconnected processing nodes,
where each node represents the model of an artificial neuron, and the interconnection be-
tween two nodes has a weight associated with it. ANN models with different topologies
perform different pattern recognition tasks. For example, a feed-forward neural network
can be designed to perform the task of pattern mapping, whereas a feedback network
could be designed for the task of pattern association. A multi-layer feed forward neural
network is used in this work to obtain the mapping function between the acoustic and the
articulatory vectors.
Figure 2.2 shows the architecture of a five-layer ANN used to capture the transforma-
tion function for mapping the acoustic features onto the articulatory space. The ANN is
trained to map the MCEPs vector to an AF vector, i.e., if G(xt) denotes the ANN mapping
of xt, then the error of mapping is given by ε = Σ_t ||yt − G(xt)||^2. G(xt) is defined as

G(xt) = g(w(4) g(w(3) g(w(2) g(w(1) xt)))),            (2.1)

where g(κ) = κ for layers with linear activation, and

g(κ) = a tanh(bκ)            (2.2)

for layers with nonlinear activation.
Fig. 2.2: Architecture of a five-layered MLFFNN with the number of nodes in each layer and the type of activation function.
Here w(1), w(2), w(3), w(4) represent the weight matrices of the first, second, third and
fourth hidden layers of ANN respectively. The values of the constants a and b used
in the tanh function are 1.7159 and 2/3 respectively. A generalized back propagation
learning [20] is used to adjust the weights of the neural network so as to minimize ε, i.e.,
the mean squared error between the desired and the actual output values. Selection of
initial weights, architecture of ANN, learning rate, momentum and number of iterations
are some of the optimization parameters in training an ANN [25]. Once the training
is complete, we get a weight matrix that represents the mapping function between the
acoustic features and articulatory features. Such a weight matrix can be used to transform
a feature vector from acoustic space to a feature vector of the articulatory space.
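For illustration, a forward pass implementing Eq. 2.1 and Eq. 2.2 could look like the following sketch; biases and the training loop are omitted, and the weight shapes in the final comment are assumptions based on the architecture described later in this section.

import numpy as np

A, B = 1.7159, 2.0 / 3.0   # constants a and b of the tanh activation in Eq. 2.2

def forward(x, weights, layer_types):
    # Forward pass of the multi-layer feed-forward network of Eq. 2.1.  `weights`
    # holds the matrices w(1)..w(4); `layer_types` marks each transformation as
    # linear ('L', g(k) = k) or nonlinear ('N', g(k) = a*tanh(b*k)).  Bias terms
    # are omitted for brevity.
    h = x
    for w, kind in zip(weights, layer_types):
        h = w @ h
        if kind == 'N':
            h = A * np.tanh(B * h)
    return h

# Assumed shapes for a 25L 50N 12L 50N 26L MCEP-to-AF mapper:
# w(1): 50x25, w(2): 12x50, w(3): 50x12, w(4): 26x50, with layer_types ['N', 'L', 'N', 'L'].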
Input and Output Representation
In this work, MCEPs are used as inputs to train the ANN mapper. The use of excitation
features or other representations of speech signal have not been explored in the scope of
this work.
To train a MCEP-to-AF mapper, a representation for AF is required for each MCEP
vector. Such knowledge could be obtained by phonetic segmentation of speech obtained
manually or automatically. The utterances in the TIMIT database have time stamps at
phone level. This is used to know the beginning and ending of each phone in the utter-
ance. Also, given a phone symbol, we relied on its phonological properties to derive an
AF representation. This representation is binary in nature and the number of bits used
to denote this representation is explained in Table 2.1. For example, the 1st bit in the
AF representation could take a value 1 or 0 based on whether the phone is voiced or un-
voiced. Thus an ANN model is trained to map an MCEP vector to the corresponding
phonologically derived AF. Although the training of the ANN model is done using a bi-
nary representation at the output layer, the final output of the ANN model is continuous.
That is, the output of the ANN model at each node is a continuous value in the range of
0 and 1, as shown in Fig. 2.3. Figure 2.3.(b) shows the phonologically derived AFs,
which are binary values, where black corresponds to bit value 1 and white
corresponds to bit value 0. Figure 2.3.(c) shows the acoustically derived AFs, which
are continuous values varying from 0 to 1. The implicit assumption in the representation
of binary AFs for a phone is that the speech production of one phone is independent of the
other. So, phonologically derived AFs are discrete. But actual speech production is con-
tinuous in nature. The production of one phone is dependent upon the next phone. So,
acoustically derived AFs are continuous.
This difference in the expected and the actual values at the nodes in the output layer
could be attributed to contextual effects of the phones which are not captured in the phono-
logically derived AF representation. Thus, the output of the ANN model is treated as an
acoustically derived AF representation which encapsulates co-articulation, emotion and
speaker characteristics [68].
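A toy sketch of the phonological derivation described above is given below; the property set and phone entries are hypothetical and do not reproduce the actual 26-bit assignment of Table 2.1.

# A toy, hypothetical phone-to-AF lookup.  Only the idea of deriving a binary
# articulatory vector from the phonological properties of a phone label is shown.
PROPERTIES = ["voiced", "nasal", "fricative", "stop", "approximant", "rounded"]

PHONE_PROPERTIES = {
    "m": {"voiced", "nasal"},
    "s": {"fricative"},
    "z": {"voiced", "fricative"},
    "t": {"stop"},
    "w": {"voiced", "approximant", "rounded"},
}

def phone_to_af(phone):
    # Bit k of the AF vector is 1 if the phone has property k, otherwise 0.
    props = PHONE_PROPERTIES.get(phone, set())
    return [1 if p in props else 0 for p in PROPERTIES]

# Every MCEP frame inside a labelled phone segment receives this vector as its
# training target, e.g. phone_to_af("z") -> [1, 0, 1, 0, 0, 0]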
The structure of the ANN model used is 25L 50N 12L 50N 26L, where the integer
value indicates the number of nodes in each layer and L / N indicates the linear or non-
linear activation function. It is a five layer feed forward neural network that consists of
three hidden layers. Generally, the dimension of the expansion layer is set to
twice the dimension of the input and the size of the compression layer to about half
the dimension of the input layer.
Fig. 2.3: (a) Waveform of the sentence “The angry boy answered but they didn’t look up.”, (b) Expected output in binary (phonologically derived AFs). (c) Actual output in continuous values (acoustically derived AFs).
[Figure: analysis map (MCEPs → AFs, MLFFNN architecture 25L 50N 20L 50N 26L) followed by a synthesis map (AFs → estimated MCEPs, architecture 26L 50N 20L 50N 25L).]
Fig. 2.4: Block diagram representation of both analysis and synthesis of AFs.
2.3.4 Evaluation of mapping accuracy
Measuring cepstral distortion
To evaluate how well the AFs are predicted from MCEPs, we used another ANN to map
the predicted AFs back to the original MCEPs. This mapping can be called the synthesis phase,
and the whole framework is referred to as analysis-by-synthesis, as shown in Fig. 2.4. The
structure of the ANN model used is 26L 50N 12L 50N 25L. The performance of analysis-
by-synthesis approach could be measured by Mel-cepstral distortion (MCD) computed
between the output of synthesis phase and the original MCEPs. MCD is related to fil-
ter characteristics and hence, is an important measure to check the performance of the
Table 2.2: Average MCD and MOS scores of analysis-by-synthesis approach.
Approach                  MCD     MOS    Similarity test
analysis-by-synthesis     4.604   3.97   4.44
mapping obtained by an ANN model. The MCD is computed as follows:
MCD = (10/ln 10) * sqrt( 2 * Σ_{i=1}^{24} (c_i^o − c_i^e)^2 )            (2.3)

where c_i^o and c_i^e denote the original and the estimated mel-cepstral coefficients, respectively [18].
Given that the MCEP-to-AF and AF-to-MCEP mapping networks are trained on TIMIT
training set, we computed the MCD for all utterances in the testing set (1344 utterances).
We synthesized 10 utterances from the predicted MCEPs and original F0, using the mel-log
spectral approximation (MLSA) filter [67]. For all the experiments done in this work, we
have used pulse excitation for voiced sounds and random noise excitation for unvoiced
sounds.
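A simple sketch of such an excitation generator is shown below; the frame-rate handling and pulse placement are simplified assumptions, not the exact procedure used here.

import numpy as np

def make_excitation(f0, hop_size, fs):
    # Pulse-train excitation for voiced frames (F0 > 0) and Gaussian noise for
    # unvoiced frames, at a frame rate of hop_size samples per frame.
    excitation = np.zeros(len(f0) * hop_size)
    next_pulse = 0.0
    for i, f in enumerate(f0):
        start = i * hop_size
        if f > 0:                      # voiced: one impulse every fs/f samples
            period = fs / f
            while next_pulse < start + hop_size:
                if next_pulse >= start:
                    excitation[int(next_pulse)] = np.sqrt(period)
                next_pulse += period
        else:                          # unvoiced: white Gaussian noise
            excitation[start:start + hop_size] = np.random.randn(hop_size)
            next_pulse = start + hop_size
    return excitation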
We conducted mean opinion score (MOS) and similarity tests to evaluate the perfor-
mance of analysis-by-synthesis method. These are subjective evaluations where listeners
evaluate the speech quality of the synthesized speech using a 5-point scale (5: excellent,
4: good, 3: fair, 2: poor, 1: bad) and closeness of synthesized speech with original speech
signal. Table 2.2 shows the average MCD, MOS and similarity test scores for all test
sets. The MOS and similarity scores were obtained from 10 subjects, each performing
the listening tests on 10 utterances. An analysis drawn from these results shows that AFs
do capture sufficient information of speech signal. It is typically observed that an MCD
score less than 6.0 produces good quality speech in speech synthesis/voice conversion.
Measuring frame-wise recognition
The evaluation method used is a comparison of overall accuracy in terms of frame error
rate (FER) together with insertion and deletion. FER is widely used for articulatory fea-
ture extraction evaluation [69]. This is because, in current speech technology, articulatory
features are commonly used as an alternative or additional speech representation. Speech
Table 2.3: Frame-wise recognition using TIMIT database.
Consonants (Place + Manner; classes include labio-dental, dental, velar)    84.44
3.2 Normalizing speaker specific information
In order to normalize the speaker specific information in AF streams, we have experi-
mented with mean smoothing the AF trajectories with a 5-point and an 11-point window.
The idea is to smooth the correlations among the samples in the AF trajectories so that
the smoothed trajectories normalize the effect of speaker-specific characteristics on the
AF streams.
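A minimal sketch of this mean-smoothing operation follows; edge handling via zero padding is an assumption of this sketch.

import numpy as np

def mean_smooth(af, window=11, iterations=1):
    # Smooth each AF trajectory (frames x AF-dims) with a moving-average window,
    # applied `iterations` times, to attenuate the fast, speaker-dependent
    # fluctuations in the contours.
    kernel = np.ones(window) / window
    smoothed = af.astype(float)
    for _ in range(iterations):
        for d in range(smoothed.shape[1]):
            smoothed[:, d] = np.convolve(smoothed[:, d], kernel, mode="same")
    return smoothed

# e.g. af_norm = mean_smooth(af_trajectories, window=11, iterations=4)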
Fig. 3.1: (a), (b) and (c) show unsmoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively. (d), (e) and (f) show smoothed AF contours of stops, fricatives and approximants for speaker-1 and speaker-2, respectively.
Figure 3.1 shows the unsmoothed and smoothed contours of stops, fricatives and ap-
proximants for two different speakers. The unsmoothed contours contain small, rapid variations
which differ between the speakers. After applying a mean-smoothing window to these
contours, these variations are removed, and the smoothed contours become more
similar across the speakers.
Figure 3.2 shows the speaker identification performance after applying the mean-
smoothing repeatedly, up to five times. It can be observed that the performance of the SID
system decreases with every iteration of mean-smoothing, and more so for the 11-point
window spanning 225 milliseconds (frame shift is 5ms).
Fig. 3.2: Speaker identification accuracies for different levels of smoothing by 5-point and 11-point mean-smoothing windows. Level ‘k’ corresponds to applying the mean-smoothing window ‘k’ times.
A relevant question here is – Do these smoothing operations reduce only speaker in-
formation, or speech information as well? To study the effect of mean-smoothing on the
speech quality, we built an AF-to-MCEP mapper after every iteration of smoothing. This
mapper was tested on the held-out test set and an MCD score was computed as described
in Section 2.3.4. In Fig. 3.3, we show the speaker identification accuracies and MCD scores
normalized with respect to the initial accuracy (85.24%) and the initial MCD score (4.604), respec-
tively. It can be seen that, after five iterations of mean-smoothing of the AFs, the normalized
accuracy of speaker identification drops by about 0.6, whereas the MCD score increases
by only about 0.2, indicating that the loss of spectral information is much smaller than the loss of
speaker information.
Fig. 3.3: Speaker identification accuracies and MCD scores for different levels of smoothing. Level ‘k’ corresponds to applying the 11-point mean-smoothing window ‘k’ times. All scores are normalized with respect to the scores without smoothing (Level 0).
3.2.1 Use of smoothed AFs for speech recognition
The goal of a speech recognition system is to decode the textual message in the speech sig-
nal. Such systems have to work for all speakers in all environments. So, it is necessary to
normalize the speaker-specific information in speech signals before using them in speech
recognition; otherwise it acts like noise to the system, and performance degrades signif-
icantly. In this section, we describe the speech recognition experiments we have con-
ducted using MCEPs, unsmoothed AFs and smoothed AFs. The system is a monophone-based
speech recognizer. We trained context-independent HMM models for each phoneme by
using 16 Gaussian mixtures with the help of the HMM toolkit (HTK). We used the train-
ing directory (468 speakers) of TIMIT database for training and the testing directory (162
speakers) was used to test the system. Table 3.3 shows the performance of the system
using these three features. From this table one can observe that, by smoothing the AFs itera-
tively, the performance of the system increased gradually. After the 4th iteration the
accuracy of the system decreased a little, which means that the speech information was
also beginning to be lost; this suggested stopping the smoothing after 4 iterations.
These experiments conclude that AFs contain speaker characteristics and this could be
reduced by smoothing.
Table 3.3: Phone recognition accuracies using MCEPs and smoothed AFs. AFs-‘k’ corre-spond to applying 11-point mean-smoothing window ‘k’ times.
Features   Accuracy
MCEPs      52.16%
AFs-0      26.32%
AFs-1      40.68%
AFs-2      42.94%
AFs-3      44.24%
AFs-4      44.65%
AFs-5      44.05%
3.3 Summary
This chapter described a speaker identification approach using only AFs. We modeled
each speaker using GMMs. To estimate the parameters of the GMMs, the EM algorithm was
used. The results based on the complete TIMIT corpus have shown that AFs contain significant
speaker-specific information, more so in the AF streams of consonants. To remove the
speaker-specific information in AFs we smoothed the AF trajectories and used them
in speaker identification. Results show a significant decrease in the performance of the
speaker identification system when smoothed AFs are used. It is also shown that smoothed AFs
perform better than unsmoothed AFs for speech recognition.
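For illustration, the GMM-based speaker identification setup summarized above can be sketched with scikit-learn's GaussianMixture; the number of mixture components and the covariance type here are assumptions, not the exact configuration used in the experiments.

from sklearn.mixture import GaussianMixture

def train_speaker_models(train_data, n_components=16):
    # Fit one GMM per speaker (EM algorithm) on that speaker's AF frames.
    # `train_data` maps a speaker id to a (frames x AF-dims) array.
    return {spk: GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)
            for spk, frames in train_data.items()}

def identify_speaker(models, test_frames):
    # Return the speaker whose GMM gives the highest average log-likelihood
    # for the frames of the test utterance.
    scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)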
Chapter 4
Use of articulatory features for voice
conversion
Chapters 2 and 3 showed the extraction of articulatory features (AFs), analysis of speaker
information in AFs and normalization of speaker information in AFs. In this chapter,
we propose a voice conversion method using articulatory features (AFs) which are used
to capture speaker-specific characteristics of a target speaker. Such a method avoids the
need for speech data from a source speaker and hence could be used to transform an
arbitrary speaker including a cross-lingual speaker. The basic idea used in this work is
shown in the block diagram in Fig. 4.1. It involves two steps: 1) Projecting the source speaker space
into a speaker-independent space where it has only the message part of the signal. 2)
Mapping the speaker-independent space to target speaker space by using an ANN mapper
which captures the target speaker-specific characteristics. Here, AFs are used to represent
speaker independent space in the process of capturing the speaker-specific characteristics
of a target speaker.
Section 4.1 explains a model which is used to capture speaker-specific characteristics
of a target speaker and the mathematical representation of that model. Section 4.2 de-
scribes the use of such a model in intra lingual voice conversion using the AFs and also
discusses both subjective and objective evaluations used to evaluate the performance of
the system. In section 4.3 we discuss how we can extend that model for cross lingual
Fig. 4.1: Mapping of an arbitrary source speaker into the target speaker.
voice conversion using AFs.
4.1 Noisy-channel model
As discussed in chapter 1, the assumption of existence of parallel or pseudo-parallel data
is not valid for many practical applications. Hence, we posed an alternative, but relevant
research question, which is – “How to capture speaker specific characteristics of a target
speaker from the speech signal (independent of any assumptions about a source speaker)
and impose these characteristics on the speech signal of an arbitrary source speaker to
perform voice conversion?”. The problem of capturing speaker specific characteristics
can be attempted by the following method.
The problem of capturing speaker-specific characteristics can be viewed as modeling
a noisy-channel [2]. Suppose, C is a canonical form of speech signal i.e., a generic and
speaker-independent representation of the message in speech signal which passes through
the speech production system of a target speaker to produce a surface form S . This surface
form S carries the message as well as the identity of the speaker.
One can interpret S as the output of a noisy-channel, for the input C. Here, the noisy-
channel is the speech production system of the target speaker. The mathematical formu-
lation of this noisy-channel model is –
argmax_S p(S/C) = argmax_S [ p(C/S) p(S) / p(C) ]            (4.1)

                = argmax_S p(C/S) p(S)            (4.2)
as p(C) is constant for all S. Here p(C/S) could be interpreted as a production model.
Fig. 4.2: Capturing speaker-specific characteristics as a speaker-coloring function.
p(S) is the prior probability of S and it could be interpreted as the continuity constraints
imposed on the production of S. It could be seen as analogous to a language model of S.
In this work, p(S/C) is directly modeled as a mapping function between C and S
using artificial neural networks (ANN). The process of capturing speaker-specific charac-
teristics and its application to voice conversion is explained below:
Suppose, we derive two different representations C and S from the speech signal with
the following properties: Let, C be a canonical form of speech signal, i.e., a generic
and speaker-independent form - approximately represented by articulatory features (AFs)
extracted from speech signal. Let S be a surface form represented by Mel-cepstral co-
efficients (MCEPs). If there exists a function Ω(.) such that S′ = Ω(C), where S′ is an
approximation of S, then Ω(C) can be considered as specific to a speaker. The function
Ω(.) could be interpreted as a speaker-coloring function. We treat the mapping function
Ω(.) as capturing speaker-specific characteristics. It is this property of Ω(.) that we exploit
for the task of voice conversion. Fig. 4.2 depicts the concept of capturing speaker-specific
characteristics as a speaker-coloring function.
4.2 Intra lingual voice conversion
4.2.1 Database
The experiments here were carried out on the CMU ARCTIC database consisting of ut-
terances recorded by seven speakers. Each speaker has recorded a set of 1132 phonet-
ically balanced utterances [70]. The ARCTIC database includes utterances of SLT (US
Given the utterance of a target speaker T, the corresponding canonical form CT of the
speaker is represented by AFs. To alleviate the effect of speaker characteristics, the AFs
undergo a normalization technique such as smoothing, as explained in Section 3.2. The
surface form ST is represented by traditional MCEP features, as this allows us to syn-
thesize using the MLSA synthesis technique. The MLSA synthesis technique generates
a speech waveform from the transformed MCEPs and F0 values using pulse excitation or
random noise excitation. An ANN model is trained to map CT to ST using the backpropaga-
tion learning algorithm by minimizing the Euclidean error ||ST − S′T||, where S′T = Ω(CT).
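A hedged sketch of this training step is given below, using a generic multi-layer perceptron regressor in place of the exact five-layer L/N network of Chapter 2; the hidden-layer sizes and iteration count are assumptions.

from sklearn.neural_network import MLPRegressor

def train_speaker_coloring(C_T, S_T):
    # Train Omega: smoothed AFs of the target speaker (C_T) -> target MCEPs (S_T).
    # A generic MLP regressor stands in for the five-layer L/N network of Chapter 2;
    # it minimizes the squared error ||S_T - Omega(C_T)|| by backpropagation.
    omega = MLPRegressor(hidden_layer_sizes=(50, 12, 50), activation="tanh", max_iter=500)
    omega.fit(C_T, S_T)
    return omega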
4.2.3 Conversion process
Once the target speaker’s model is trained, it can be used to convert CR to S′T, where CR is
the canonical form from an arbitrary source speaker R. To get the canonical form for any
arbitrary source speaker, one of the three methods below could be followed; a sketch of the
resulting conversion step is given after the list. The process
to build any of these encoders is explained in Section 2.3.3.
1. Use source speaker encoder. This requires building an encoder specific to a source
speaker, and hence a large amount of speech data (along with transcription) from
the source speaker is required.
2. Use target speaker encoder. This maps MCEPs of an arbitrary source speaker onto
AFs using target speaker’s encoder.
3. Use average speaker encoder. This maps MCEPs of an arbitrary source speaker
onto AFs using an average speaker encoder which is trained using all speakers’
data except that of source and target speakers. Since an average model is used
to generate AFs, a form of speaker normalization takes place on AFs even before
smoothing is applied.
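Putting the pieces together, the conversion step can be sketched as follows, assuming the mean_smooth function from the Chapter 3 sketch and a trained target-speaker model; the encoder and model interfaces shown here are illustrative only.

def convert(source_mcep, average_encoder, omega_target, window=11, iterations=4):
    # Convert an arbitrary source utterance to the target speaker's voice:
    # MCEPs -> AFs (canonical form C_R) -> smoothing -> target-speaker model.
    afs = average_encoder.predict(source_mcep)                        # canonical form C_R
    afs = mean_smooth(afs, window=window, iterations=iterations)      # normalize speaker traits
    return omega_target.predict(afs)                                  # predicted target MCEPs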
[Figure: bar plot of Mel-cepstral distortion (in dB) for the conversions RMS to SLT, BDL to SLT, SLT to BDL and RMS to BDL, comparing the target speaker, source speaker and average speaker encoders.]
Fig. 4.3: Plot of MCD scores obtained between different speaker pairs.
4.2.4 Validation
By using the three methods discussed in Section 4.2.3 we predicted the AFs for three
source speakers (SLT, BDL and RMS). The AFs were smoothed to normalize speaker-
specific information. Smoothed AFs were mapped onto the BDL and SLT speaker-
specific model. To test the effectiveness of the voice conversion model, we calculated
the Mel-cepstral distortion (MCD) between predicted MCEPs and actual MCEPs. MCD
is a standard measure used in speech synthesis and voice conversion evaluations [18].
Fig. 4.3 shows the MCD scores obtained using the three methods. We observe that the
average speaker encoder gives a lower MCD score compared to the other two methods.
This suggests that the use of an average speaker encoder generates normalized AFs, and
smoothing the AF trajectories further helps in realizing the speaker-independent form.
The rest of the experiments were carried out using the average speaker encoder to get the canon-
ical form for any arbitrary source speaker. It can also be used to get the canonical form
for the target speaker.
Fig. 4.4: Flow-chart of the training and conversion modules of a voice conversion system capturing speaker-specific characteristics.
4.2.5 Experiments on multiple speaker database
To test the validity of the proposed method, we conducted experiments on other speakers’
databases from the CMU ARCTIC set, such as RMS, CLB, AWB, and KSP. Fig. 4.4.(a)
shows the block diagram for the training process and Fig. 4.4.(b) shows the block diagram
for the conversion process. Table 4.1 provides the results of mapping CR (where R =
BDL, RMS, CLB, AWB, KSP, SLT voices) onto the acoustic space of SLT and BDL.
Table 4.1: MCD scores obtained between multiple speaker pairs with SLT and BDL as targetspeakers. Scores in parenthesis are obtained using parallel data.
Table 4.2: Subjective evaluation of voice conversion models built by using parallel and Noisy-channel models.
Transformation using     SLT to BDL    BDL to SLT
Parallel data            3.34          3.58
Noisy-channel model      3.14          3.40
By using smoothed AFs we can transform any arbitrary speaker into a predefined
target speaker without needing any utterance from the source speaker while training the voice
conversion model. This indicates that the methodology of training an ANN model to
capture speaker-specific characteristics for voice conversion could be generalized over
different datasets.
4.3 Cross-lingual voice conversion
Cross-lingual voice conversion is a task where the language of the source and the target
speakers is different. We employ the same model explained in Section 4.1 to capture
speaker-specific characteristics for the task of cross-lingual voice conversion. We per-
formed an experiment to transform two speakers’ (speaking Kannada and Telugu) utter-
ances into a male voice speaking English (US male - BDL). Our goal here is to transform
two speaker voices to BDL voice. Hence the output will be as if BDL were speaking in
Kannada and Telugu, respectively.
Here, we can pose some research questions:
• How to extract the canonical form (AFs) for a cross lingual source speaker?
• Can we use the average encoder which is used for intra lingual voice conversion
explained in Section 4.2.3? (It is built by using the data of many speakers of same
language.)
• Do we need to include the data of some other languages in the average model, to
normalize the language information in speech signal?
To answer the above questions, we used two encoders to extract the canonical form
for cross lingual source speaker.
1. Use a multi-speaker, mono-lingual encoder. It is trained by using many speakers
of the same language, without source and target speakers’ data. It does a form of
speaker normalization in AFs. It is similar to an average speaker encoder model in
intra-lingual voice conversion.
2. Use a multi-speaker, multi-lingual encoder. It is trained by using many speakers
of multiple languages, without the source and target speakers’ data. This kind of encoder
offers a form of language and speaker normalization in AFs.
The process to build the above encoders is the same as explained in Section 2.3.3. Artic-
ulation of some phones in one language is different from that in other languages. In this
work, we used the phone information to derive the AFs from speech signal. So, it was
necessary to consider the significant articulations in other languages.
To check the performances of both the encoders, we extracted the AFs using two
encoders separately. The AFs extracted using the multi-speaker, mono-lingual encoder are
shown in Fig. 4.5. This encoder is the same as that used in intra lingual voice conversion,
which has been built using all of the speakers in the ARCTIC database. The AFs extracted
using the multi-speaker, multi-lingual encoder are shown in Fig. 4.6. This encoder was built
by using ARCTIC (English) and Telugu data. Since aspiration is a significant articulation
in Indian languages, we used an extra bit to represent this information and the dimension
of AFs increased to 27.
From Fig. 4.5 and Fig. 4.6 it can be observed that the prediction of AFs in Fig. 4.5 is not
accurate, whereas most of the AFs are correctly predicted in Fig. 4.6. We can infer that when
multiple languages are used to build the MCEP-to-AF encoder, some form of language
normalization occurs. The prediction of AFs for a cross-lingual source speaker using such an
encoder would be more accurate. In the following experiments we used both the encoders
for a cross lingual voice conversion.
Fig. 4.5: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”, (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and mono-lingual data (English).
4.3.1 Experiments
By using the two encoders above we predicted the AFs for Telugu and Kannada na-
tive source speakers. The AFs were smoothed to normalize speaker-specific information.
Smoothed AFs were mapped onto the BDL speaker-specific model, which was built as
explained in Section 4.2. Five utterances from the two speakers were transformed into
BDL voice and we performed the MOS test and similarity test to evaluate the performance
of this transformation. Table 4.3 provides the MOS and similarity test results averaged
over all listeners. Ten native listeners of Telugu and Kannada participated
in the tests. The similarity tests indicate the closeness of the transformed speech to that of
the target speaker characteristics. Table 4.3 shows the performance using the two methods
mentioned in the previous Section. We observe that the performance using multi-speaker
and multi-language encoder is better than that using the other method. This justifies the
Fig. 4.6: (a) Waveform of the sentence “enduku babu, annadu pujari ascharyanga!”, (b) Phonologically derived AFs. (c) Acoustically derived AFs using multi-speaker and multi-lingual data (English + Telugu).
Table 4.3: Subjective evaluation of cross-lingual voice conversion models. Scores in paren-thesis are obtained using multi-speaker and multi-lingual encoder.
Source Speaker (Lang.)    Target Speaker (Lang.)    MOS    Similarity test