
    Spectral and prosodic transformations of hearing-impaired Mandarin speech

Cheng-Lung Lee a, Wen-Whei Chang a,*, Yuan-Chuan Chiang b

a Department of Communications Engineering, National Chiao-Tung University, Hsinchu 300, Taiwan, ROC
b Department of Special Education, National Hsinchu Teachers College, Hsinchu, Taiwan, ROC

Speech Communication 48 (2006) 207-219. Received 21 January 2005; received in revised form 27 July 2005; accepted 17 August 2005. doi:10.1016/j.specom.2005.08.001

* Corresponding author. Tel.: +886 3 5731826; fax: +886 3 5710116. E-mail address: [email protected] (W.-W. Chang).

    Abstract

This paper studies the combined use of spectral and prosodic conversions to enhance hearing-impaired Mandarin speech. The analysis-synthesis system is based on a sinusoidal representation of the speech production mechanism. By taking advantage of the tone structure in Mandarin speech, pitch contours are orthogonally transformed and applied within the sinusoidal framework to perform pitch modification. Also proposed is a time-scale modification algorithm that finds accurate alignments between hearing-impaired and normal utterances. Using the alignments, spectral conversion is performed on subsyllabic acoustic units by a continuous probabilistic transform based on a Gaussian mixture model. Results of perceptual evaluation indicate that the proposed system greatly improves the intelligibility and the naturalness of hearing-impaired Mandarin speech.
© 2005 Elsevier B.V. All rights reserved.

    Keywords:  Voice conversion; Prosodic modification; Spectral conversion; Hearing-impaired speaker; Sinusoidal model

    1. Introduction

Speech communication by profoundly hearing-impaired individuals suffers not only from the fact that they cannot hear other people's utterances, but also from the poor quality of their own productions. Due to the lack of adequate auditory feedback, hearing-impaired speakers produce speech with segmental and suprasegmental errors (Hochberg et al., 1983). It is common to hear their speech flawed by misarticulated phonemes, with varying degrees of severity associated with their hearing thresholds (Monsen, 1978; McGarr and Harris, 1983). Their speech intelligibility is further affected by abnormal control over phoneme duration and pitch variations. Specifically, the durations of vowels, glides, and nasals are longer, while the durations of fricatives, affricates, and plosives are shorter than in normal speech, and the pitch contour over individual syllables is either too varied or too monotonous. Their intonation also shows limited pitch variation, erratic pitch fluctuations, and inappropriate average F0 (Osberger and Levitt, 1979). This motivates our research into devising a voice conversion system that modifies the speech of a hearing-impaired (source) speaker to


be perceived as if it were uttered by a normal (target) speaker. The technique of voice conversion has applications in text-to-speech synthesis (Kain and Macon, 1998) and in improving the quality of alaryngeal speech (Bi and Qi, 1997). Most current systems (Abe et al., 1988; Stylianou et al., 1998) concentrate on the spectral envelope transformation, while the conversion of prosodic features is essentially obtained through a simple normalization of the average pitch. Such systems may lead to unsatisfactory speech conversion quality for tonal languages such as Chinese, which uses lexical tones to distinguish the meanings of syllables that have the same phonetic composition. In view of the important roles of prosody in Mandarin speech perception, further enhancement is expected from better modelling of pitch contour dynamics and from additionally incorporating prosodic transformation into the voice conversion system.

The key to solving the problem of voice conversion lies in the detection and exploitation of characteristic features that distinguish impaired speech from normal speech (Ohde and Sharf, 1992). To proceed with this, we found that the phonological structure of the Chinese language could be used to advantage in the search for the basic speech units for prosodic and spectral manipulations. Mandarin Chinese is a tonal language in which each syllable, with few exceptions, represents a morpheme (Lee, 1997). Traditional descriptions of the Chinese syllable structure divide syllables into combinations of initials and finals rather than into individual phonemes. An initial is the consonant onset of a syllable, while a final comprises a vowel or diphthong and includes a possible medial or nasal ending. Depending on the manner of articulation, initial consonants can be further categorized into five phonetic classes: fricatives, affricates, stops, nasals, and glides. To convey different lexical meanings, each syllable can be pronounced with four basic tones, namely the high-level tone (tone 1), the rising tone (tone 2), the falling-rising tone (tone 3), and the falling tone (tone 4), which are acoustically correlated with different fundamental frequency (F0) contours and use the duration and intensity of the vowel nucleus to provide secondary information. Recent perceptual work on Chinese deaf speech (Chang, 2000; Lin and Huang, 1997) has shown that speakers with greater than moderate degrees of loss (≥50 dB HL bilaterally) were perceived with an average accuracy of 31% in phoneme production, and further, that the most frequent consonant errors involved affricates and fricatives. This finding may have more serious implications for Mandarin than for other languages, as these two phonetic classes make up more than half of the consonants in Mandarin Chinese. Moreover, since most of them are palatal or produced without apparent visual cues, they are difficult to correct through speech training. In tone production, accuracy reached an average of only 54%, with most errors involving confusions between tones 1 and 4, tones 1 and 2, and tones 2 and 3. The results also showed that tones produced by speakers with profound losses were only half as likely to be judged correct as those produced by speakers with less loss. Again, as tones are produced by phonatory rather than articulatory control, they are almost impossible to correct through non-instrumental speech therapy. In view of the prevalence of these problems in hearing-impaired Mandarin speech, we propose a subsyllable-based approach to voice conversion that takes into consideration both the prosodic and the spectral characteristics. The target application of our technique is computer-assisted language learning. As the three aspects of speech, i.e., spectrum, duration, and pitch, are manipulated independently, the hearing-impaired user can, at a given stage of learning, choose to convert one aspect and use his/her own modified utterance as the target for focused correction. We believe the user can better perceive the conversion effect through auditory comparison of that aspect while errors in the other two aspects are temporarily ignored. This feature guides the user through a learning process much simpler than simply giving him/her an example spoken by a normal speaker with changes in all three aspects conglomerated.

    2. System implementation

The general approach to voice conversion consists of first analyzing the input speech to obtain characteristic features, then applying the desired transformations to these features, and finally synthesizing the corresponding signal. Essentially, the production of sound can be described as the output of passing a glottal excitation signal through a linear system representing the characteristics of the vocal tract. To track the nonstationary evolution of the characteristic features, both the spectral and prosodic manipulations will be performed on a frame-by-frame basis. In this work, speech signals were sampled at 11 kHz and analyzed using a 46.4 ms Hamming window with a 13.6 ms frame shift. Therefore, the analysis frame interval $Q$ was fixed at 13.6 ms. For the speech on the $m$th frame, the vocal tract system function can be described in terms of its amplitude function $M(\omega, m)$ and phase function $\Phi(\omega, m)$. Usually the excitation signal is represented as a periodic pulse train during voiced speech and as a noise-like signal during unvoiced speech. An alternative approach (McAulay and Quatieri, 1995) is to represent the excitation signal by a sum of $K(m)$ sine waves, each of which is associated with a frequency $\omega_k(m)$ and a phase $\Omega_k(m)$. Passing this excitation signal through the vocal tract system results in a sinusoidal representation of speech production. As noted elsewhere (Quatieri and McAulay, 1992), this sinusoidal framework allows flexible manipulation of speech parameters such as pitch and speaking rate while maintaining high speech quality.
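As a concrete illustration of these analysis settings, the sketch below (our own Python illustration, not the authors' code) segments a signal into 46.4 ms Hamming-windowed frames with a 13.6 ms shift and computes per-frame magnitude spectra, the raw material from which $P_0(m)$, $P_v(m)$, and $M(\omega, m)$ are subsequently estimated:

```python
import numpy as np

FS = 11000                    # sampling rate in Hz, as in the paper
WIN = int(0.0464 * FS)        # 46.4 ms Hamming window (~510 samples)
HOP = int(0.0136 * FS)        # 13.6 ms frame shift Q (~149 samples)

def analysis_frames(speech):
    """Split a 1-D signal into overlapping Hamming-windowed frames and
    return the short-time magnitude spectra used by the analysis stage."""
    window = np.hamming(WIN)
    n_frames = 1 + max(0, (len(speech) - WIN) // HOP)
    spectra = []
    for m in range(n_frames):
        frame = speech[m * HOP : m * HOP + WIN] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)  # one magnitude spectrum per frame m
```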

A block diagram of the proposed voice conversion system is shown in Fig. 1. The system has five major components: speech analysis, spectral conversion, pitch modification, time-scale modification, and speech synthesis. The analysis begins by estimating, from the Fourier transform of the input speech, the pitch period $P_0(m)$, the voicing probability $P_v(m)$, and the system amplitude function $M(\omega, m)$. The voicing probability will be used to control the harmonic spectrum cutoff frequency, $\omega_c(m) = \pi P_v(m)$, below which voiced speech is synthesized and above which unvoiced speech is synthesized. The second step in the analysis is to represent the system amplitude function $M(\omega, m)$ in terms of a set of cepstral coefficients $\{c_l(m)\}_{l=0}^{24}$.

The main attraction of the cepstral representation is that it exploits the minimum-phase model, in which the log-magnitude and phase of the vocal tract system function are uniquely related through the Hilbert transform (Oppenheim and Schafer, 1989). A more comprehensive discussion of the sine-wave speech model and the corresponding analysis-synthesis system can be found in (McAulay and Quatieri, 1995).
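To make the minimum-phase relation concrete, the sketch below evaluates both the log-magnitude and the phase of the vocal tract system function from the cepstral coefficients, so that amplitudes and phases can later be sampled at arbitrary sine-wave frequencies. This is our illustration under one common cepstral convention; the paper does not spell out its scaling factors:

```python
import numpy as np

def envelope_from_cepstrum(c, omega):
    """Evaluate log-magnitude and minimum-phase response at frequencies
    `omega` (radians) from cepstral coefficients c[0..24].  Under the
    minimum-phase model the phase is the Hilbert transform of the
    log-magnitude, which in the cepstral domain reduces to a sign flip
    on the sine series."""
    l = np.arange(len(c))                      # quefrency indices 0..24
    log_mag = np.cos(np.outer(omega, l)) @ c   # log M(w) =  sum_l c_l cos(l*w)
    phase = -np.sin(np.outer(omega, l)) @ c    # Phi(w)   = -sum_l c_l sin(l*w)
    return log_mag, phase
```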

The main part of the modification procedure involves the manipulation of the functions that describe the amplitude and phase of the excitation and vocal tract system contributions to each sine-wave component. The effectiveness of voice conversion depends on a successful modification of prosodic features, especially of the time scale and the pitch scale. With reference to the sinusoidal framework, the speech parameters included in the prosodic conversion are $P_0(m)$, $P_v(m)$, and the synthesis frame interval. The time-scale modification involves scaling the synthesis frame of original duration $Q$ by a factor $q(m)$, i.e., $Q'(m) = q(m)Q$. The pitch modification can be viewed as a transformation which, when applied to the pitch period $P_0(m)$, yields the new pitch period $P_0'(m)$, with an associated change in the F0 as $\omega_0'(m) = 2\pi / P_0'(m)$. It is worth noting that the change in pitch period also corresponds to a modification of the sine-wave frequencies $\omega_k'(m)$ and the excitation phases $\Omega_k'(m)$ used in the reconstruction. Below the cutoff frequency the sine-wave frequencies are harmonically related as $\omega_k'(m) = k\,\omega_0'(m)$, whereas above the cutoff frequency $\omega_k'(m) = k^*\omega_0'(m) + (k - k^*)\,\omega_u$, where $k^*$ is the largest value of $k$ for which $k\,\omega_0'(m) \le \omega_c(m)$, and where $\omega_u$ is the unvoiced F0 corresponding to 100 Hz.

    Fig. 1. Block diagram of the voice conversion system.


A two-step procedure is used in estimating the excitation phase $\Omega_k'(m)$ of the $k$th sine wave. The first step is to obtain the onset time $n_0'(m)$ relative to both the new pitch period $P_0'(m)$ and the new frame interval $Q'(m)$. This is done by accumulating a succession of pitch periods until a pitch pulse crosses the center of the $m$th frame. The location of this pulse is the onset time $n_0'(m)$ at which the sine waves are in phase. The second step is to compute the excitation phase as follows:

$$\Omega_k'(m) = n_0'(m)\,\omega_k'(m) + \epsilon_k'(m), \qquad (1)$$

where the unvoiced phase component $\epsilon_k'(m)$ is zero for the case of $\omega_k'(m) \le \omega_c(m)$ and is made random on $[-\pi, \pi]$ for the case of $\omega_k'(m) > \omega_c(m)$.

In addition to prosodic conversion, the technique of spectral conversion is also needed to modify the articulation-related parameters of speech. The problem with spectral conversion lies in the corresponding modification of the vocal tract system function. Thus there is a need to estimate the amplitude function $M'(\omega, m)$ and the phase function $\Phi'(\omega, m)$ of the vocal tract system. If it is assumed that the vocal tract system function is minimum phase (Oppenheim and Schafer, 1989), the log-magnitude and phase functions form a Hilbert transform pair and hence can be estimated from a set of new cepstral coefficients $\{c_l'(m)\}_{l=0}^{24}$. The system amplitudes $M_k'(m)$ and phases $\Phi_k'(m)$ are then given by samples of their respective functions at the new frequencies $\omega_k'(m)$, i.e., $M_k'(m) = M'(\omega_k', m)$ and $\Phi_k'(m) = \Phi'(\omega_k', m)$. Finally, in the synthesizer the system amplitudes are linearly interpolated over two consecutive frames. Also, the excitation and system phases are summed and the resulting sine-wave phases, $\theta_k'(m) = \Omega_k'(m) + \Phi_k'(m)$, are interpolated using a cubic polynomial interpolator. The final synthetic speech waveform on the $m$th frame is given by

$$\hat{s}(n) = \sum_{k=1}^{K(m)} M_k'(m) \cos\left[n\,\omega_k'(m) + \theta_k'(m)\right], \qquad t_m \le n \le t_{m+1} - 1, \qquad (2)$$

where $t_m = \sum_{i=1}^{m-1} Q'(i)$ denotes the starting time of the current synthesis frame.
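A minimal sketch of the synthesis rule in Eq. (2) follows (our illustration, not the authors' code); for brevity it omits the linear amplitude interpolation and cubic phase interpolation applied across consecutive frames:

```python
import numpy as np

def synthesize_frame(M, omega, theta, t_m, t_next):
    """Sum-of-cosines synthesis of one frame per Eq. (2): K(m) sine waves
    with system amplitudes M_k', frequencies w_k' in radians per sample,
    and combined phases theta_k' = Omega_k' + Phi_k'."""
    n = np.arange(t_m, t_next)[:, None]   # sample times t_m .. t_{m+1}-1
    return (M * np.cos(n * omega + theta)).sum(axis=1)
```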

    3. Time-scale modification

As stated earlier (Osberger and Levitt, 1979), the speech of hearing-impaired speakers contains numerous timing errors, including a lower speaking rate, insertion of long pauses, and failure to modify segment duration as a function of phonetic environment. In Fig. 2 the mean phoneme durations produced by the hearing-impaired speakers are plotted against those produced by the normal speakers. Data were collected from two normal speakers (one male and one female) and three hearing-impaired speakers (one male and two females), all aged 15. The phonemes tested were five fricatives, six affricates, and three vowels. It can be seen that the mean duration ratios of impaired-to-normal utterances were quite different for different phonemes and that vowels, as a group, stayed much more in line with normal production than the two consonant groups, with the mean ratios for vowels, fricatives, and affricates being 1.12, 0.4, and 0.34, respectively. All consonants of the hearing impaired, with the exception of /h/, were shorter, as indicated by their uniform appearance on the lower half of the graph. Our perceptual judgment showed that this shortening, which could measure 10 to 1 (as seen in /sh/ and /shi/), was the result of substituting the two consonant classes with stops. These deviations can be corrected only when the system knows what the speaker meant to say. However, in analyzing the articulation errors made by hearing-impaired speakers, one usually finds a lack of a one-to-one substitution pattern between the error and the target, making it impractical to incorporate automatic speech recognition into our system. Instead, our application uses a text prompt on the computer screen to elicit targets from the user during testing, a practice that has been widely used by commercial software for speech training and correction.

Fig. 2. Phoneme duration statistics for: (a) vowels and fricatives and (b) affricates.

For the converted speech to carry the naturalness of human speech, the durations of individual phonemes need to match those found in natural speech. This can be done by modifying the interval of each synthesis frame by a time-varying factor $q(m)$ such that $Q'(m) = q(m)Q$. The case $q(m) > 1$ corresponds to a time-scale expansion, while the case $q(m) < 1$ corresponds to a time-scale compression. The next step is to determine the time-scaling factor $q(m)$ based on spectral representations of the same syllable uttered by the source and target speakers. In describing the source speaker's spectral envelope, cepstral coefficients are measured frame by frame and are of the following form: $X = \{\mathbf{x}(m_x),\ m_x = 1, 2, \ldots, T_x\}$, where $T_x$ is the syllable duration in frames. Similarly, $Y = \{\mathbf{y}(m_y),\ m_y = 1, 2, \ldots, T_y\}$ is the sequence of $T_y$ cepstral vectors representing the target speaker's spectral envelope. Acoustic analysis of Mandarin hearing-impaired speech has indicated that unvoiced sounds such as consonants may not be subject to the same scaling as the vowels. Thus, for time-scaling of speech, different approaches should be applied in the time intervals where the frames corresponding to both speakers were marked as Mandarin initials or finals. The boundary between the initial and final parts of an isolated syllable is relatively easy to detect by a voiced/unvoiced decision based on the voicing probability $P_v$. Let $B_x$ and $B_y$ represent the starting frames of the final subsyllables in the source and target utterances, respectively. For constituent frames of the initial consonant, a linear time normalization was applied with a fixed factor $q = (B_y - 1)/(B_x - 1)$. With regard to the final subsyllables, two sets of paired cepstral vectors, $\{\mathbf{x}(m_x),\ B_x \le m_x \le T_x\}$ and $\{\mathbf{y}(m_y),\ B_y \le m_y \le T_y\}$, were time aligned using the procedure of dynamic time warping (DTW) (Rabiner and Juang, 1993). Usually the problem of DTW is formulated as a path-finding problem over a finite range of grid points $(m_x, m_y)$. The basic strategy applied here is to interpret the slope of the DTW path as a time-scaling function, which indicates on a frame-by-frame basis how much to shorten or lengthen each frame of the source utterance in order to reproduce the same duration as in the target utterance.

The DTW aims to align the two utterances with a path through a matrix of similarity distances that minimizes the sum of the distances. We begin by defining a partial accumulated distance $D_A(m_x, m_y)$, representing the accumulated distance along the best path from the point $(B_x, B_y)$ to the point $(m_x, m_y)$. For an efficient implementation, a dynamic programming recursion is applied to compute $D_A(m_x, m_y)$ for all local paths that reach $(m_x, m_y)$ in exactly one step from an intermediate point $(m_x', m_y')$, using a set of local path constraints. Table 1 summarizes the local constraints and slope weights for the three local paths, $\wp_1$, $\wp_2$, and $\wp_3$, chosen for the implementation. The local distance $d(m_x, m_y)$ between the time-aligned pairs of cepstral vectors is defined by a squared Euclidean distance. We summarize the dynamic programming implementation for finding the time-scaling factor at every frame of a final subsyllable as follows:

(1) Initialization: Set $D_A(B_x, B_y) = d(B_x, B_y)$.
(2) Recursion: For $B_x + 1 \le m_x \le T_x$ and $B_y + 1 \le m_y \le T_y$, compute

$$D_A(m_x, m_y) = \min_{(m_x', m_y')} \left[ D_A(m_x', m_y') + \zeta\big((m_x', m_y'), (m_x, m_y)\big) \right], \qquad (3)$$

where the incremental distortion $\zeta((m_x', m_y'), (m_x, m_y))$ and the intermediate point $(m_x', m_y')$ along the three local paths $\wp_1$, $\wp_2$, and $\wp_3$ are given in Table 1.
(3) Path backtracking: According to the optimal DTW path, we define the time-scaling factor $q(m) = 0.5$, $1$, or $2$ for the case where the move from the point $(m_x', m_y')$ to the point $(m_x, m_y)$ is via the local path $\wp_1$, $\wp_2$, or $\wp_3$, respectively.
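The recursion of Eq. (3) and the backtracking step can be sketched as follows. This is our Python illustration under the path constraints of Table 1, with indices taken 0-based relative to $(B_x, B_y)$, simplified boundary handling, and one scaling factor emitted per move rather than per source frame:

```python
import numpy as np

def time_scale_factors(X, Y):
    """DTW over final-subsyllable cepstral vectors X (source, Tx x D) and
    Y (target, Ty x D).  Returns the time-scaling factors q read off the
    optimal path: 0.5, 1, or 2 for local paths p1, p2, p3 of Table 1."""
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean d(mx, my)
    Tx, Ty = d.shape
    DA = np.full((Tx, Ty), np.inf)
    move = np.zeros((Tx, Ty), dtype=int)   # which local path reached each point
    DA[0, 0] = d[0, 0]
    for mx in range(1, Tx):
        for my in range(1, Ty):
            cands = [(DA[mx-1, my-1] + d[mx, my], 2)]              # p2
            if mx >= 2:                                            # p1 (q = 0.5)
                cands.append((DA[mx-2, my-1]
                              + 0.5*d[mx-1, my] + 0.5*d[mx, my], 1))
            if my >= 2:                                            # p3 (q = 2)
                cands.append((DA[mx-1, my-2]
                              + 0.5*d[mx, my-1] + 0.5*d[mx, my], 3))
            DA[mx, my], move[mx, my] = min(cands)                  # Eq. (3)
    # Backtrack from the end point, reading off q for each move.
    steps = {1: (2, 1, 0.5), 2: (1, 1, 1.0), 3: (1, 2, 2.0)}
    q, mx, my = [], Tx - 1, Ty - 1
    while mx > 0 and my > 0:
        dx, dy, factor = steps[move[mx, my]]
        q.append(factor)
        mx, my = mx - dx, my - dy
    return q[::-1]
```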


    4. Pitch modification

The four basic Mandarin tones mentioned earlier have distinctive F0 contour shapes, whose perception is correlated with the starting frequency, the initial fall, and the timing of the turning point, as involved in tones 2 and 3 (Shen and Lin, 1991). Our teenage data supported this general conclusion with a different measure. Specifically, instead of focusing on the interactions between the frequency and temporal aspects, we recorded the frequency differences between the highest and the lowest points found on the contours. The results showed a clear trend for the normal speakers, with the difference increasing from tone 1 to tone 4 (e.g., 19.6, 24.8, 53, and 113.1 Hz), which was less orderly (e.g., 9.3, 1.6, 28, and 66.4 Hz) for the impaired speakers. The most frequent perceptual mistakes made by our impaired speakers were substitutions of tone 3 with tone 2, which left only three perceptual categories, 1, 2, and 4, in their tonal repertoire. Unstable tonal productions across recorded tokens were also common.

Most current approaches to voice conversion make little or no use of pitch measures, despite evidence showing that intonational information is highly correlated with speech individuality. The main reason for this is the difficulty in finding an appropriate feature set that captures linguistically relevant intonational information. This problem is alleviated in the Mandarin speech conversion task, as its tonal system allows relatively non-overlapping characterizations of the corresponding F0 contour dynamics. Speech enhancement can therefore be realized by a proper analysis and control of the F0 contour dynamics. Since pitch is defined only for voiced speech, the pertinent tone-related portions of syllables are the vowel or diphthong nuclei from which distinctive pitch changes are perceived. Recognizing this, we need only concatenate the F0 values of the final subsyllable into a vector and represent it by a small, linguistically motivated parameter set. Unlike conventional frame-based VQ approaches (Abe et al., 1988), this segment-based approach makes it possible to convert not only the static characteristics but also the dynamic characteristics of F0 contours.

Choosing an appropriate representation of the F0 contour is the first step in applying pitch modification to voice conversion. By taking advantage of the simple tone structure of F0 contours in Mandarin speech, a polynomial curve-fitting technique is used to decompose the F0 contour into mutually orthogonal components in the transform domain (Chen and Wang, 1990). The F0 contour can therefore be represented by a smooth curve formed by orthogonal expansion using some low-order transform coefficients. In describing the source speaker's F0 contour, F0 values are measured only for the final subsyllable and are of the form $\{w_0(m_x),\ B_x \le m_x \le T_x\}$. For notational convenience, the F0 contour of a segment with $I_x + 1$ frames is rewritten as $\{w_0(i_x),\ 0 \le i_x \le I_x\}$, where $i_x = m_x - B_x$ and $I_x = T_x - B_x$. Parameters for pitch modification are then extracted from the F0 contour segment by the orthogonal polynomial transform:

$$b_j^{(x)} = \frac{1}{I_x + 1} \sum_{i_x=0}^{I_x} w_0(i_x)\, \Psi_j\!\left(\frac{i_x}{I_x}\right), \qquad j = 0, 1, 2, 3. \qquad (4)$$

Due to the smoothness of an F0 contour segment (Chen and Wang, 1990), the first four discrete Legendre polynomials are chosen as the basis functions $\Psi_j(\cdot)$ to represent it. Based on this orthogonal polynomial representation, the source F0 contour is characterized by a 4-dimensional feature vector, $\mathbf{b}^{(x)} = (b_0^{(x)}, b_1^{(x)}, b_2^{(x)}, b_3^{(x)})^T$, which will be quantized using the vector quantization (VQ) technique. Similarly, $\mathbf{b}^{(y)} = (b_0^{(y)}, b_1^{(y)}, b_2^{(y)}, b_3^{(y)})^T$ is a feature vector representing the F0 contour of the target speaker.

Table 1
Incremental distortions and slope weights for local paths

Path       $(m_x', m_y')$           $\zeta((m_x', m_y'), (m_x, m_y))$
$\wp_1$    $(m_x - 2,\ m_y - 1)$    $\frac{1}{2} d(m_x - 1, m_y) + \frac{1}{2} d(m_x, m_y)$
$\wp_2$    $(m_x - 1,\ m_y - 1)$    $d(m_x, m_y)$
$\wp_3$    $(m_x - 1,\ m_y - 2)$    $\frac{1}{2} d(m_x, m_y - 1) + \frac{1}{2} d(m_x, m_y)$


Our conversion technique is based on codebook mapping and consists of two steps: a learning step and a conversion-synthesis step. In the learning step, the source and target F0 codebooks were separately generated using the orthogonal polynomial representation of F0 contours in the training utterances. Each of the two codebooks includes 16 codevectors and is designed using the well-known LBG algorithm (Linde et al., 1980). Next, a histogram of the correspondence between the codebook elements of the two speakers is calculated. Using this histogram as a weighting function, the mapping codebook is defined as a linear combination of the target F0 codevectors. In the conversion-synthesis step, the F0 contour of the input speech is orthogonally transformed and vector-quantized using the source F0 codebook. Pitch modification is then carried out by decoding with the mapping codebook. If the decoded codevector is $\hat{\mathbf{b}} = (\hat{b}_0, \hat{b}_1, \hat{b}_2, \hat{b}_3)^T$, the modified F0 for frame $m_x = i_x + B_x$ can be approximated as

$$w_0'(i_x + B_x) = \sum_{j=0}^{3} \hat{b}_j\, \Psi_j\!\left(\frac{i_x}{I_x}\right), \qquad 0 \le i_x \le I_x. \qquad (5)$$
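The transform pair of Eqs. (4) and (5) can be sketched as below. This is our illustration: it orthonormalizes sampled Legendre polynomials numerically as a stand-in for the paper's discrete Legendre polynomials, so its normalization differs from the explicit $1/(I_x + 1)$ factor of Eq. (4). The 16-entry F0 codebooks would be trained, via LBG, on the 4-dimensional coefficient vectors returned here.

```python
import numpy as np

def legendre_basis(n_frames, order=4):
    """Orthonormal basis over n_frames uniformly spaced points, obtained
    by QR-orthonormalizing Legendre polynomials of degrees 0..3."""
    u = np.linspace(-1.0, 1.0, n_frames)
    V = np.stack([np.polynomial.legendre.Legendre.basis(j)(u)
                  for j in range(order)], axis=1)
    Q, _ = np.linalg.qr(V)    # columns play the role of Psi_j
    return Q                  # shape (n_frames, order)

def f0_to_coeffs(w0):
    """Project an F0 contour segment onto the basis, cf. Eq. (4)."""
    return legendre_basis(len(w0)).T @ w0    # b = (b0, b1, b2, b3)

def coeffs_to_f0(b, n_frames):
    """Reconstruct a smooth F0 contour from decoded coefficients, cf. Eq. (5)."""
    return legendre_basis(n_frames) @ b
```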

    5. Spectral conversion

In addition to prosodic conversion, the general voice conversion task also necessitates a mapping of spectral envelopes from one speaker to another. Mandarin is a syllable-timed language in which each syllable consists of an initial part and a final part. The primary difficulties in the recognition of Mandarin syllables are tied to the durational differences between the syllable-initial and syllable-final parts. Specifically, the initial part of a syllable is short compared with the final part, which usually causes distinctions among the initial consonants of different syllables to be swamped by the following irrelevant differences among the finals. This may help explain why early approaches that used whole-syllable models as the conversion units did not produce satisfactory results for Mandarin speech conversion. To circumvent this pitfall, we perform spectral conversion only after decomposing the Mandarin syllables into smaller sound units corresponding to phonetic classes.

The acoustic features included in the conversion are cepstral coefficients derived from the smoothed spectrum. The conversion system design involves two essential problems: (1) developing a parametric model representative of the distribution of cepstral coefficients, and (2) mapping the spectral envelopes of the source speaker onto those of the target. In the context of spectral transformation, Gaussian mixture models (GMMs) have been shown to provide performance superior to other approaches based on VQ or neural networks (Stylianou et al., 1998). Our approach begins with a training phase in which all cepstral vectors of the same phonetic class are collected and used to train the corresponding GMM associated with that phonetic class by a supervised learning procedure. We consider that the available data consist of two sets of time-aligned cepstral vectors $\mathbf{x}_t$ and $\mathbf{y}_t$, corresponding, respectively, to the spectral envelopes of the source and the target speakers. The GMM assumes that the probability distribution of the cepstral vectors $\mathbf{x}$ takes the following parametric form:

$$p(\mathbf{x}) = \sum_{i=1}^{I} \alpha_i\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_i^{x}, \boldsymbol{\Sigma}_i^{xx}), \qquad (6)$$

where $\alpha_i$ denotes the weight of class $i$, $I = 24$ denotes the total number of Gaussian mixtures, and $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i^x, \boldsymbol{\Sigma}_i^{xx})$ denotes the normal distribution with mean vector $\boldsymbol{\mu}_i^x$ and covariance matrix $\boldsymbol{\Sigma}_i^{xx}$. It therefore follows from Bayes' theorem that a given vector $\mathbf{x}$ is generated from the $i$th class of the GMM with the probability

$$h_i(\mathbf{x}) = \frac{\alpha_i\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_i^x, \boldsymbol{\Sigma}_i^{xx})}{\sum_{j=1}^{I} \alpha_j\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_j^x, \boldsymbol{\Sigma}_j^{xx})}. \qquad (7)$$

With this, cepstral vectors are converted from the source speaker to the target speaker by a conversion function that utilizes the feature parameter correlation between the two speakers. The conversion function that minimizes the mean squared error between converted and target cepstral vectors is given by (Stylianou et al., 1998)

$$F(\mathbf{x}_t) = \sum_{i=1}^{I} h_i(\mathbf{x}_t)\left[\boldsymbol{\mu}_i^{y} + \boldsymbol{\Sigma}_i^{yx}(\boldsymbol{\Sigma}_i^{xx})^{-1}(\mathbf{x}_t - \boldsymbol{\mu}_i^{x})\right], \qquad (8)$$

where, for class $i$, $\boldsymbol{\mu}_i^y$ denotes the mean vector of the target cepstra, $\boldsymbol{\Sigma}_i^{xx}$ denotes the covariance matrix of the source cepstra, and $\boldsymbol{\Sigma}_i^{yx}$ denotes the cross-covariance matrix.
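A sketch of the conversion of Eqs. (7) and (8) applied to a single source cepstral vector is given below; it is our illustration of the published formulas, with the model parameters assumed already trained. In the full system it would be applied frame by frame, using the GMM of the phonetic class to which the current subsyllabic unit belongs.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert(x, alpha, mu_x, mu_y, cov_xx, cov_yx):
    """MMSE conversion function F(x) of Eq. (8); model arrays are indexed
    by mixture class i = 0..I-1."""
    # Posterior probability h_i(x) of each class given x, Eq. (7).
    lik = np.array([a * multivariate_normal.pdf(x, m, c)
                    for a, m, c in zip(alpha, mu_x, cov_xx)])
    h = lik / lik.sum()
    # Posterior-weighted class-conditional regressions, Eq. (8).
    y = np.zeros_like(mu_y[0])
    for i, hi in enumerate(h):
        y += hi * (mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i]))
    return y
```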

Within the GMM framework, training the conversion function can be formulated as the optimal estimation of the model parameters $\lambda = \{\alpha_i, \boldsymbol{\mu}_i^x, \boldsymbol{\mu}_i^y, \boldsymbol{\Sigma}_i^{xx}, \boldsymbol{\Sigma}_i^{yx}\}$. Our approach to parameter estimation is based on fitting a GMM to the probability distribution of the joint vectors $\mathbf{z}_t = [\mathbf{x}_t, \mathbf{y}_t]^T$ for the source and target cepstra. The covariance matrix $\boldsymbol{\Sigma}_i^z$ and mean vector $\boldsymbol{\mu}_i^z$ of class $i$ for the joint vectors can be written as

$$\boldsymbol{\Sigma}_i^{z} = \begin{pmatrix} \boldsymbol{\Sigma}_i^{xx} & \boldsymbol{\Sigma}_i^{xy} \\ \boldsymbol{\Sigma}_i^{yx} & \boldsymbol{\Sigma}_i^{yy} \end{pmatrix}, \qquad \boldsymbol{\mu}_i^{z} = \begin{pmatrix} \boldsymbol{\mu}_i^{x} \\ \boldsymbol{\mu}_i^{y} \end{pmatrix}. \qquad (9)$$

The expectation-maximization (EM) algorithm (Dempster et al., 1977) is applied here to estimate the model parameters, which guarantees a monotonic increase in the likelihood. Starting with an initial model $\lambda$, the new model $\bar{\lambda}$ is estimated by maximizing the auxiliary function

$$Q(\lambda, \bar{\lambda}) = \sum_{t=1}^{T} \sum_{i=1}^{I} p(i \mid \mathbf{z}_t, \lambda)\, \log p(i, \mathbf{z}_t \mid \bar{\lambda}), \qquad (10)$$

where

$$p(i, \mathbf{z}_t \mid \bar{\lambda}) = \bar{\alpha}_i\, \mathcal{N}(\mathbf{z}_t;\, \bar{\boldsymbol{\mu}}_i^z, \bar{\boldsymbol{\Sigma}}_i^z), \qquad (11)$$

and

$$p(i \mid \mathbf{z}_t, \lambda) = \frac{\alpha_i\, \mathcal{N}(\mathbf{z}_t;\, \boldsymbol{\mu}_i^z, \boldsymbol{\Sigma}_i^z)}{\sum_{j=1}^{I} \alpha_j\, \mathcal{N}(\mathbf{z}_t;\, \boldsymbol{\mu}_j^z, \boldsymbol{\Sigma}_j^z)}. \qquad (12)$$

On each EM iteration, the reestimation formulas derived for the individual parameters of class $i$ are of the form

$$\bar{\alpha}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \mathbf{z}_t, \lambda), \qquad (13)$$

$$\bar{\boldsymbol{\mu}}_i^{z} = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{z}_t, \lambda)\, \mathbf{z}_t}{\sum_{t=1}^{T} p(i \mid \mathbf{z}_t, \lambda)}, \qquad (14)$$

$$\bar{\boldsymbol{\Sigma}}_i^{z} = \frac{\sum_{t=1}^{T} p(i \mid \mathbf{z}_t, \lambda)\, (\mathbf{z}_t - \bar{\boldsymbol{\mu}}_i^z)(\mathbf{z}_t - \bar{\boldsymbol{\mu}}_i^z)^T}{\sum_{t=1}^{T} p(i \mid \mathbf{z}_t, \lambda)}. \qquad (15)$$

The new model $\bar{\lambda}$ then becomes $\lambda$ for the next iteration, and the reestimation process is repeated until the likelihood converges.
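For concreteness, one EM reestimation pass over the joint vectors, following Eqs. (12)-(15), might look as follows (our sketch; initialization and the convergence test are omitted, and the blocks of the resulting covariances yield the $\boldsymbol{\Sigma}_i^{xx}$ and $\boldsymbol{\Sigma}_i^{yx}$ of Eq. (9) needed by the conversion function):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(Z, alpha, mu, cov):
    """One EM pass on joint vectors Z (T x 2D), with Z_t = [x_t, y_t]."""
    T, I = len(Z), len(alpha)
    post = np.empty((T, I))
    for i in range(I):            # p(i | z_t, lambda), Eq. (12)
        post[:, i] = alpha[i] * multivariate_normal.pdf(Z, mu[i], cov[i])
    post /= post.sum(axis=1, keepdims=True)
    new_alpha = post.mean(axis=0)                         # Eq. (13)
    new_mu, new_cov = [], []
    for i in range(I):
        g = post[:, i]
        m = (g[:, None] * Z).sum(0) / g.sum()             # Eq. (14)
        D = Z - m
        outer = np.einsum('td,te->tde', D, D)
        new_cov.append((g[:, None, None] * outer).sum(0) / g.sum())  # Eq. (15)
        new_mu.append(m)
    return new_alpha, np.array(new_mu), np.array(new_cov)
```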

    6. Experimental results

Experiments were carried out to investigate the potential advantages of using the proposed conversion algorithms to enhance hearing-impaired Mandarin speech. Our efforts began with the collection of a speech corpus that contained two sets of monosyllabic utterances, one for system learning and one for testing in our voice conversion experiment. The text material consisted of 76 isolated tonal CV syllables (19 base syllables × 4 tones), formed by pairing the three prominent vowels /a, i, u/ with 11 consonants, the five fricatives and the six affricates of Mandarin Chinese, but excluding combinations that were phonologically unacceptable. The choice of these two classes was based on research findings showing that these consonants are the most frequently misarticulated sounds produced by hearing-impaired Mandarin speakers (Lee, 1999). Speech samples were produced by two male adult speakers, one with normal hearing sensitivity and the other with congenital severe-to-profound (>70 dB) hearing loss. The speech of the impaired speaker was largely intelligible in sentences but often caused misunderstanding when produced in syllable form, due to prosodic deviations and misarticulated initial consonants.

Fig. 3 presents the results of our pitch modification method for transforming F0 contours. Panels 3(a) and 3(c) are the F0 contours for the source and the target syllable /ti/ spoken with four different tones, and panel 3(b) is the converted F0 contour using VQ and the orthogonal polynomial representation. Comparison of the F0 variations as a function of time in panel 3(b) with those in 3(a) clearly shows the improvements on tones 2 and 3. Our next examination focused on how the converted F0 contours were perceived in relation to those of the source. For easy judgment of the tonal categories, only syllables with one consonant class (affricates) were used, with a total of 40 tonal syllables (10 for each tone). Four male and one female adult native speakers of Mandarin Chinese, all with normal hearing status, served as the listeners. Tables 2 and 3 present the confusion matrices showing the tone recognition results for the source and the converted sets, respectively. The results in each table were based on the listeners' judgments of 400 responses (40 tonal syllables × 5 listeners × 2 sessions). It is clear that the proposed system resulted in more intelligible stimuli, with an average tone recognition score of 86.25%, compared with 69% for the source stimuli. The results further showed improvements of 38% and 28% for syllables with tone 2 and tone 3, respectively.

To establish the statistical significance of these results, we calculated the $P$-value using a $Z$-test (Johnson and Bhattacharyya, 1996). If we let $p_1$ and $p_2$ denote the recognition rates for the source and converted sets, respectively, our objective was to test the null hypothesis $H_0: p_1 \ge p_2$. Based on the statistics in Table 4, the $Z$-test yielded a small $P$-value ($P < 0.0002$); therefore, the null hypothesis was strongly rejected.

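Using the counts in Table 4, the result can be reproduced with a standard pooled two-proportion $Z$-test; the sketch below applies the textbook formula, which may differ in detail from the authors' exact computation:

```python
import numpy as np
from scipy.stats import norm

n = 400                                 # responses per stimulus set
p1, p2 = 276 / n, 345 / n               # recognition rates from Table 4

# One-sided pooled Z-test of H0: p1 >= p2.
p_pool = (276 + 345) / (2 * n)
z = (p2 - p1) / np.sqrt(p_pool * (1 - p_pool) * (2 / n))
p_value = norm.sf(z)                    # upper-tail probability
print(f"z = {z:.2f}, P = {p_value:.1e}")  # z ~ 5.9, P well below 0.0002
```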

Further evidence of improvement is seen in Fig. 4, which shows our prosodic modification applied to continuous speech. A four-syllable utterance, containing tones 4-4-3-3, was used. According to the tone-sandhi rule, the first tone 3 should be produced with a tone 2 F0 pattern. The audio presentation, however, showed that the first tone 3 was produced more like tone 1 than the targeted tone 2. A comparison of the F0 contours for the source and the target utterances showed that the former exhibited fewer fine fluctuation details, even though the variation ranges were both within 100 Hz. Further, the first tone 4 was essentially carrying a tone 1 F0 pattern, and the last tone 3 was produced with the rising part truncated. The improvement due to prosodic modification can be seen in the following areas. First, the missing falling part in the first tone 4 and the dipping of the last tone 3 were fully restored. Second, the rising part of the first tone 3 segment was steeper in slope, making it more appropriate for the targeted tone 2. To hear audio examples of the voice conversion system, please visit the web site at http://a61.cm.nctu.edu.tw/demo.

    Fig. 3. F0 contours for syllable /ti/ spoken with four different tones: (a) source speech, (b) converted speech, and (c) target speech.

Table 2
Confusion matrix showing tone recognition results for source syllables

Response    Stimulus
            Tone 1    Tone 2    Tone 3    Tone 4
Tone 1      98        32        16        0
Tone 2      0         43        35        1
Tone 3      2         23        45        9
Tone 4      0         2         4         90

Table 3
Confusion matrix showing tone recognition results for converted syllables

Response    Stimulus
            Tone 1    Tone 2    Tone 3    Tone 4
Tone 1      97        0         1         0
Tone 2      3         81        26        1
Tone 3      0         19        73        5
Tone 4      0         0         0         94

Table 4
Raw data and tone recognition rates derived from Tables 2 and 3

                     Number of correct    Number of wrong    Recognition
                     identifications      identifications    rate
Source stimuli       276                  124                p1 = 276/400
Converted stimuli    345                  55                 p2 = 345/400



Results of the spectral conversion were analyzed acoustically with a software spectrograph to assess how closely the converted speech resembled the target speech in rendering the acoustic cues for phoneme perception. The improvement for the fricatives is shown in three aspects: (1) lengthening of the consonant duration, (2) a less abrupt transition, or a gradual blending of the acoustic energy, near the consonant-vowel boundary, and (3) a redistribution of acoustic energy around appropriate frequency regions, such as an elevation to 3 kHz for the syllable /shu/ or to 4 kHz for the syllable /shii/. An example of such spectral differences for the syllable /shu/ is shown in Fig. 5. Even closer spectrographic matches were obtained for the affricates, as shown in Fig. 6 using /chii/ as an example. In normal production, affricates are stops followed by fricatives, which are individually represented on the spectrograph as a burst with its energy concentrated at higher frequencies, blended immediately with the energy of the following fricative. The distorted affricate, however, was translated spectrographically into a stop that included a full voicing gap but not much frication. Our analysis revealed that the conversion filled the gap, softened the burst, removed the low-frequency energy, and elevated the fricative portion to normal frequency ranges. When examined along with the audio presentations, this modification also resulted in a change of the vowel percept from the erroneous high front lip-rounded vowel /yu/ to the correct /i/, even though formant modification for the vowel was less apparent.

Two listening tests, preference and intelligibility, were conducted to determine whether the above spectrographic enhancement could also be realized perceptually. The five listeners from the previous tone recognition test were used. In the preference test, the listeners were asked to give their preference judgments over pairs of source vs. converted syllables. A two-alternative forced-choice (2AFC) test paradigm was used, in which the presentation order of the two stimuli was randomized.

Fig. 4. F0 contours for a four-syllable phrase /ying-4 yong-4 ruan-3 ti-3/ (meaning "application software"): (a) source speech, (b) converted speech, and (c) target speech.


For the converted stimuli, two sets of converted syllables were used: (1) those with spectral conversion only and (2) those with combined spectral and time-scale conversions. The results showed that 62% of the 380 responses (2 stimulus sets × 19 base syllables × 5 listeners × 2 sessions) preferred the spectrally modified syllables to the source syllables, while 84% preferred those with combined modifications. To further validate the effect of the proposed approach, intelligibility measures were obtained for 19 base syllables before and after spectral conversion.

    Fig. 5. Spectrograms for syllable /shu/: (a) source speech, (b) converted speech, and (c) target speech.

    Fig. 6. Spectrograms for syllable /chii/: (a) source speech, (b) converted speech, and (c) target speech.


The listeners were instructed to write down their responses using Mandarin phonetic symbols. Fig. 7 shows a comparison of the percent correct phoneme recognition scores for the source and the converted stimuli. Individual phonemes were arranged from left to right into three groups: fricatives, affricates, and vowels. Recognition of the vowels /a, u/ was near perfect even without the modification. In contrast, recognition of the affricates and the fricatives (with the exception of /h/) was either near or at 0%, a finding consistent with our earlier observation that these two consonant classes are frequently substituted with stops by the hearing-impaired speakers. The relatively good recognition of /h/, even in the source, could be explained by the fact that little oral modification of the glottal air source is required during its articulation. With the converted stimuli, an improvement was seen in all three groups. An average increase of 47.25% was obtained for the fricatives, with /h/ excluded. The increase was a further 20% higher (67.17%) for the affricates, with /ji, chi/ showing total correction, making this group the phoneme class that benefited the most from our application. The vowels, despite their small improvement, were the only group showing total correction for all their members.

    7. Conclusions

This study presents a novel means of exploiting spectral and prosodic transformations to enhance disordered speech. In spectral conversion, subsyllable-based GMMs were applied within the sinusoidal framework to modify the articulation-related parameters of speech. In prosodic conversion, we found that the tone structure of the F0 contour in Mandarin speech could be used to advantage in an orthogonal polynomial representation of pitch contours. The results also suggest a new approach to time-scale modification in which the initial part of a syllable is linearly normalized with a fixed factor, and then a DTW algorithm is used to control the time-varying scaling factor for the final part. Evaluations by objective tests and listening tests show that the proposed techniques can improve the intelligibility and naturalness of hearing-impaired Mandarin speech. Although fairly good performance was achieved in these experiments, more work is needed to further validate the proposed voice conversion system on a wider range of hearing-impaired speech corpora. For example, in extending the current system to continuous speech, more sophisticated tone models may be needed, as the tone patterns of syllables in continuous speech are subject to various modifications by sandhi rules.

    Fig. 7. Percent correct phoneme recognition scores for source and converted speech.



    Acknowledgement

This study was supported by the National Science Council, Taiwan, Republic of China, under Contracts NSC 93-2213-E-009-123 and NSC 93-2614-H-134-001-F20.

    References

Abe, M., Nakamura, S., Shikano, K., Kuwabara, H., 1988. Voice conversion through vector quantization. In: Proc. ICASSP 88, pp. 655-658.
Bi, N., Qi, Y., 1997. Application of speech conversion to alaryngeal speech enhancement. IEEE Trans. Speech Audio Process. 5, 97-105.
Chang, B.L., 2000. The perceptual analysis of speech intelligibility of students with hearing impairments. Bull. Special Education 18, 53-78.
Chen, S.H., Wang, Y.R., 1990. Vector quantization of pitch information in Mandarin speech. IEEE Trans. Communications 38, 1317-1320.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. 39, 1-38.
Hochberg, I., Levitt, H., Osberger, M.J., 1983. Speech of the Hearing Impaired: Research, Training, and Personnel Preparation. University Park Press, Maryland.
Johnson, R.A., Bhattacharyya, G.K., 1996. Statistics: Principles and Methods. John Wiley and Sons, New York.
Kain, A., Macon, M.W., 1998. Spectral voice conversion for text-to-speech synthesis. In: Proc. ICASSP 98, pp. 285-288.
Lee, L.S., 1997. Voice dictation of Mandarin Chinese. IEEE Signal Process. Mag., 63-101.
Lee, P.C., 1999. A study on acoustic characteristics of Mandarin affricates of hearing-impaired speech. Bull. Special Education Rehabil. 7, 79-112.
Lin, B.G., Huang, Y.C., 1997. An analysis of hearing-impaired students' Chinese language abilities and error patterns. Bull. Special Educ. 15, 109-129.
Linde, Y., Buzo, A., Gray, R.M., 1980. An algorithm for vector quantizer design. IEEE Trans. Communications 28, 84-95.
McAulay, R.J., Quatieri, T.F., 1995. Sinusoidal coding. In: Speech Coding and Synthesis. Elsevier, Amsterdam.
McGarr, N.S., Harris, K.S., 1983. Articulatory control in deaf speakers. In: Hochberg, I., Levitt, H., Osberger, M.J. (Eds.), Speech of the Hearing Impaired. University Park Press, Baltimore.
Monsen, R., 1978. Toward measuring how well hearing-impaired children speak. J. Speech Hearing Res. 21, 197-219.
Ohde, R.N., Sharf, D.J., 1992. Phonetic Analysis of Normal and Abnormal Speech. Merrill, New York.
Oppenheim, A.V., Schafer, R.W., 1989. Discrete-Time Signal Processing. Prentice Hall, New Jersey.
Osberger, M.J., Levitt, H., 1979. The effect of timing errors on the intelligibility of deaf children's speech. J. Acoust. Soc. Amer. 66 (5), 1316-1324.
Quatieri, T.F., McAulay, R.J., 1992. Shape invariant time-scale and pitch modification of speech. IEEE Trans. Signal Process. 40 (3), 497-510.
Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall, New Jersey.
Shen, X.S., Lin, M., 1991. A perceptual study of Mandarin tones 2 and 3. Lang. Speech 34 (2), 145-156.
Stylianou, Y., Cappe, O., Moulines, E., 1998. Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. 6, 131-142.
