PROGRESS IN SPEECH DIALOG

Blind Location of Phonetic Boundaries

R. Cmejla and P. Sovka

Department of Circuit Theory, Czech Technical University, Technicka 2, 166 27 Prague 6, Czech Republic

This contribution addresses the location of phonetic boundaries (LPB) for Czech phonetic categories. A novel method based on a discriminant function and Bayesian change-point detectors (BCD) is suggested and tested on synthetic and real speech; the consistency and strength of the method were confirmed by experiment. The LPB process for finding significant boundaries consists of four steps: pitch-synchronous segmentation, signal parameterization using Bayesian evidence with polynomial and autoregressive models, discriminant function evaluation, and BCD application. Correct boundary location is achieved on average in more than 75% of cases for continuous speech. A time-location error of less than 6 ms can be achieved for affricate/vowel, burst/vowel, and silence/most-phonetic-category boundaries.

INTRODUCTION

The detection, estimation and location of speech discontinuities (changepoints) has been intensively studied for several decades. Many methods for speech segmentation based on various characteristics have been developed. The most widely used segmentation principles are likelihood analysis [1], hidden Markov models (HMM) [2], the Bayesian approach with an HMM method [3], the combination of the Bayesian approach with rules [4], and discriminant analysis [5]. This contribution deals with the possibility of combining discriminant analysis with Bayesian evidence (BE) [6] and Bayesian changepoint detectors (BCD) [7]. The main motivation for this work was text-to-speech inventory acquisition. This approach requires the training of discriminant functions [9] for the chosen speech classes, but it is not as extensive as the model training required if HMMs or neural nets are used.

LOCATION OF BOUNDARIES

The LPB process consists of four steps: modified pitch-synchronous segmentation [11] is followed by signal parameterization using BE; then a suitable discriminant function is applied for segment concatenation; finally, two types of BCDs are used to locate the final boundaries.

Signal Parameterization

Signal parameters are estimated by an algorithm of Bayesian model order selection and show to what degree of accuracy one pitch period can be described using polynomial [8] or autoregressive models [6], [7], [8]. The parameter vector v of one pitch-period segment is then given by the BEs

v = [PM(0) PM(4) AR(1) AR(8)].   (1)

Discriminant Function and Concatenation

The discriminant function [9] is determined for each vector v in (1). Two possible classes must be used for segment concatenation. The decision strategy is to associate the current frame with the past frame if both neighboring frames belong to the same class. No association is made if this condition is not met.

Application of Bayesian Detectors

The theory of BCD is given in [6], [7] and its application to speech segmentation can be found in [10]. For this purpose two models of BCDs are used: BSCD - Bayesian step changepoint detector and BLCD - Bayesian linear changepoint detector [8].

The segment model for the BSCD is composed of two different constants in noise. The BSCD thus requires the signal to be modeled by jumps, and it is therefore very sensitive to dynamic changes in the signal. The instantaneous envelope computed by the Hilbert transform is used as its input. The segment model for the BLCD is composed of two different linear functions in noise. The input for the BLCD is the cumulative sum of the instantaneous frequency of the signal. This type of detector is therefore sensitive to frequency changes rather than amplitude changes.
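As an illustration, the core of a step-changepoint detector of this kind can be sketched as a posterior over the position of a single jump in a constant-plus-noise signal, with the two levels and the noise variance marginalized out. This is a minimal sketch in the spirit of the Bayesian changepoint framework of [6], [7], not the authors' implementation; in the paper the input would be the Hilbert-transform envelope rather than the raw signal, and normalization constants are dropped.

```python
import numpy as np

def bayesian_step_posterior(x):
    """Posterior over the changepoint position k for a signal modeled as
    two constants in Gaussian noise, marginalized over the levels and the
    noise variance (up to a normalization constant)."""
    x = np.asarray(x, float)
    N = len(x)
    logp = np.full(N, -np.inf)
    for k in range(2, N - 1):
        left, right = x[:k], x[k:]
        # residual sum of squares after fitting a constant on each side
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        logp[k] = -0.5 * np.log(k * (N - k)) - 0.5 * (N - 2) * np.log(rss)
    logp -= logp.max()
    p = np.exp(logp)
    return p / p.sum()

# synthetic step in noise: the posterior should peak near the true change at n=60
rng = np.random.default_rng(0)
x = np.concatenate([np.zeros(60), np.ones(40)]) + 0.1 * rng.normal(size=100)
print(int(np.argmax(bayesian_step_posterior(x))))
```

The BLCD would follow the same marginalization scheme with two straight-line segments instead of two constants.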


RESULTS AND CONCLUSIONS

Results of discriminant analysis for various speech classes are given in Tab. 1. The values were obtained from the analysis of 577 segments. The consecutive concatenation algorithm is not able to detect some contexts, e.g. CCCVCC - see the utterance „mzdyvz“ shown at the top of Fig. 1. Therefore the BSCD and BLCD have to be used to find these types of contexts.

Table 1. Recognition score for the discriminant function

Category                          Score [%]
Phonemes (20 classes)             44
Phonetic categories (8 classes)   73
V/U/S decision (3 classes)        96

[Figure 1 panels: signal and spectrogram (frequency vs. time) of the utterance M Z D Y V Z, followed by the BSCD and BLCD detector outputs vs. time.]

FIGURE 1. Example of blind boundary location using the BSCD and BLCD detectors.

Table 2. Boundary simulations for testing.

Boundary type                BSCD std [ms]   BLCD std [ms]
Vowel – Voiced Fricative     7.9             4.8
Vowel – Voiced Occlusion     7.0             6.2
Vowel – Vowel                7.9             7.4
Semivowel – Vowel            7.0             4.1
Nasal – Vowel                4.8             3.6
Silence – Vowel              6.6             3.4
Burst – Vowel                2.6             2.4
Burst – Semivowel            1.5             2.4
Affricate – Vowel            3.8             0.9
Voiceless Fricative – Vowel  2.6             0.8

The final boundaries given by the described algorithm are shown at the bottom of Fig. 1. One can see that the BLCD gives better results, because in the given context there are frequency changes rather than dynamic changes (see the spectrogram in Fig. 1). The standard deviations (std) of boundary location for various contexts are given in Tab. 2.

ACKNOWLEDGMENTS

This research was supported by the grant GACR 102/96/KO87 Theory and Application of Voice Communication in Czech.

REFERENCES

1. Jeong, Ch.G., Jeong, H., "Automatic phone segmentation and labeling of continuous speech", Speech Communication 20, 291-311 (1996).

2. Nakagawa, S., Hashimoto, Y., "A Method for Continuous Speech Segmentation using HMM", in 9th International Conference on Pattern Recognition, Roma, 1988, pp. 960-962.

3. Nakagawa, S., Hashimoto, Y., "Segmentation of Continuous Speech by HMM and Bayesian Probability", Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J72D-II {1}, 1-10 (1989).

4. Meng, H.M., Zue, V.W., "Signal representation comparison for phonetic classification", in International Conference on Acoustics, Speech and Signal Processing, Vol. 1, Toronto, 1991, pp. 285-288.

5. Micallef, P., "Automatic identification of phoneme boundaries using a mixed parameter model", in Eurospeech'97, ESCA, Rhodes, 1997, pp. 485-488.

6. Ruanaidh, J.K.O., Fitzgerald, W.J., Numerical Bayesian Methods Applied to Signal Processing, Springer-Verlag, Berlin Heidelberg New York, 1996.

7. Rayner, P.J.W., Fitzgerald, W.J., "The Bayesian Approach to Signal Modelling and Classification", in The 1st European Conference on Signal Analysis and Prediction, ECSAP'97, Prague, 1997, pp. 65-75.

8. Cmejla, R., Sovka, P., "Estimation of Boundaries between Speech Units using Bayesian Changepoint Detectors", in Lecture Notes in Computer Science: Text, Speech and Dialogue 2001, Springer-Verlag, Berlin Heidelberg New York, to appear.

9. Harrington, J., Cassidy, S., Techniques in Speech Acoustics, Kluwer Academic Publishers, Dordrecht, Boston, London, 1999.

10. Cmejla, R., Sovka, P., "The Use of Bayesian Detector in Signal Processing", in The IASTED Signal and Image Processing'99, IASTED/ACTA Press, Anaheim Calgary Zurich, 1999, pp. 76-80.

11. Prasek, P., "Speech Segmentation and Labelling", in POSTER'01, Czech Technical University, Prague, 2001, p. E24.


Concatenative Synthesis for a Group of Languages

S. Chowdhury, A. K. Datta and B. B. Chaudhuri

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute,

203, B. T. Road, Calcutta 700 035, India.

e-mail: [email protected]

The paper presents a text-to-speech synthesizer capable of producing speech in any language for which IPA symbols are available. The system uses concatenation of natural speech units, which are partnemes in the present case. Some of the major problems of the concatenative approach are addressed with particular reference to Standard Colloquial Bengali. A new approach of using random noise to remove the mechanical timbre generally associated with concatenative synthesis is presented. A time-domain method for generating CV, VC and VV transitions from terminal waveforms to match the respective steady states is also described. As it may be difficult to obtain all possible speech units for a single voice for all IPA symbols, a signal-domain approach for generating missing signal units from close ones, using a linear approximation of the non-linear dynamics of co-articulatory phenomena, is presented to overcome this. A method for controlling intonation and CV transitions, using a linear approximation of the non-linearity in the transition to allow control of the stress pattern, is discussed. In short, a signal-domain approach for introducing the necessary behavior of suprasegmentals to imitate natural speech in a dialect is fully discussed. There is no need to fall back on parametric synthesis in any form for installing suprasegmentals.

UNIVERSAL SPEECH SYNTHESIZER (USS)

FIGURE 1. Schematic Diagram of USS

The figure above shows the USS, which has two main blocks, A and B. A is the system part that differs from language to language, and block B is the International Phonetic Synthesizer (IPS). Block A consists of an input device, a text analyzer, and intonational and prosodic rule bases. The text analyzer in block A includes a phoneme parser, which uses either linguistic rules for phonology or a phonological dictionary, a syllable and word marker, and an NLP parser. Part B is the low-level synthesizer. Here speech is produced by taking the grapheme string (in IPA symbols) and the information for intonation and prosody as input.

Segment Dictionary

For this, non-sense words of the form CVCVCV and VVVV are used. The smallest speech units are partnemes (i.e. parts of phonemes). The elementary signal segments are the following: i) all CV transitions start from the VOT up to the beginning of the steady state of the vowel, where the coarticulation effect has just stabilized; ii) all VC transitions start from the end of the steady state of the vowel part up to the beginning of the next consonant; iii) all VV transitions start from the end of the steady state of the preceding vowel up to the beginning of that of the following vowel; iv) for vowels and nasal murmurs, only a single perceptual pitch period [1] from the steady state of each of them is kept, each period being assumed to start from the instant preceding the maximum excitation of the vocal tract; v) trills, aspiration, nasal murmurs, fricatives and laterals correspond to the complete consonantal parts of the phonemes of the largest possible duration; vi) plosion and affrication are taken to start from their release up to the



VOT. All the segments are amplitude-normalized by using different normalizing factors for different vowels and consonants. They are also normalized for pitch (F0).

F0 Modification

Let y(n) be the segment whose pitch has to be modified. First we mark the signal pitch-synchronously, starting from the epoch, which is determined as the minimum of the envelope [1]. Let x(n) be each sub-segment; x(n) has N sampling points and its period is T, and let the required period be T1. Now let xint(n) be the concatenation of x(n) and λx(n), where 0 < λ < 0.25, giving a signal whose period is 2T. We define a window w(n) between n = 1 and n = NT1/T on xint(n) as follows:

w(n) = a, [a > 0], for 1 ≤ n ≤ 0.8·NT1/T,

w(n) = a·(NT1 - nT) / (0.2·NT1), for 0.8·NT1/T ≤ n ≤ NT1/T.
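The window can be sketched in code as follows; note that the decay expression in the source is garbled, so this sketch assumes a straight-line fall from a to zero over the last 20% of the modified period length:

```python
import numpy as np

def trapezoid_window(N, T, T1, a=1.0):
    """Window w(n): constant a over the first 80% of the modified period
    length N*T1/T samples, then a linear fall to zero (reconstructed form)."""
    L = int(round(N * T1 / T))   # samples in the pitch-modified period
    knee = int(round(0.8 * L))   # end of the constant part
    w = np.full(L, float(a))
    w[knee:] = a * (1.0 - (np.arange(knee, L) - knee) / max(L - knee, 1))
    return w
```

Applying this window to xint(n) and keeping the first NT1/T samples yields a period of the required length T1.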

The resultant signal has the required pitch. It also preserves the full natural timbre of the original signal [2]. The intensity of the signal may be controlled by manipulating the value of 'a'.

The windowing creates prominent striations in the 3-D spectrogram, which produce a perceptible mechanical horn-like sound over and above the normal quality of the voice. This is because such concatenation produces an exactly periodic wave instead of a quasi-periodic one. A normal human voice is not perfectly periodic: two successive pitch cycles do not produce exactly the same pressure waves. The variations are random in nature and occur in pitch, amplitude and complexity, referred to as jitter, shimmer and complexity perturbations respectively. The perceptual manifestation of these is the quality of the sound. Optimum values of these give the produced sound its naturalness. Excess perturbation makes the quality of the sound rough or hoarse; absence of these perturbations again produces an unnatural horn-like sound. Addition of jitter and complexity perturbation almost removes the defect. A random variation of 2-3% in pitch is introduced for jitter; this essentially consists of adding a zero-mean random integer of the proper maximum amplitude to T1. Similarly, to introduce complexity perturbation, random numbers with zero mean and proper amplitude are added to successive sample values [3]. Finally a smoothing algorithm is applied to the signal. Let y(n) be the signal on which smoothing is applied; the ith sampling point [1 ≤ i ≤ N-3] of the smoothed signal is given by,

Y(i) = [y(i)+2y(i+1)+2y(i+2)+y(i+3)]/6.
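The four-point smoother, together with the jitter rule above, can be sketched as follows; the 2.5% jitter range and the integer rounding are illustrative choices within the 2-3% the text suggests:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_jitter(period_samples):
    """Perturb the target pitch period T1 by a zero-mean random amount of
    up to about 2.5% of its length, as suggested for restoring naturalness."""
    return period_samples + int(round(period_samples * 0.025 * rng.uniform(-1, 1)))

def smooth(y):
    """Weighted smoother from the text:
    Y(i) = (y(i) + 2y(i+1) + 2y(i+2) + y(i+3)) / 6, for 1 <= i <= N-3."""
    y = np.asarray(y, float)
    return (y[:-3] + 2 * y[1:-2] + 2 * y[2:-1] + y[3:]) / 6.0
```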

Transition Generation

For voiced speech we get a formant structure that is more or less fixed for each voiced phoneme. The structure depends upon the articulator positions during the utterance. When there are two adjacent phonemes in the utterance, there is a continuous change in the articulator positions going from the first phoneme to the second. This is revealed by the transition part of the spectrogram. One major problem in concatenative synthesis using partnemes is the spectral mismatch between the steady vowel and the vowel ends of CV or VC transitions. The problem is rectified by generating the transitions from the given terminal pitch periods at both ends. Though the transitory movement of the spectral structures, particularly the formants, is non-linear, a linear approximation may be tried. The basic principle is simply to mix the two terminal waveforms with suitable weights. Let Y1(n) and Y2(n) [1 ≤ n ≤ N] be the two given waveforms, where N is the total number of sampling points in each of the waveforms. Using these two we have to generate M waveforms in between. Let Xi [1 ≤ i ≤ M] be the ith waveform in between the two. The jth sample of Xi is given by,

Xi(j) = Y1(j)·(M - i)/M + Y2(j)·i/M,   [1 ≤ j ≤ N].

However, in preparing the signal dictionary for these transitions one need not recreate the whole transition. It is sufficient to recreate the last two to three pitch periods to obtain a match with the target vowels. Incidentally, this also opens up the possibility of creating such transitions when a transition is not available for some reason [3]. This assumes great significance in creating the complete signal segment dictionary, where all possible relevant coarticulations for all phonemes in all languages for a single voice are required.
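This linear mixing is straightforward to implement; the following sketch assumes both terminal waveforms have equal length N:

```python
import numpy as np

def generate_transition(y1, y2, M):
    """Generate M intermediate pitch-period waveforms X_i between the
    terminal waveforms y1 and y2 of equal length, using the linear mixing
    X_i(j) = Y1(j)*(M - i)/M + Y2(j)*i/M (a linear approximation of the
    non-linear formant movement)."""
    y1 = np.asarray(y1, float)
    y2 = np.asarray(y2, float)
    return [(y1 * (M - i) + y2 * i) / M for i in range(1, M + 1)]
```

At i = M the last generated waveform equals y2, so the sequence lands exactly on the target steady state.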

REFERENCES

[1] A. K. Datta, N. R. Ganguly, B. Mukherjee, "Intonation in segment-concatenated speech", in Proc. ESCA Workshop on Speech Synthesis, France, Sep 1990, pp. 153-156.

[2] T. K. Dan, B. Mukherjee, A. K. Datta, "Temporal approach for synthesis of singing (Soprano 1)", in SMAC 93, 1993, pp. 282-287.

[3] S. Chowdhury, A. K. Datta, B. B. Chaudhuri, "On the design of Universal Speech Synthesis in Indian Context", in Proc. IWSMSP 2000, India, Nov 2000.


Automatic Classification of Communicative Intentions in Voice Man-Machine Interaction

Mario Refice, Michelina Savino

Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, via Orabona 4 – 70125 Bari, ITALY

Speakers’ communicative intentions in voice interaction, like “holding the floor”, can be reliably detected by automatic systems when two acoustic cues, namely duration and F0 shape, are taken into account. In this paper, experimental results from an automatic classification task performed on spontaneous Map Task dialogues for the Bari, Neapolitan and Pisa varieties of Italian are presented and discussed.

INTRODUCTION

Improving the naturalness of existing voice-based dialogue systems implies the possibility of including pragmalinguistic knowledge in modelling both human-human and human-machine voice interaction. One aspect is represented by the communicative intentionality conveyed by non-lexical speech phenomena like disfluencies, and possibly also that conveyed, for instance in terms of discourse organisation, by the distribution and localisation of prosodic boundaries in spoken discourse. A typical disfluency event is represented by filled pauses, which speakers use as a planning strategy (i.e. thinking of what they want to say next) and which therefore signal to the interlocutor the intention of “holding the floor” at that moment of the interaction. Acoustically, filled pauses are realised by inserting a (normally central) vowel of varying length, optionally accompanied by nasalisation. In Italian, where the great majority of words end in a vowel, speakers achieve the same effect by simply prolonging a word-ending vowel. Another characteristic of such Italian “word-final lengthening” filled pauses is that the F0 shape is maintained constantly level throughout the vowel duration [1]. Background statistical analysis carried out on spontaneous Map Task based spoken material relating to the Bari variety of Italian had shown that duration and F0 shape can be considered reliable acoustic parameters for discriminating among the following 3 classes [1, 2]:

1. stressed vowels in word (but not phrase) final open syllable, taken as the “default” category (henceforth DF);

2. stressed vowels in word and phrase final open syllables (henceforth PF);

3. stressed and unstressed vowels in word final open syllables which are characterised by a “word-final lengthening” filled pause (henceforth FP).

This paper aims at assessing the reliability of the two mentioned statistical parameters by running an automatic classification task on comparable labelled data coming from Map Task dialogues in the Bari (6 speakers), Naples (8 speakers) and Pisa (8 speakers) varieties of Italian, for a total of about 80 min. of speech (part of the AVIP corpus). At this stage we are mainly interested in verifying the reliability of duration and F0 cues for automatically detecting “word-final lengthening” filled pauses in Italian spontaneous speech.

CLASSIFICATION

In quantitative terms, the above mentioned 3 classes are statistically defined, with respect to the duration parameter, as shown in Table 1, where mean values for each vowel type (in stressed position), along with the related standard deviation, are reported (data for vowel /u/ in FP was not available, since /u/-ending words are not very common in the Italian language).

Table 1. Statistical duration intervals for categories DF, PF and FP (derived from Bari Italian data [1,2]), in msec.

Word-ending vowel   durDF (mean ± SD)   durPF (mean ± SD)   durFP (mean ± SD)
/a/                 110.1 ± 16.2        177.0 ± 39.2        373.8 ± 105.1
/E/                 77.5 ± 11.9         174.0 ± 44.2        369.8 ± 103.1
/i/                 101.8 ± 12.2        163.0 ± 26.9        326.9 ± 113.7
/O/                 98.1 ± 10.7         170.6 ± 31.4        374.5 ± 92.9
/u/                 75.8 ± 15.6         175.0 ± 46.2        --


Vowel duration automatically determines a specific class, provided the following additional criteria are applied: a) vowel duration values outside the whole range are assigned to the closest boundary class; b) vowel duration values not belonging to one of the 3 determined intervals are assigned to the closest one. Table 2 shows the percentages of FPs correctly classified using the above mentioned duration criteria for the 3 Italian varieties under examination.

Table 2. FP cases correctly classified in all Italian data based on the duration parameter

Variety              classified as FP   classified as PF
Bari Italian         77%                23%
Neapolitan Italian   71%                29%
Pisa Italian         76%                24%

It is worth noting that correctness scores are very similar among the 3 varieties, suggesting that the duration intervals determined from Bari Italian data can also be considered variety-independent. It can also be noted that the incorrect percentages all refer to erroneous attributions to the PF category only (and not to the DF one). Since we know from statistical observations that F0 shape is another acoustic cue for discriminating between the FP and PF classes [1], a further classification task has been performed, this time also taking the F0 parameter into account. In the algorithm, each time an input vowel is not an FP candidate (with respect to its duration value), the F0 stylisation is calculated, and if the F0 shape is classified as “level”, then the pragmalinguistic label “(word-final lengthening) FP” is assigned to the word. By applying this rule, the percentage of FP classification correctness rises to 100% for all 3 varieties. Figure 1 shows a PF candidate (according to duration) where F0 stylisation is performed and the “level” shape is attributed: therefore, the final label “FP” is assigned.
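The two-step decision can be sketched as follows. The class means are taken from Table 1 for vowel /a/; the nearest-mean assignment stands in for the statistically determined duration intervals, and the threshold on the F0 stylisation slope used to call a shape “level” is a hypothetical choice, not a value from the paper:

```python
# Class means (ms) for word-ending vowel /a/, from Table 1.
MEANS_A = {"DF": 110.1, "PF": 177.0, "FP": 373.8}

def classify_vowel(duration_ms, f0_slope=None, level_thresh=0.01):
    """Two-step DF/PF/FP decision: duration first, then the F0-shape rule."""
    # assign to the nearest class mean (criterion b in the text)
    label = min(MEANS_A, key=lambda c: abs(MEANS_A[c] - duration_ms))
    # a non-FP duration candidate with a "level" F0 shape is relabeled FP
    if label != "FP" and f0_slope is not None and abs(f0_slope) < level_thresh:
        label = "FP"
    return label
```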

FIGURE 1. PF candidate’s F0 shape (vowel /a/)

A similar classification task has been run for assessing the statistical reliability of duration and F0 in detecting PF cases, which represent a particular

kind of prosodic boundary characterised by “tonal crowding”, i.e. when a complex tonal sequence (pitch accent + boundary tone) has to be realised on one syllable (for realisation strategies in Bari Italian see for example [3]). This task has been carried out for Bari Italian data only, since for the remaining ones prosodically labelled data were not available at the time of the experiment. Results based on the duration parameter only show that 67% of cases are correctly classified as PFs, 15% of cases are assigned the DF label, 16% the FP label, and 2% of cases are labelled “unclassified”, as they are all cases of /u/-ending words, for which – as shown in Table 1 – only incomplete statistical references were available. When F0 shape is included as an additional parameter, the percentage of correctness improves up to 83%, since all FP candidates (according to vowel duration only) are correctly assigned the PF label after the F0 stylisation procedure.

CONCLUSIONS

Automatic detection of “word-final lengthening” filled pauses in Bari, Neapolitan and Pisa Italian spontaneous speech can be reliably achieved by a two-step process: 1) using the duration parameter for selecting out “default” lexical speech events; 2) using the F0 shape cue for discriminating between FPs and PF prosodic boundaries. This process has proven to be both speaker- and variety-independent for the three mentioned Italian varieties. Things are more complicated in the case of PF prosodic boundary detection. Preliminary results show that the variability typical of prosodic phenomena, especially when the stressed vowel is also phrase-final, is such that duration alone cannot be assumed to be a sufficiently reliable parameter. By taking the F0 shape parameter into account, correctness improves, but it does not reach 100%.

REFERENCES

1. M. Savino and M. Refice, “Acoustic Cues for Classifying Communicative Intentions in Dialogue Systems”, in Text, Speech and Dialogue (TSD 2000), edited by P. Sojka, I. Kopecek and K. Pala, Berlin: Springer-Verlag, 2000, pp. 421-426.

2. M. Refice and M. Savino, “Identifying Communicative Functions in Dialogue Systems”, to appear in SCI2001 Proceedings, Orlando (FL), 22-25 July 2001.

3. M. Grice, M. D’Imperio, M. Savino, C. Avesani, “Towards a strategy for ToBI labelling varieties of Italian”, to appear in Prosodic Typology and Transcription: a Unified Approach, edited by Sun-Ah Jun, Oxford: Oxford University Press.

[Figure 1 plot: F0 (Hz) vs. time (msec); the F0 stylisation yields corrcoeff = 0.85, slope = -0.0248.]


Fuzzy Similarity Measures, Alternative to Improve Discriminative Capabilities of HMM Speech Recognizers

I. Gavat, Z. Valsan, B. Sabac, O. Grigore, D. Militaru

Department of Electronics and Telecommunications, Polytechnic University of Bucharest, Splaiul Independentei 113, 77206 Bucharest, Romania

Hidden Markov Model (HMM) based classifiers represent today the most successful technology in speech recognition tasks. While it has the important advantage of rich mathematical support, the method has as its main drawback the low discriminative capacity of the models. To improve this performance, we use in this paper generalized HMMs based on a fuzzy similarity measure instead of the usual probabilistic one. In this way, improvements of several percent in recognition rate compared with the classical case can be obtained.

INTRODUCTION

Speech recognition is a research domain with a long history, but despite this fact it is still open to new investigations and to answers to not yet finally solved questions. This situation can be explained by the difficulty of the speech recognition task, which rests on the fact that speech is a human product, with a great degree of intentionality in content and a great variability in its formal manifestation as an acoustic signal, the latter being the basic element for the first processing level in all recognition approaches. To cope with these difficulties, some paradigms are applied for classification, like statistical modeling based on HMMs or the connectionist paradigm based on neural networks. The learning capabilities of statistical and neural models are very important, giving the classifier the possibility to recognize new, unknown patterns with the experience obtained by training. Usually, a classifier decides whether a pattern belongs to a certain class or not, taking a hard decision. The introduction of fuzzy sets allows so-called fuzzy decisions, which are often more suitable for the recognition of patterns produced by human beings, for example speech. Applying fuzzy decisions, we have obtained improved performances in classical algorithms (k-NN, ISODATA) and in neural recognizers realized with multilayer perceptrons or self-organizing maps [1]. To apply fuzzy concepts to HMMs was the next natural step to be followed in our studies. On the other hand, to improve the discriminative capacities of HMMs we have successfully applied neurostatistic hybrid structures [2], so the study of a fuzzy-statistical hybrid seems very attractive. Such a structure, proposed in [3] for handwritten characters, was applied here in a speech recognition task.

FUZZY HMM’s

The additivity hypothesis of the probability measure is not well-suited for modeling systems that manifest a high degree of interdependency among sources of information. This is the case for speech, for which successive parameter vectors, which can well be regarded as interdependent information sources, can be better used in recognition processes by defining a fuzzy measure, a concept introduced by Sugeno [4]. Based on fuzzy integrals [5], fuzzy measures have as their key property monotonicity with respect to set inclusion, far weaker than the usual additivity property of probability measures. The generalized model λ = (π, A, B) can be characterized by the same parameters [3] as the classical, well-known model. The major difference in the fuzzy variant is the interpretation of the probability densities of the classical HMM as fuzzy densities. The succession of parameter vectors, called the observation sequence O, produces the state sequence S of the model and, visiting for example at moment t+1 the state q_{t+1} = S_j, the symbol b_j is generated. The corresponding symbol fuzzy density b_j(O_t) measures the grade of certainty of the statement that we observed O_t given that we are visiting state S_j.

Classification Step

To perform classification tasks, the fuzzy similarity measure must be calculated. Based on the fuzzy forward and backward variables, a fuzzy Viterbi


algorithm is proposed in [3] for the case of the Choquet integral with respect to a fuzzy measure, with multiplication as the intersection operator. The fuzzy formulation of the forward variable α brings an important relaxation of the assumption of statistical independence. The joint fuzzy measure over the observations O1, O2, ..., Ot and the state y_j can be written as a combination of two measures, defined on O1, O2, ..., Ot and on the states respectively, with no assumption about the decomposition of this measure being necessary, where Y = {y1, y2, ..., yN} represents the states at time t+1 (Ω is the space of observation vectors).

For the standard HMM, the joint measure P(O1, O2, ..., Ot+1, q_{t+1} = S_j) must be written as the product P(O1, O2, ..., Ot+1) · P(q_{t+1} = S_j), so that two assumptions of statistical independence must be made: the observation at time t+1, O_{t+1}, is independent of the previous observations O1, O2, ..., Ot, and the states at time t+1 are independent of the same observations. These conditions find a poor match in the case of speech signals, and we therefore hope for improvements due to the relaxation permitted by the fuzzy measure.

Training Step

Training of the generalized model can be performed with the re-estimation formulas also given in [3] for the Choquet integral. For each vowel we have trained, with the re-estimation formulas, the corresponding generalized models (GHMMs), with 3-5 states, analogous to the classical case. The training set consists of 30 utterances, and the test set of 200 utterances. After the training, we have calculated the fuzzy measure of the observation sequence O given each model with the fuzzy Viterbi algorithm, and made the decisions for recognition in the same manner as for the classical HMM: the correct decision corresponds to the model for which the calculated measure is a maximum.
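For comparison, the classical decision rule that the fuzzy variant generalizes can be sketched as a Viterbi best-path score computed per model; the fuzzy version would replace the log-probabilities with fuzzy densities combined through a Choquet integral, which is not reproduced here:

```python
import numpy as np

def viterbi_score(obs_loglik, log_pi, log_A):
    """Best-path log score of an observation sequence under a classical HMM.
    obs_loglik[t, j] = log b_j(O_t); log_pi[j] and log_A[i, j] are the log
    initial and transition probabilities. The recognition decision picks the
    model (vowel) whose score is maximal."""
    delta = log_pi + obs_loglik[0]
    for t in range(1, obs_loglik.shape[0]):
        # best predecessor for each state, then emit O_t
        delta = np.max(delta[:, None] + log_A, axis=0) + obs_loglik[t]
    return float(delta.max())

# toy two-state, two-frame example
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
obs_loglik = np.log([[0.5, 0.1], [0.2, 0.9]])
print(viterbi_score(obs_loglik, log_pi, log_A))
```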

EXPERIMENTAL RESULTS

Starting from the fact that in the Romanian language more than 60% of phonemes are vowels, we have made our tests on a database containing the Romanian vowels a, e, i, o, u, ă, â, uttered by eight speakers, five males and three females, in a great variety of contexts. Each vowel is available in more than 500 variants. The vowels are extracted from noisy telephone speech.

The parameterization is realized with mel-cepstral coefficients and the first and second order differences of these coefficients, deduced from homomorphic filtering. The results obtained in the vowel recognition test are given in Table 1, for the classical and the generalized HMM.

Table 1. Error rates (%) for generalized and classical HMMs.

Vowel    GHMM   Classical HMM
a        5.1    6.9
e        2.4    4.8
i        3.8    7.3
o        2.5    5.9
u        0.7    3.9
ă        3.3    7.1
â        4.1    6.6
Global   2.9    6.1

A decrease of nearly 3% in the error rate is achieved by adopting the fuzzy measure instead of the probabilistic one.

CONCLUSIONS

In this paper we have presented a way to improve the discriminative properties of HMMs, manifested through a decrease of more than 3% in the global error rate in a Romanian vowel recognition task. Adopting the fuzzy similarity measure for the generalized GHMMs, the obtained improvements are a consequence of avoiding the conditional independence assumption. The results are comparable with the performances achieved with a hybrid neurostatistical structure, so computational complexity will play a role in choosing the solution.

REFERENCES

1. Inge Gavat, O. Grigore, M. Zirra, Oana Cula, "Fuzzy Variants of Hard Classification Rules", in Proc. NAFIPS'97, pp. 172-176.

2. Inge Gavat, M. Zirra, Oana Cula, "Hybrid ANN-HMM Speech Recognition Methods", in Proc. Communications'96, pp. 514-519.

3. M. Mohamed, P. Gader, "Generalized Hidden Markov Models", IEEE Trans. on Fuzzy Systems, Feb 2000, pp. 67-93.

4. Z. Wang, G. Klir, Fuzzy Measure Theory, New York, Plenum, 1992.

5. M. Grabisch, "Fuzzy Integrals as a Generalized Class of Order Filters", in Proc. ESSRS'94, pp. 128-136.