
Fuzzy Approaches to

Speech and Speaker Recognition

A thesis submitted for the degree

of Doctor of Philosophy of

the University of Canberra

Dat Tat Tran

May 2000


Summary of Thesis

Statistical pattern recognition is the most successful approach to automatic speech and

speaker recognition (ASASR). Of all the statistical pattern recognition techniques, the hid-

den Markov model (HMM) is the most important. The Gaussian mixture model (GMM)

and vector quantisation (VQ) are also effective techniques, especially for speaker recognition

and, in conjunction with HMMs, for speech recognition.

However, the performance of these techniques degrades rapidly in the context of insuf-

ficient training data and in the presence of noise or distortion. Fuzzy approaches with their

adjustable parameters can reduce such degradation.

Fuzzy set theory is one of the most successful approaches in pattern recognition, where,

based on the idea of a fuzzy membership function, fuzzy C-means (FCM) clustering and

noise clustering (NC) are the most important techniques.

To establish fuzzy approaches to ASASR, the following basic problems are solved. First,

a time-dependent fuzzy membership function is defined for the HMM. Second, a general

distance is proposed to obtain a relationship between modelling and clustering techniques.

Third, fuzzy entropy (FE) clustering is proposed to relate fuzzy models to statistical mod-

els. Finally, fuzzy membership functions are proposed as discriminant functions in decision

making.

The following models are proposed: 1) the FE-HMM, NC-FE-HMM, FE-GMM, NC-FE-

GMM, FE-VQ and NC-FE-VQ in the FE approach, 2) the FCM-HMM, NC-FCM-HMM,

FCM-GMM and NC-FCM-GMM in the FCM approach, and 3) the hard HMM and GMM

as the special models of both FE and FCM approaches. Finally, a fuzzy approach to speaker

verification and a further extension using possibility theory are also proposed.

The evaluation experiments performed on the TI46, ANDOSL and YOHO corpora show

better results for all of the proposed techniques in comparison with the non-fuzzy baseline

techniques.


Certificate of Authorship of Thesis

Except as specially indicated in footnotes, quotations and the bibliography, I certify that I

am the sole author of the thesis submitted today entitled—

Fuzzy Approaches to Speech and Speaker Recognition

in terms of the Statement of Requirements for a Thesis issued by the University Higher

Degrees Committee.

Papers containing some of the material of the thesis have been published as Tran [1999,

1998], Tran and Wagner [2000a-h, 1999a-e, 1998], and Tran et al. [2000a,b, 1999a-d, 1998a-

d]. For all of the above joint papers, I certify that the contributions of my co-authors,

Michael Wagner and Tu Van Le, were solely made in their respective roles of primary and

secondary thesis supervisors.

For the joint papers Tran et al. [2000a,b, 1999c, 1998c], my co-author Tuan Pham con-

tributed comments on my literature review and discussions on the theoretical development

while Michael Wagner contributed as a thesis supervisor. For the joint papers Tran et al.

[1999a,b,d], my co-author Tongtao Zheng contributed discussions on the experimental re-

sults while Michael Wagner contributed as a thesis supervisor. For the joint paper Tran et

al. [1998d], my co-author Minh Do contributed part of the programming and some discus-

sions on the theoretical development and the experimental results while Michael Wagner

and Tu Van Le contributed as thesis supervisors.


Acknowledgements

First and foremost, I would like to thank my primary supervisor, Professor Michael Wagner,

for his enormous support and encouragement during my research study at the University

of Canberra. I am also thankful for the advice and guidance he gave me in spite of his

busy schedule, for helping me organise the thesis draft and refine its contents, and for his

patience in answering my inquiries.

I would like to thank my secondary supervisor, Associate Professor Tu Van Le, for his teaching and for the support he gave me to become a PhD candidate at the University of Canberra.

I would also like to thank staff members as well as research students at the University

of Canberra, for their support and for maintaining the excellent computing facilities which

were crucial for carrying out my research.

I am grateful for the University of Canberra Research Scholarship, which enabled me

to undertake this research in the period February 1997 - February 2000. I would also like

to thank the School of Computing and Division of Management and Technology which

provided funding for attending several conferences.

I would like to express my gratitude to all my lecturers and colleagues at the Depart-

ment of Theoretical Physics, Faculty of Physics, and Faculty of Mathematics, University

of Ho Chi Minh City, Viet Nam.

Many special thanks to my family members. I am indebted to my parents for the

sacrifices they have made for me. I wish to thank my brothers-in-law and sisters-in-law as

well as my wife, Phuong Dao, my son, Nguyen Tran, and my daughter, Thao Tran, for the support they have given me throughout the years of my thesis research.

Finally, this work is dedicated to the memory of my previous supervisor, Professor Phi Van Duong, a scientist of the Abdus Salam International Centre for Theoretical Physics (ICTP), Trieste, Italy. Special thanks for his teaching, love, advice, guidance, support and encouragement throughout 12 years at the University of Ho Chi Minh City, Viet Nam.


Contents

Summary of Thesis ii

Acknowledgements iv

List of Abbreviations xv

1 Introduction 1

1.1 Current Approaches to Speech and Speaker Recognition . . . . . . . . 1

1.1.1 Statistical Pattern Recognition Approach . . . . . . . . . . . . 2

1.1.2 Modelling Techniques: HMM, GMM and VQ . . . . . . . . . . 2

1.2 Fuzzy Set Theory-Based Approach . . . . . . . . . . . . . . . . . . . 3

1.2.1 The Membership Function . . . . . . . . . . . . . . . . . . . . 3

1.2.2 Clustering Techniques: FCM and NC . . . . . . . . . . . . . . 4

1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Contributions of This Thesis . . . . . . . . . . . . . . . . . . . . . . . 7

1.4.1 Fuzzy Entropy Models . . . . . . . . . . . . . . . . . . . . . . 7

1.4.2 Fuzzy C-Means Models . . . . . . . . . . . . . . . . . . . . . . 8

1.4.3 Hard Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4.4 A Fuzzy Approach to Speaker Verification . . . . . . . . . . . 8

1.4.5 Evaluation Experiments and Results . . . . . . . . . . . . . . 9

1.5 Extensions of This Thesis . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Literature Review 10

2.1 Speech Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.1 Speech Sounds . . . . . . . . . . . . . . . . . . . . . . . . . . 11


2.1.2 Speech Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.3 Speech Processing . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Speech and Speaker Recognition . . . . . . . . . . . . . . . . . . . . . 15

2.2.1 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.2 Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3 Statistical Modelling Techniques . . . . . . . . . . . . . . . . . . . . . 20

2.3.1 Maximum A Posteriori Rule . . . . . . . . . . . . . . . . . . . 20

2.3.2 Distribution Estimation Problem . . . . . . . . . . . . . . . . 21

2.3.3 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . 22

2.3.4 Hidden Markov Modelling . . . . . . . . . . . . . . . . . . . . 24

Parameters and Types of HMMs . . . . . . . . . . . . . . . . 24

Three Basic Problems for HMMs . . . . . . . . . . . . . . . . 26

2.3.5 Gaussian Mixture Modelling . . . . . . . . . . . . . . . . . . . 29

2.3.6 Vector Quantisation Modelling . . . . . . . . . . . . . . . . . . 31

2.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4 Fuzzy Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . 34

2.4.1 Fuzzy Sets and the Membership Function . . . . . . . . . . . . 34

2.4.2 Maximum Membership Rule . . . . . . . . . . . . . . . . . . . 35

2.4.3 Membership Estimation Problem . . . . . . . . . . . . . . . . 35

2.4.4 Pattern Recognition and Cluster Analysis . . . . . . . . . . . 35

2.4.5 Hard C-Means Clustering . . . . . . . . . . . . . . . . . . . . 36

2.4.6 Fuzzy C-Means Clustering . . . . . . . . . . . . . . . . . . . . 37

Fuzzy C-Means Algorithm . . . . . . . . . . . . . . . . . . . . 38

Gustafson-Kessel Algorithm . . . . . . . . . . . . . . . . . . . 38

Gath-Geva Algorithm . . . . . . . . . . . . . . . . . . . . . . 39

2.4.7 Noise Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.5 Fuzzy Approaches in the Literature . . . . . . . . . . . . . . . . . . . 42

2.5.1 Maximum Membership Rule-Based Approach . . . . . . . . . 42

2.5.2 FCM-Based Approach . . . . . . . . . . . . . . . . . . . . . . 43


3 Fuzzy Entropy Models 46

3.1 Fuzzy Entropy Clustering . . . . . . . . . . . . . . . . . . . . . . . . 46

3.2 Modelling and Clustering Problems . . . . . . . . . . . . . . . . . . . 50

3.3 Maximum Fuzzy Likelihood Estimation . . . . . . . . . . . . . . . . . 51

3.4 Fuzzy Entropy Hidden Markov Models . . . . . . . . . . . . . . . . . 52

3.4.1 Fuzzy Membership Functions . . . . . . . . . . . . . . . . . . 52

3.4.2 Fuzzy Entropy Discrete HMM . . . . . . . . . . . . . . . . . . 54

3.4.3 Fuzzy Entropy Continuous HMM . . . . . . . . . . . . . . . . 57

3.4.4 Noise Clustering Approach . . . . . . . . . . . . . . . . . . . . 58

3.5 Fuzzy Entropy Gaussian Mixture Models . . . . . . . . . . . . . . . . 59

3.5.1 Fuzzy Entropy GMM . . . . . . . . . . . . . . . . . . . . . . . 59

3.5.2 Noise Clustering Approach . . . . . . . . . . . . . . . . . . . . 60

3.6 Fuzzy Entropy Vector Quantisation . . . . . . . . . . . . . . . . . . . 61

3.7 A Comparison Between Conventional and Fuzzy Entropy Models . . . 61

3.8 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 64

4 Fuzzy C-Means Models 66

4.1 Minimum Fuzzy Squared-Error Estimation . . . . . . . . . . . . . . . 66

4.2 Fuzzy C-Means Hidden Markov Models . . . . . . . . . . . . . . . . . 67

4.2.1 FCM Discrete HMM . . . . . . . . . . . . . . . . . . . . . . . 67

4.2.2 FCM Continuous HMM . . . . . . . . . . . . . . . . . . . . . 68

4.2.3 Noise Clustering Approach . . . . . . . . . . . . . . . . . . . . 69

4.3 Fuzzy C-Means Gaussian Mixture Models . . . . . . . . . . . . . . . 70

4.3.1 Fuzzy C-Means GMM . . . . . . . . . . . . . . . . . . . . . . 70

4.3.2 Noise Clustering Approach . . . . . . . . . . . . . . . . . . . . 70

4.4 Fuzzy C-Means Vector Quantisation . . . . . . . . . . . . . . . . . . . 71

4.5 Comparison Between FCM and FE Models . . . . . . . . . . . . . . . 71

4.6 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 75

5 Hard Models 76

5.1 From Fuzzy To Hard Models . . . . . . . . . . . . . . . . . . . . . . . 76

5.2 Hard Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . 78

5.2.1 Hard Discrete HMM . . . . . . . . . . . . . . . . . . . . . . . 80


5.2.2 Hard Continuous HMM . . . . . . . . . . . . . . . . . . . . . 80

5.3 Hard Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . 81

5.4 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 83

6 A Fuzzy Approach to Speaker Verification 85

6.1 A Speaker Verification System . . . . . . . . . . . . . . . . . . . . . . 86

6.2 Current Normalisation Methods . . . . . . . . . . . . . . . . . . . . . 87

6.3 Proposed Normalisation Methods . . . . . . . . . . . . . . . . . . . . 89

6.4 The Likelihood Transformation . . . . . . . . . . . . . . . . . . . . . 92

6.5 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 96

7 Evaluation Experiments and Results 97

7.1 Database Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1.1 The TI46 database . . . . . . . . . . . . . . . . . . . . . . . . 97

7.1.2 The ANDOSL Database . . . . . . . . . . . . . . . . . . . . . 98

7.1.3 The YOHO Database . . . . . . . . . . . . . . . . . . . . . . . 99

7.2 Speech Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.3 Algorithmic Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7.3.1 Initialisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7.3.2 Constraints on parameters during training . . . . . . . . . . . 100

7.4 Isolated Word Recognition . . . . . . . . . . . . . . . . . . . . . . . . 102

7.4.1 E set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.4.2 10-Digit&10-Command Set Results . . . . . . . . . . . . . . . 105

7.4.3 46-Word Set Results . . . . . . . . . . . . . . . . . . . . . . . 106

7.5 Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.5.1 TI46 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7.5.2 ANDOSL Results . . . . . . . . . . . . . . . . . . . . . . . . . 110

7.5.3 YOHO Results . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.6 Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.6.1 TI46 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.6.2 ANDOSL Results . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.6.3 YOHO Results . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.7 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 115


8 Extensions of the Thesis 123

8.1 Possibility Theory-Based Approach . . . . . . . . . . . . . . . . . . . 123

8.1.1 Possibility Theory . . . . . . . . . . . . . . . . . . . . . . . . . 123

8.1.2 Possibility Distributions . . . . . . . . . . . . . . . . . . . . . 125

8.1.3 Maximum Possibility Rule . . . . . . . . . . . . . . . . . . . . 125

8.2 Possibilistic C-Means Approach . . . . . . . . . . . . . . . . . . . . . 126

8.2.1 Possibilistic C-Means Clustering . . . . . . . . . . . . . . . . . 126

8.2.2 PCM Approach to FE-HMMs . . . . . . . . . . . . . . . . . . 127

8.2.3 PCM Approach to FCM-HMMs . . . . . . . . . . . . . . . . . 128

8.2.4 PCM Approach to FE-GMMs . . . . . . . . . . . . . . . . . . 129

8.2.5 PCM Approach to FCM-GMMs . . . . . . . . . . . . . . . . . 130

8.2.6 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . 130

9 Conclusions and Future Research 133

9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

9.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . 135

Bibliography 136

A List of Publications 152


List of Figures

2.1 The speech signal of the utterance “one” (a) in the long period of time

from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from

t = 0.4 sec to t = 0.42 sec. . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Block diagram of LPC front-end processor for speech and speaker

recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 An N -state left-to-right HMM with Δi = 1 . . . . . . . . . . . . . . . 28

2.4 Relationships between HMM, GMM, and VQ techniques . . . . . . . 33

2.5 A statistical classifier for isolated word recognition and speaker iden-

tification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.6 Clustering techniques and their extended versions . . . . . . . . . . . 41

3.1 Generating 3 clusters with different values of n: hard clustering as

n → 0, clusters increase their overlap with increasing n > 0, and are

identical to a single cluster as n → ∞ . . . . . . . . . . . . . . . . . . 49

3.2 States at each time t = 1, . . . , T are regarded as time-dependent fuzzy

sets. There are N ×T fuzzy states connected by arrows into NT fuzzy

state sequences in the fuzzy HMM. . . . . . . . . . . . . . . . . . . . 53

3.3 The observation sequence O belongs to fuzzy state sequences being in

fuzzy state i at time t and fuzzy state j at time t + 1. . . . . . . . . . 54

3.4 The observation sequence X belongs to fuzzy state j and fuzzy mixture

k at time t in the fuzzy continuous HMM. . . . . . . . . . . . . . . . 55

3.5 From fuzzy entropy models to conventional models . . . . . . . . . . 62

3.6 The membership function uit with different values of the degree of fuzzy

entropy n versus the distance dit between vector xt and cluster i . . . 64

3.7 Fuzzy entropy models for speech and speaker recognition . . . . . . . 65


4.1 The relationship between FE model groups versus the degree of fuzzy en-

tropy n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 The relationship between FCM model groups versus the degree of fuzziness m 72

4.3 The FCM membership function uit with different values of the degree

of fuzziness m versus the distance dit between vector xt and cluster i 73

4.4 Curves representing the functions used in the FE and FCM member-

ships, where x = d2it, m = 2 and n = 1 . . . . . . . . . . . . . . . . . 74

4.5 Fuzzy C-means models for speech and speaker recognition . . . . . . 75

5.1 From hard VQ to fuzzy VQ: an additional fuzzy entropy term for fuzzy

entropy VQ, and a weighting exponent m > 1 on each uit for fuzzy C-

means VQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ

or using the minimum distance rule to compute uit directly. . . . . . . 77

5.3 Mutual relations between fuzzy and hard models . . . . . . . . . . . . 78

5.4 Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy

Bakis HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.5 A possible single state sequence in a 3-state hard HMM . . . . . . . . 79

5.6 A mixture of three Gaussian distributions in the GMM or the fuzzy

GMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.7 A set of three non overlapping Gaussian distributions in the hard GMM. 82

5.8 Relationships between hard models . . . . . . . . . . . . . . . . . . . 84

6.1 A Typical Speaker Verification System . . . . . . . . . . . . . . . . . 86

6.2 The transformation T where T (P )/P increases and T (P ) is non posi-

tive for 0 ≤ P ≤ 1: values of 4 ratios at A, B, C, and D are moved to

those at A’, B’, C’, and D’ . . . . . . . . . . . . . . . . . . . . . . . . 93

6.3 Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The

EER is 6.67% for Fig. 6.3a and is 5.90% for Fig. 6.3b. . . . . . . . . 94

7.1 Isolated word recognition error (%) versus the number of state N for

the digit-set vocabulary, using left-to-right DHMMs, codebook size K

= 16, TI46 database . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


7.2 Speaker Identification Error (%) versus the degree of fuzziness m using

FCM-VQ speaker models, codebook size K = 16, TI46 corpus . . . . 102

7.3 Isolated word recognition error (%) versus the degree of fuzzy entropy n

for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, code-

book size of 16, TI46 corpus . . . . . . . . . . . . . . . . . . . . . . . 103

7.4 Speaker identification error rate (%) versus the number of mixtures for

16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-

GMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7.5 Speaker identification error rate (%) versus the codebook size for 16

speakers, using VQ, FE-VQ and NC-FE-VQ codebooks . . . . . . . . 110

7.6 EERs (%) for GMM-based speaker verification performed on 16 speak-

ers, using GMMs, FCM-GMMs and NC-FCM-GMMs. . . . . . . . . . 112

7.7 EERs (%) for VQ-based speaker verification performed on 16 speakers,

using VQ, FE-VQ and NC-FE-VQ codebooks . . . . . . . . . . . . . 113

8.1 PCM Clustering in Clustering Techniques . . . . . . . . . . . . . . . 131

8.2 PCM Approach to FE models for speech and speaker recognition . . 131

8.3 PCM Approach to FCM models for speech and speaker recognition . 132


List of Tables

3.1 An example of memberships for the GMM and the FE-GMM . . . . . 64

6.1 The likelihood values for 4 input utterances X1−X4 against the claimed speaker λ0 and 3 impostors λ1−λ3, where X1^c, X2^c are from the claimed speaker and X3^i, X4^i are from impostors . . . . . . . . . . . . . . . . . 90

6.2 Scores of 4 utterances using L3(X) and L8(X) . . . . . . . . . . . . . 91

6.3 Scores of 4 utterances using L3nc(X) and L8nc(X) . . . . . . . . . . . 92

7.1 Isolated word recognition error rates (%) for the E set . . . . . . . . . 104

7.2 Speaker-dependent recognition error rates (%) for the E set . . . . . . 106

7.3 Isolated word recognition error rates (%) for the 10-digit set . . . . . 107

7.4 Isolated word recognition error rates (%) for the 10-command set . . 107

7.5 Isolated word recognition error rates (%) for the 46-word set . . . . . 108

7.6 Speaker-dependent recognition error rates (%) for the 46-word set . . 117

7.7 Speaker identification error rates (%) for the ANDOSL corpus using

conventional GMMs, FE-GMMs and FCM-GMMs. . . . . . . . . . . 118

7.8 Speaker identification error rates (%) for the YOHO corpus using con-

ventional GMMs, hard GMMs and VQ codebooks. . . . . . . . . . . . 119

7.9 EER Results (%) for the ANDOSL corpus using GMMs with different

background speaker sets. Rows in bold are the current normalisation

methods , others are the proposed methods. The index “nc” denotes

noise clustering-based methods. . . . . . . . . . . . . . . . . . . . . . 120

7.10 Equal Error Rate (EER) Results (%) for the YOHO corpus. Rows in

bold are the current normalisation methods , others are the proposed

methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


7.11 Comparisons of EER Results (%) for the YOHO corpus using GMMs,

hard GMMs and VQ codebooks. Rows in bold are the current normal-

isation methods, others are the proposed methods. . . . . . . . . . . . 122


List of Abbreviations

ANN artificial neural network

CHMM continuous hidden Markov model

DTW dynamic time warping

DHMM discrete hidden Markov model

EM expectation maximisation

FE fuzzy entropy

FCM fuzzy C -means

GMM Gaussian mixture model

HMM hidden Markov model

LPC linear predictive coding

MAP maximum a posteriori

ML maximum likelihood

MMI maximum mutual information

NC noise clustering

pdf probability density function

PCM possibilistic C -means

VQ vector quantisation


Chapter 1

Introduction

Research in automatic speech and speaker recognition by machine has been conducted

for more than four decades. Speech recognition is the process of automatically recog-

nising the linguistic content in a spoken utterance. Speaker recognition can be clas-

sified into two specific tasks: identification and verification. Speaker identification

is the process of determining who is speaking based on information obtained from

the speaker’s speech. Speaker verification is the process of accepting or rejecting the

identity claim of a speaker.

1.1 Current Approaches to Speech and Speaker

Recognition

Three current approaches to speech and speaker recognition by machine are the

acoustic-phonetic approach, the pattern-recognition approach and the artificial in-

telligence approach. The acoustic-phonetic approach is based on the theory of acous-

tic phonetics which postulates that there exist finite, distinctive phonetic units in

spoken language and that the phonetic units are broadly characterised by sets of

properties that are manifest in the speech signal or its spectrum over time. The

acoustic-phonetic approach is usually based on segmentation of the speech signal and

subsequent feature extraction. The main problem with the acoustic-phonetic ap-

proach is the variability of the acoustic properties of a phoneme depending on many

factors including acoustic context, speaker gender, age, emotional state, etc. The


pattern-recognition approach generally uses the speech patterns directly, i.e. without

explicit feature determination and segmentation. This method has two steps: training

of speech patterns, and recognition of patterns via pattern comparison. Finally, the

artificial intelligence approach is a hybrid of the acoustic-phonetic approach and the

pattern-recognition approach in that it exploits ideas and concepts of both methods,

especially the use of an expert system for segmentation and labelling, and the use of

neural networks for learning the relationships between phonetic events and all known

inputs [Rabiner and Juang 1993].

1.1.1 Statistical Pattern Recognition Approach

In general, the pattern-recognition approach is the method of choice for speech and

speaker recognition because of its simplicity of use, proven high performance, and

relative robustness and invariance to different speech vocabularies, users, algorithms

and decision rules. The most successful approach to speech and speaker recognition

is to treat the speech signal as a stochastic pattern and to use a statistical pattern

recognition technique. The statistical formulation has its root in the classical Bayes

decision theory, which links a recognition task to the distribution estimation problem.

In order to be practically implementable, the distributions are usually parameterised

and thus the distribution estimation problem becomes a parameter estimation prob-

lem, where a reestimation algorithm using a set of parameter estimation equations

is established to find the right parametric form of the distributions. The unknown

parameters defining the distribution have to be estimated from the training data. To

obtain reliable parameter estimates, the training set needs to be of sufficient size in

relation to the number of parameters. When the amount of the training data is not

sufficient, the quality of the distribution parameter estimates cannot be guaranteed.

In other words, the minimum Bayes risk generally remains an unachievable lower

bound [Huang et al. 1996].
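As a concrete illustration of the decision rule underlying this formulation (developed further in Section 2.3.1), the maximum a posteriori (MAP) rule selects the class with the largest posterior probability:

$$\hat{w} = \arg\max_{w} P(w \mid X) = \arg\max_{w} \frac{p(X \mid w)\,P(w)}{p(X)} = \arg\max_{w} p(X \mid w)\,P(w)$$

where $X$ is the observation sequence, $w$ ranges over the candidate classes (words or speakers), $p(X \mid w)$ is the class-conditional likelihood whose parametric form must be estimated from the training data, and $P(w)$ is the prior probability of the class; $p(X)$ may be dropped since it does not depend on $w$.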

1.1.2 Modelling Techniques: HMM, GMM and VQ

In the statistical pattern recognition approach, hidden Markov modelling of the speech

signal is the most important technique. It has been used extensively to model funda-


mental speech units in speech recognition because the hidden Markov model (HMM)

can adequately characterise both the temporally and spectrally varying nature of

the speech signal [Rabiner and Juang 1993]. In speech recognition, the left-to-right

HMM with only self and forward transitions is the simplest word and subword model.

In text-dependent speaker recognition, where training comprises spectral and temporal representations of specific utterances and testing must use the same utterances, the ergodic HMM with all possible transitions can be used for broad phonetic categorisation [Matsui and Furui 1992]. In text-independent speaker recognition, where training comprises a good representation of all the speaker's speech sounds and testing can be any utterance, there are no constraints on the training and test text and the temporal information has not been shown to be useful [Reynolds 1992]. The Gaussian mixture model (GMM) is used in this case. The GMM uses a mixture of Gaussian densities to model the distribution of speaker-specific feature vectors. When little data is

available, the vector quantisation (VQ) technique is also effective in characterising

speaker-specific features. The VQ model is a codebook and is generated by clustering

the training feature vectors of each speaker.
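To make the GMM-based speaker modelling concrete, the following minimal sketch (not taken from the thesis; the diagonal-covariance assumption, parameter names and NumPy usage are illustrative only) scores a sequence of feature vectors against a set of speaker GMMs and picks the most likely speaker:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors X (T x d) under a
    diagonal-covariance Gaussian mixture with K components.
    weights: (K,), means: (K, d), variances: (K, d)."""
    T, d = X.shape
    diff = X[:, None, :] - means[None, :, :]                         # (T, K, d)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    # log sum_k w_k N(x | mu_k, sigma_k^2), computed via log-sum-exp for stability
    log_weighted = np.log(weights)[None, :] + log_comp
    m = log_weighted.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return frame_ll.mean()

def identify_speaker(X, speaker_models):
    """Return the speaker whose GMM scores highest (MAP rule with equal priors).
    speaker_models maps a speaker name to a (weights, means, variances) tuple."""
    scores = {name: gmm_log_likelihood(X, *model) for name, model in speaker_models.items()}
    return max(scores, key=scores.get)
```

A VQ codebook can be scored analogously by accumulating, over the utterance, the distance of each feature vector to its nearest codeword.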

1.2 Fuzzy Set Theory-Based Approach

An alternative successful approach in pattern recognition is the fuzzy set theory-

based approach. Fuzzy set theory was introduced by Zadeh [1965] to represent and

manipulate data and information that possess nonstatistical uncertainty. Fuzzy set

theory is a generalisation of conventional set theory that was introduced as a new

way of representing the vagueness or imprecision that is ever present in our daily

experience as well as in natural language [Bezdek 1993].

1.2.1 The Membership Function

The membership function is the basic idea in fuzzy set theory. The membership of a

point in a fuzzy set represents the degree to which the point belongs to this fuzzy set.

The first fuzzy approach to speech and speaker recognition was the use of the member-

ships as discriminant functions in decision making [Pal and Majumder 1977]. In fuzzy

clustering, the memberships are not known in advance and have to be estimated from


a training set of observations with known class labels. The membership estimation

procedure in fuzzy pattern recognition is called the abstraction [Bellman et al. 1966].

1.2.2 Clustering Techniques: FCM and NC

The most successful technique in fuzzy cluster analysis is fuzzy C-means (FCM) clus-

tering, which is widely used in both theory and practical applications of fuzzy clustering

techniques to unsupervised classification [Zadeh 1977]. FCM clustering [Dunn 1974,

Bezdek 1973] is an extension of hard C-means (HCM), also known as K-means clus-

tering [Duda and Hart 1973]. A general estimation procedure for the FCM technique

has been established and its convergence has been shown [Bezdek and Pal 1992].

However, the FCM technique has a problem of sensitivity to outliers. The sum

of the memberships of a feature vector across classes is always equal to one both for

clean data and for noisy data. It would be more reasonable that, if the feature vector

comes from noisy data or outliers, the memberships should be as small as possible

for all classes and the sum should be smaller than one. This property is important

since all parameter estimates are computed based on these memberships. An idea of

a noise cluster has been proposed [Dave 1991] in the noise clustering (NC) technique

to deal with noisy data or outliers.
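The following sketch illustrates this numerically; it is illustrative only: the FCM membership formula is the standard one, and the noise cluster is assumed to sit at a fixed distance δ from every feature vector, following Dave's formulation.

```python
import numpy as np

def fcm_memberships(distances, m=2.0, noise_delta=None):
    """Fuzzy C-means memberships of one feature vector in C clusters.

    distances   : distances d_i to the C cluster prototypes (assumed > 0).
    m           : degree of fuzziness (m > 1).
    noise_delta : if given, a noise cluster at fixed distance delta is added
                  (noise clustering), so the memberships in the real clusters
                  sum to less than one for outliers."""
    d = np.asarray(distances, dtype=float)
    p = 2.0 / (m - 1.0)
    ratios = (d[:, None] / d[None, :]) ** p          # (d_i / d_j)^(2/(m-1))
    if noise_delta is None:
        return 1.0 / ratios.sum(axis=1)              # standard FCM: sums to 1
    return 1.0 / (ratios.sum(axis=1) + (d / noise_delta) ** p)

# A vector close to cluster 0 keeps most of its membership there,
# while a distant outlier gets small memberships everywhere under noise clustering.
print(fcm_memberships([0.5, 3.0, 4.0]))                      # sums to 1
print(fcm_memberships([8.0, 9.0, 10.0], noise_delta=2.0))    # sums to < 1
```

With the noise cluster present, an outlier far from all prototypes receives small memberships everywhere, so it contributes little to the parameter estimates.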

All the above-mentioned subjects are reviewed with respect to speech and speaker

recognition, statistical modelling and clustering in Chapter 2.

1.3 Problem Statement

In general, the most successful approach in speech and speaker recognition is the

statistical pattern recognition approach where the HMM is the most important tech-

nique and with the GMM and VQ also used in speaker recognition. However, the

performance of these techniques degrades rapidly in the context of insufficient train-

ing data and in the presence of noise or distortion. Fuzzy approaches with their

adjustable parameters can hopefully reduce such degradation. In pattern recogni-

tion, the fuzzy set theory-based approach is one of the most successful approaches and

FCM clustering is the most important technique. Therefore, to obtain fuzzy pattern

recognition approaches to statistical methods in speech and speaker recognition, we


need to solve the following basic problems. In the training phase, these are: 1) how

to determine the fuzzy membership functions for the statistical models and 2) how

to estimate these fuzzy membership functions from a training set of observations?

In the recognition phase, the basic problem is: 3) how to use the fuzzy membership

functions as discriminant functions for recognition?

For the first problem, we begin with the HMM technique in the statistical pattern

recognition approach and through considering the HMM, we show how the concept of

the fuzzy membership function can be used. In hidden Markov modelling, the under-

lying assumption is that the observation sequence obtained from speech processing can

be characterised as generated by a stochastic process. Each observation is regarded

as the output of another stochastic process—the hidden Markov process—which is

governed by the output probability distribution. A first-order hidden Markov process

consists of a finite state sequence where the initial state is governed by an initial

state distribution and state transitions which occur at discrete time t are governed

by transition probabilities which only depend on the previous state. The observation

sequence is regarded as being produced by different state sequences with correspond-

ing probabilities. This situation can be more effectively represented by using fuzzy

set theory, where states at each time t are regarded as time-dependent fuzzy sets and

called fuzzy states. The time-dependent fuzzy membership u_{st=i}(ot) can be defined as the degree of belonging of the observation ot to the fuzzy state st = i at time t.

However, the observations are always considered in the sequence O and related to

the state sequence S. Fuzzy state sequences are thus also defined as a sequence of

fuzzy states in time and the fuzzy membership function is defined for the observation

sequence O in fuzzy state sequence S, based on the fuzzy membership function of

the observation ot in the fuzzy state st [Tran and Wagner 1999a]. For example, to

compute the state transition matrix A, we consider fuzzy states at time t and time

t+1 included in corresponding fuzzy state sequences and define the fuzzy membership

function u_{st=i, st+1=j}(O). This membership denotes the degree of belonging of the ob-

servation sequence O to fuzzy state sequences being in fuzzy state st = i at time t and

fuzzy state st+1 = j at time t+1. In this approach, probability and fuzziness are com-

plementary rather than competitive. For the HMM, probability deals with stochastic

processes for the observation sequence and state sequences whereas fuzziness deals


with the relationship between these sequences [Tran and Wagner 1999c].

For the second problem, estimating the fuzzy membership functions is based

on the selection of an optimisation criterion. The minimum squared-error crite-

rion used in the FCM clustering and the NC techniques is very effective in clus-

ter analysis [Bezdek and Pal 1992, Dave 1990], and thus it can be applied to statisti-

cal modelling techniques. However, there are two sub-problems to be solved before

applying it. The first question is, how to obtain a relationship between the clus-

tering and modelling techniques, since the goal of clustering techniques is to find

optimal partitions of data [Dave and Krishnapuram 1997] whereas the goal of sta-

tistical modelling techniques is to find the right parametric form of the distributions

[Juang et al. 1996]. For this sub-problem, a general distance for clustering techniques

is proposed [Tran and Wagner 1999b]. The distance is defined as a decreasing func-

tion of the component probability density, and hence grouping similar feature vectors

into a cluster becomes classification of these vectors into a component distribution.

Clusters are now represented by component distribution functions and hence charac-

teristics of a cluster are not only its shape and location, but also the data density

in the cluster and possibly the temporal structure of data if the Markov process is

applied [Tran and Wagner 2000a]. Finding good partitions of data in clustering tech-

niques thus leads to finding the right parametric form of distributions in the modelling

techniques. The second sub-problem is the relationship between the fuzzy and con-

ventional models. Fuzzy models using FCM clustering can reduce to hard models

using HCM clustering if the degree of fuzziness m > 1 tends to 1 [Tran et al. 2000a].

However, the conventional HMM is not a hard model since there is more than one

possible state at each time t; therefore the corresponding relationship between the fuzzy HMM and the conventional HMM cannot be established in the same way, since the hard HMM has not yet been defined. To solve this problem, we propose an alternative clustering technique called

fuzzy entropy (FE) clustering [Tran and Wagner 2000f] and apply this technique to

the HMM to obtain the fuzzy entropy HMM (FE-HMM) [Tran and Wagner 2000b].

The degree of fuzzy entropy n > 0 is introduced in the FE-HMM. As n tends to 1,

FE-HMMs reduce to conventional HMMs. We also propose the hard HMM where

only the best state sequence is employed by using a binary (zero-one) membership

function [Tran et al. 2000a]. A hard GMM is also proposed, which employs only


the most likely Gaussian distribution among the mixture of Gaussians to represent a

feature vector [Tran et al. 2000b].
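The exact FE objective and membership functions are derived in Chapter 3. Purely as an illustration of the limiting behaviour described here, the sketch below assumes an exponential (softmax-like) membership of the form u_i ∝ exp(−d_i²/n), which approaches a hard 0/1 assignment to the nearest cluster as n → 0 and a uniform assignment as n → ∞; it is not the thesis's own formulation.

```python
import numpy as np

def fe_memberships(d2, n=1.0):
    """Fuzzy-entropy-style memberships under the assumed form u_i ∝ exp(-d_i^2 / n).

    d2 : squared distances d_i^2 of one feature vector to the C clusters.
    n  : degree of fuzzy entropy (n > 0); small n -> nearly hard assignment,
         large n -> memberships approach 1/C (a single diffuse cluster)."""
    d2 = np.asarray(d2, dtype=float)
    w = np.exp(-(d2 - d2.min()) / n)   # shift by the minimum for numerical stability
    return w / w.sum()

print(fe_memberships([0.2, 1.0, 2.5], n=1.0))    # graded memberships
print(fe_memberships([0.2, 1.0, 2.5], n=0.01))   # nearly hard (0/1) assignment
```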

For the third problem, it can be seen that the roles of the fuzzy membership

function in fuzzy set theory and of the a posteriori probability in the Bayes deci-

sion theory are quite similar. Therefore the currently used maximum a posteriori

(MAP) decision rule can be generalised to the maximum fuzzy membership decision

rule [Tran et al. 1998b, Tran et al. 1998d]. Depending on which fuzzy technique is

applied, we can find a suitable form for the fuzzy membership function. For example,

in speaker verification, the fuzzy membership of an input utterance in the claimed

speaker’s fuzzy set of utterances is used as a similarity score to compare with a given

threshold in order to accept or reject this speaker. The fuzzy membership function

is determined as the ratio of functions of the claimed speaker’s and impostors’ like-

lihood functions [Tran and Wagner 2000c, Tran and Wagner 2000d].
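The specific transformation and normalisation functions are developed in Chapter 6. The sketch below is deliberately generic (the particular ratio used in the thesis differs) and only illustrates the idea of turning the claimed speaker's and impostors' likelihoods into a bounded, membership-like score that is compared with a threshold:

```python
import numpy as np

def membership_score(log_l_claimed, log_l_impostors):
    """Generic membership-style score in [0, 1]: the claimed speaker's likelihood
    normalised by the likelihoods of the claimed speaker and a set of impostor
    (background) models. An illustrative stand-in, not the Chapter 6 methods."""
    log_ls = np.array([log_l_claimed] + list(log_l_impostors))
    probs = np.exp(log_ls - log_ls.max())      # rescale before exponentiating
    return probs[0] / probs.sum()

def verify(log_l_claimed, log_l_impostors, threshold=0.5):
    """Accept the identity claim if the membership score reaches the threshold."""
    return membership_score(log_l_claimed, log_l_impostors) >= threshold

print(verify(-41.2, [-44.0, -45.5, -43.8]))    # True: the claimed model fits best
```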

1.4 Contributions of This Thesis

Based on solving the above-mentioned problems, fuzzy approaches to speech and

speaker recognition are proposed and evaluated in this thesis as follows.

1.4.1 Fuzzy Entropy Models

This fuzzy approach is presented in Chapter 3. FE models are based on a basic

algorithm termed FE clustering. The goal of this approach is not only to propose

a new fuzzy approach but also to show that statistical models, such as the HMM

and the GMM in the maximum likelihood scheme, can be viewed as fuzzy mod-

els, where probabilities of unobservable data, given observable data, are used as

fuzzy membership functions. Fuzzy entropy clustering, the maximum fuzzy like-

lihood criterion, the fuzzy EM algorithm, the fuzzy membership function as well

as FE-HMMs, FE-GMMs and FE-VQ and their NC versions are proposed in this

chapter [Tran and Wagner 2000a, Tran and Wagner 2000b, Tran and Wagner 2000f,

Tran and Wagner 2000g, Tran 1999].

The adjustment of the degree of fuzzy entropy n in FE models is an advantage.

When conventional models do not work well because of the insufficient training data


problem or the complexity of the speech data, such as the nine English E-set letters,

a suitable value of n can be found to obtain better models.

1.4.2 Fuzzy C-Means Models

Chapter 4 presents this fuzzy approach. It is based on FCM clustering in fuzzy

pattern recognition. FCM models are estimated by the minimum fuzzy squared-error

criterion used in FCM clustering. The fuzzy EM algorithm is reformulated for this

criterion. FCM-HMMs, FCM-GMMs and FCM-VQ are respectively presented as well

as their NC versions. A discussion on the role of fuzzy memberships of FCM models

and a comparison between FCM models and FE models are also presented. Similarly

to FE models, FCM models also have an adjustable parameter called the degree of

fuzziness m > 1. Better models can be obtained in the case of the insufficient training

data problem and the complexity of speech data problem using a suitable value of m

[Tran and Wagner 1999a]-[Tran and Wagner 1999e].

1.4.3 Hard Models

As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both

fuzzy entropy and fuzzy C-means models tend to the same limit which is the cor-

responding hard model. The simplest hard model is the VQ model, which is effec-

tive for speaker recognition. Chapter 5 proposes new hard models: hard HMMs

and hard GMMs. These models emerge as interesting consequences of investigat-

ing fuzzy approaches. The hard HMM employs only the best path for estimating

model parameters and for recognition and the hard GMM employs only the most

likely Gaussian distribution among the mixture of Gaussians to represent a fea-

ture vector. Hard models can be very useful because they are simple yet efficient

[Tran et al. 2000a, Tran et al. 2000b].

1.4.4 A Fuzzy Approach to Speaker Verification

An even more interesting fuzzy approach is proposed in Chapter 6. The speaker ver-

ification process is reconsidered from the viewpoint of fuzzy set theory and hence

a likelihood transformation and seven fuzzy normalisation methods are proposed


[Tran and Wagner 2000c, Tran and Wagner 2000d]. This fuzzy approach also leads

to a noise clustering-based version for all normalisation methods, which improves

speaker verification performance markedly.

1.4.5 Evaluation Experiments and Results

The evaluation of FE, FCM and hard models is presented in Chapter 7. Proposed

normalisation methods for speaker verification are also evaluated in this chapter.

The three speech corpora used in the experiments were the TI46, the ANDOSL and

the YOHO corpora. Isolated word recognition experiments were performed on the

E set, 10-digit set, 10-command set and 46-word set of the TI46 corpus. Speaker

identification and verification experiments were performed on the TI46 (16 speakers),

the ANDOSL (108 speakers) and the YOHO (138 speakers) corpora. Experiments

show that fuzzy models and their noise clustering versions outperform conventional

models in most of the experiments. Hard hidden Markov models also achieved good

results.

1.5 Extensions of This Thesis

The fuzzy membership function in fuzzy set theory and the a posteriori probability in

the Bayes decision theory have very similar meanings; however, it can be shown that

the minimum Bayes risk for the recognition process is obtained by the maximum a

posteriori probability rule whereas the maximum membership rule does not lead to

such a minimum risk. This problem can be overcome by using a well developed branch

of fuzzy set theory, namely possibility theory. In our view, a possibilistic pattern recognition approach is nearly as developed as the statistical pattern recognition

approach. In the last chapter, we present the fundamentals of possibility theory and

propose a possibilistic C-means approach to speech and speaker recognition. Future

research into the possibility approach is suggested.


Chapter 2

Literature Review

This chapter provides a background review of statistical modelling techniques in speech

and speaker recognition and clustering techniques in pattern recognition. The essen-

tial characteristics of speech signals and speech processing are summarised in Section

2.1. An overview of the various disciplines required for understanding aspects of speech

and speaker recognition is presented in Section 2.2. Statistical modelling techniques are

overviewed in Section 2.3. In this section, we first attend to the basic issues in sta-

tistical pattern recognition techniques including Bayes decision theory, the distribution

estimation problem, maximum likelihood estimation and the expectation-maximisation

algorithm. Second, three widely used statistical modelling techniques—hidden Markov

modelling, Gaussian mixture modelling and vector quantisation—are described. Fuzzy

cluster analysis techniques are overviewed in Section 2.4. Fuzzy set theory, the fuzzy

membership function, the role of cluster analysis in pattern recognition and three basic

clustering techniques—hard C-means, fuzzy C-means and noise clustering—are reviewed

in this section. The last section 2.5 reviews the literature on fuzzy approaches to speech

and speaker recognition.

2.1 Speech Characteristics

Speech is the most natural means of communication among human beings; therefore, it

plays a key role in the development of a natural interface to enhance human-machine

communication. This section briefly presents the nature of speech sounds and features


of speech signals that lead to methods used to process speech.

2.1.1 Speech Sounds

Speech is produced as a sequence of speech sounds corresponding to the message

to be conveyed. The state of the vocal cords as well as the positions, shapes, and

sizes of the various articulators, change over time in the speech production process

[O’Shaughnessy 1987]. There are three states of the vocal cords: silence, unvoiced and

voiced. Unvoiced sounds are produced when the glottis is open and the vocal cords

are not vibrating, so the resulting speech waveform is aperiodic or random in nature.

Voiced sounds are produced by forcing air through the glottis with the tension of the

vocal cords adjusted so that they vibrate in a relaxation oscillation with a resulting

speech waveform which is quasi-periodic [Rabiner and Schafer 1978]. Note that the

segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced

signals is not exact. It is difficult to distinguish a weak, unvoiced sound from silence, or

a weak, voiced sound from unvoiced sounds or even silence [Rabiner and Juang 1993].

Phonemes are the smallest distinctive class of individual speech sounds in a lan-

guage. The number of phonemes varies according to different linguists. Vowel

sounds are produced by exciting an essentially fixed vocal tract shape with quasi-

periodic pulses of air caused by the vibration of the vocal cords. Vowels have the

largest amplitudes among phonemes and range in duration from 50 to 400 ms in nor-

mal speech [O’Shaughnessy 1987]. Diphthong sounds are gliding monosyllabic speech

sounds that start at or near the articulatory position for one vowel and move to or to-

ward the position for another. They are produced by varying the vocal tract smoothly

between vowel configurations appropriate to the diphthong [Rabiner and Juang 1993].

The group of sounds consisting of /w/, /l/, /r/, and /j/ is quite difficult to charac-

terise: /l/ and /r/ are called semivowels because of their vowel-like nature, while /w/ and

/j/ are called glides because they are generally characterised by a brief gliding tran-

sition of the vocal tract shape between adjacent phonemes. The nasal consonants

/m/, /n/, and /η/ are produced with glottal excitation and the vocal tract totally

constricted at some point along the oral passageway while the velum is open and

allows air flow through the nasal tract. The unvoiced fricatives /f/, /θ/, /s/, and

/sh/ are produced by air flowing over a constriction in the vocal tract, with the lo-


cation of the constriction determining the particular fricative sound produced. The

voiced fricatives /v/, /th/, /z/, and /zh/ are the counterparts of the unvoiced frica-

tives /f/, /θ/, /s/, and /sh/, respectively. For voiced fricatives, the vocal cords

are vibrating, and thus one excitation source is at the glottis. The unvoiced stop

consonants /p/, /t/, and /k/, and the voiced stop consonants /b/, /d/, and /g/

are transient, noncontinuant sounds produced by building up pressure behind a to-

tal constriction somewhere in the oral tract and then suddenly releasing the pres-

sure. For the voiced stop consonants, the release of sound energy is accompanied

by vibrating vocal cords while for the unvoiced stop consonants the glottis is open

[Harrington and Cassidy 1996, Rabiner and Juang 1993, Flanagan 1972].

2.1.2 Speech Signals

The speech signal produced by the human vocal system is one of the most complex

signals known. In addition to the inherent physiological complexity of the human

vocal tract, the physical production system differs from one person to another. Even

when an utterance is repeated by the same person, the observed speech signal is

different each time. Moreover, the speech signal is influenced by the speaking envi-

ronment, the channel used to transmit the signal, and, when recording it, also by the

transducer used to capture the signal [Rabiner et al. 1996].

The speech signal is a slowly time varying signal in the sense that, when exam-

ined over a sufficiently short period of time (between 5 and 100 ms, depending on the speech sound), its characteristics are approximately stationary. But over longer peri-

ods of time, the signal characteristics are non-stationary. They change to reflect the

sequence of different speech sounds being spoken [Juang et al. 1996]. An illustration

of this characteristic is given in Figure 2.1, which shows the time waveform corre-

sponding to the word “one” as spoken by a female speaker. The non-stationarity is

observed in the long period of time from t = 0.3 sec to t = 0.6 sec (300 msec) in

Figure 2.1.a. In the short period of time from t = 0.4 sec to t = 0.42 sec (20 msec),

Figure 2.1.b shows the stationarity of the speech signal.

The “quasi-stationarity” is the first characteristic of speech that distinguishes

it from other random, non-stationary signals. Based on this characterisation of the

speech signal, a reasonable speech model should have the following components. First,


Figure 2.1: The speech signal of the utterance “one” (a) in the long period of time

from t = 0.3 sec to t = 0.6 sec and (b) in the short period of time from t = 0.4 sec to

t = 0.42 sec.

short-time measurements at an interval of the order of 10 ms are to be made along

the pertinent speech dimensions that best carry the relevant information for linguis-

tic or speaker distinction. Second, because of the existence of the quasi-stationary

region, the neighbouring short-time measurements on the order of 100 ms need to

be simultaneously considered, either as a group of identically and independently dis-

tributed observations or as a segment of a non-stationary random process covering

two quasi-stationary regions. The last component is a mechanism that describes the

sound change behaviour among the sound segments in the utterance. This character-

istic takes into account the implicit structure of the utterance, words, syntax, and so

on in a probability distribution sense [Juang et al. 1996].
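As a small illustration of the first component (short-time analysis on the order of 10 ms), the sketch below splits a signal into overlapping frames; the 25 ms window and 10 ms shift are common illustrative choices, not values prescribed by the thesis.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a speech signal into overlapping short-time frames.
    Each frame is short enough to be treated as quasi-stationary."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len] for i in range(n_frames)])

# e.g. 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)
```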

2.1.3 Speech Processing

The speech signal can be parametrically represented by a number of variables related

to short time energy, fundamental frequency and sound spectrum. Probably the most

important parametric representation of speech is the short time spectral envelope, in


which the two most common choices of spectral analysis are the filterbank and the

linear predictive coding spectral analysis models. In the filterbank model, the speech

signal is passed through a bank of Q bandpass filters whose coverage spans the fre-

quency range of interest in the signal (e.g., 300-3000 Hz for telephone-quality signals,

50-8000 Hz for broadband signals). The individual filters can overlap in frequency

and the output of a bandpass filter is the short-time spectral representation of the

speech signal. The linear predictive coding (LPC) model performs spectral analy-

sis on blocks of speech (speech frames) with an all-pole modelling constraint. Each

individual frame is windowed so as to minimise the signal discontinuities at the be-

ginning and end of each frame. Thus the output of the LPC spectral analysis block

is a vector of coefficients that specify the spectrum of an all-pole model that best

matches the speech signal [Rabiner and Schafer 1978]. The most common parameter

set derived from either LPC or filterbank spectra is a vector of cepstral coefficients.

The cepstral coefficients cm(t), m = 1, . . . , Q, which are the coefficients of the Fourier

transform representation of the log magnitude spectrum of speech frame t, have been

shown to be a more robust, reliable feature set for speech recognition than the spec-

tral vectors. The temporal cepstral derivatives (also known as delta cepstra) are

often used as additional features to model trajectory information [Campbell 1997].

For each speech frame, the result of the feature analysis is a vector of Q weighted

cepstral coefficients and an appended vector of Q cepstral time derivatives as follows

[Rabiner and Juang 1993]

$$x'_t = \left(\hat{c}_1(t), \ldots, \hat{c}_Q(t), \Delta\hat{c}_1(t), \ldots, \Delta\hat{c}_Q(t)\right) \qquad (2.1)$$

where $t$ is the index of the speech frame, $x'_t$ is the transpose of the vector $x_t$ with $2Q$ components, and $\hat{c}_m(t) = w_m\,c_m(t)$, where $w_m$ is the weighting function that truncates the computation and de-emphasises $c_m$ around $m = 1$ and around $m = Q$:

$$w_m = 1 + \frac{Q}{2}\sin\left(\frac{\pi m}{Q}\right), \qquad 1 \le m \le Q \qquad (2.2)$$

If second-order temporal derivatives $\Delta^2 c_m(t)$ are computed, these are appended to the vector $x_t$, giving a vector with $3Q$ components.
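A minimal sketch of building the feature vectors of Eq. (2.1) from raw cepstra follows; the weighting implements Eq. (2.2), while the simple first-difference delta is only one possible choice for the temporal derivative (regression over neighbouring frames is another common choice).

```python
import numpy as np

def weighted_cepstra_with_deltas(C):
    """C: array of shape (T, Q) of raw cepstral coefficients c_m(t).
    Returns the 2Q-dimensional feature vectors of Eq. (2.1):
    weighted cepstra followed by their temporal derivatives."""
    T, Q = C.shape
    m = np.arange(1, Q + 1)
    w = 1.0 + (Q / 2.0) * np.sin(np.pi * m / Q)       # weighting of Eq. (2.2)
    C_hat = C * w                                     # weighted cepstra w_m c_m(t)
    # A simple first difference stands in for the delta cepstra here.
    deltas = np.vstack([np.zeros((1, Q)), np.diff(C_hat, axis=0)])
    return np.hstack([C_hat, deltas])                 # shape (T, 2Q)

features = weighted_cepstra_with_deltas(np.random.randn(100, 12))
print(features.shape)   # (100, 24)
```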

A block diagram of the LPC front-end processor is shown in Figure 2.2. The speech

signal is pre-emphasised for spectral flattening and then blocked into frames. Frames


Figure 2.2: Block diagram of the LPC front-end processor for speech and speaker recognition (Digitised Speech → Pre-emphasis → Frame Blocking → Windowing → Parameter Computing → Parameter Weighting → Temporal Derivative, producing the weighted cepstra c_m(t) and the delta cepstra Δc_m(t))

are Hamming windowed, a typical window used for the autocorrelation method of

LPC. Then the cepstral coefficients cm(t) weighted by wm and the temporal cepstral

derivatives are computed for each frame.

2.1.4 Summary

The quasi-stationarity is an important characteristic of speech. The speech signal

after spectral analysis is converted into a feature vector sequence. The current most

commonly used short-term measurements are cepstral coefficients which form a ro-

bust, reliable feature set of speech for speech and speaker recognition.

2.2 Speech and Speaker Recognition

Recognising the linguistic content in a spoken utterance and identifying the talker of the utterance through developing algorithms and implementing them on machines are the goals of speech and speaker recognition [C.-H. Lee et al. 1996]. A brief overview

of some of the fundamental aspects of speech and speaker recognition is given in this

section.

2.2.1 Speech Recognition

Broadly speaking, there are three approaches to speech recognition by machine,

namely, the acoustic-phonetic approach, the pattern-recognition approach, and the

artificial intelligence approach. The acoustic-phonetic approach is based on the theory


of acoustic phonetics that postulates that there exists a finite set of distinctive pho-

netic units in spoken language and that the phonetic units are broadly characterised

by sets of properties that are manifest in the speech signal or its spectrum over time.

The problem with this approach is the fact that the degree to which each phonetic

property is realised in the acoustic signal varies greatly between speakers, between

phonetic contexts and even between repeated realisations of the phoneme by the same

speaker in the same context. This approach generally requires the segmentation of the

speech signal into acoustic-phonetic units and the identification of those units through

their known properties or features. The pattern-recognition approach is basically one

in which the speech patterns are used directly without explicit feature determina-

tion and segmentation. This method has two steps: training of speech patterns, and

recognition of patterns via pattern comparison. The artificial intelligence approach is

a hybrid of the acoustic-phonetic approach and the pattern-recognition approach in

that it exploits ideas and concepts of both methods, especially the use of an expert

system for segmentation and labelling, and the use of neural networks for learning the

relationships between phonetic events and all known inputs. Currently, the pattern-

recognition approach is the method of choice for speech recognition because of its

simplicity of use, proven high performance, robustness to different acoustic-phonetic

realisations and invariance to different speech vocabularies, users, algorithms and

decision rules [Rabiner and Juang 1993].

Depending on the mode of speech that the system is designed to handle, three

tasks of speech recognition can be distinguished: isolated-word, connected-word, and

continuous speech recognition. Continuous speech recognition allows natural conver-

sational speech—150-250 words/min, with little or no adaptation of speaking style

imposed on system users. Isolated-word recognition requires the speaker to pause for

at least 100-250 ms after each word. It is unnatural for speakers and slows the pro-

cessing rate to about 20-100 words/min. Continuous speech recognition is much more

difficult than isolated word recognition due to the absence of word boundary informa-

tion. Connected-word speech recognition represents a compromise between the two ex-

tremes, the speaker need not pause but must pronounce and stress each word clearly

[O’Shaughnessy 1987]. Restrictions on the vocabulary size differentiate speech recog-

nition systems. Small vocabulary is about 100-200 words, large vocabulary is about


1000 words and very large vocabulary—5000 words or greater. Another factor af-

fecting speech recognition performance is speaker dependence/independence. Gener-

ally, speaker-dependent systems achieve better recognition performance than speaker-

independent systems—identifying speech from many talkers—because of the limited

variability in the speech signal coming from a single speaker. Speaker-dependent sys-

tems demonstrate good performance only for speakers who have previously trained

the system [Kewley-Port 1995].

Research in automatic speech and speaker recognition by machine has been con-

ducted for almost four decades and the earliest attempts to devise systems for auto-

matic speech recognition by machine were made in the 1950s. Several fundamental

ideas in speech recognition were published in the 1960s and speech-recognition re-

search achieved a number of significant milestones in the 1970s. Just as isolated

word recognition was a key focus of research in the 1970s, the problem of connected

word recognition was a focus of research in the 1980s. Speech research in the 1980s

was characterised by a shift in technology from template-based approaches to statisti-

cal modelling methods, especially the hidden Markov model approach [Rabiner 1989].

Since then, hidden Markov model techniques have become widely applied in virtually

every speech-recognition system.

Speech recognition systems have been developed for a wide variety of applications

both within telecommunications and in the business arena. In telecommunications, a

speech recognition system can provide information or access to data or services over

the telephone line. It can also provide recognition capability on the desktop/office

including voice control of PC and workstation environments. In manufacturing and

business, a recognition capability is provided to aid in the manufacturing processes.

Other applications include the use of speech recognition in toys and games. A signif-

icant portion of the research in speech processing in the past few years has gone into

studying practical methods for speech recognition in the world. In the United States,

major research efforts have been carried out at AT&T (the Next-Generation Text-to-

Speech System) [AT&T’s web site], IBM (ViaVoice Speech Recognition and the TAN-

GORA System) [IBM’s web site, Das and Picheny 1996], BBN (the BYBLOS and

SPIN Systems) [BBN’s web site], Dragon (the Dragon NaturallySpeaking Products)

[DRAGON’s web site], CMU (the SPHINX-II Systems) [Huang et al. 1996], Lincoln


Laboratory [Paul 1989] and MIT (the Spoken Language Systems) [MIT’s web site].

The Hearing Health Care Research Unit Projects [Western Ontario’s web site] and

the INRS 86,000-word isolated word recognition system in Canada as well as the

Philips rail information system, the CSELT system for Eurorail information ser-

vices [CSELT’s web site], the University of Duisburg [Duisburg’s web site], the Cam-

bridge University systems [Cambridge’s web site], and the LIMSI voice recognition

[LIMSI’s web site] in Europe, are examples of the current activity in speech recog-

nition research. Large vocabulary recognition systems are being developed based on

the concept of interpreting telephony and telephone directory assistance in Japan.

Syllable-recognisers have been designed to handle large vocabulary Mandarin dicta-

tion in China and Taiwan [Rabiner et al. 1996].

2.2.2 Speaker Recognition

Compared to speech recognition, there has been much less research in speaker recogni-

tion because fewer applications exist than for speech recognition. Speaker recognition

is the process of automatically recognising who is speaking based on information ob-

tained from speech waves. Speaker recognition techniques can be used to verify the

identity claimed by people accessing certain protected systems. It enables access con-

trol of various services by voice. Voice dialing, banking over a telephone network,

database access services, security control for confidential information, remote access

of computers and the use for forensic purposes are important applications of speaker

recognition technology [Furui 1997, Kunzel 1994].

Variation in signal characteristics from trial to trial is the most important fac-

tor affecting speaker recognition performance. Variations arise not only between

speakers themselves but also from differences in recording and transmission condi-

tions, noise and from a variety of psychological and physiological functions within

an individual speaker. Normalisation and adaptation techniques have been applied

to compensate for these variations [Matsui and Furui 1993, Rosenberg et al. 1992,

Higgins et al. 1991, Varga and Moore 1990, Gish 1990]. Speaker recognition can be

classified into two specific tasks: identification and verification. Speaker identifica-

tion is the process of determining which one of the voices known to the system best

matches the input voice sample. When an unknown speaker must be identified as one


of the set of known speakers, the task is known as closed-set speaker identification. If

the input voice sample does not have a close enough match to any one of the known speakers and the system can produce a “no match” decision [Reynolds 1992], the task

is known as open-set speaker identification. Speaker verification is the process of ac-

cepting or rejecting the identity claim of a speaker. An identity claim is made by

an unknown speaker, and an utterance of this unknown speaker is compared with

the model for the speaker whose identity is claimed. If the match is good enough,

that is, above a given threshold, the identity claim is accepted. The use of a “cohort

speaker” set that is representative of the population close to the claimed speaker

has been proposed [Rosenberg et al. 1992]. In all verification paradigms, there are

two classes of errors: false rejections and false acceptances. A false rejection occurs

when the system incorrectly rejects a true speaker and a false acceptance occurs when

the system incorrectly accepts an imposter. An equal error rate condition is often

used to adjust system parameters so that the two types of errors are equally likely

[O’Shaughnessy 1987].

Speaker recognition methods can also be divided into text-dependent and text-

independent. When the same text is used for both training and testing, the system is

said to be text-dependent. For text-independent operation, the text used to test the

system is theoretically unconstrained [Furui 1996]. Both text-dependent and inde-

pendent speaker recognition systems can be defeated by playing back recordings of a

registered speaker. To overcome this problem, a small set of pass phrases can be used,

one of which is randomly chosen every time the system is used [Higgins et al. 1991].

Another method is text-prompted speaker recognition, which prompts the user for a

new pass phrase on every occasion [Matsui and Furui 1993]. An extension of speaker

recognition technology is the automatic extraction of the “turns” of each speaker

from a dialogue involving two or more speakers [Gish et al. 1991, Siu et al. 1992,

Wilcox et al. 1994].

2.2.3 Summary

Statistical pattern recognition is the method of choice for speech and speaker recog-

nition, in which hidden Markov modelling of the speech signal is the most important

technique that has helped to advance the state of the art of automatic speech and


speaker recognition.

2.3 Statistical Modelling Techniques

This section begins with a brief review of the classical Bayes decision theory and

its application to the formulation of statistical pattern recognition problems. The

distribution estimation problem in classifier design is then discussed in the Bayes

decision theory framework. The maximum likelihood estimation is reviewed as one

of the most widely used parametric unsupervised learning methods. Finally, three

widely-used techniques—hidden Markov modelling, Gaussian mixture modelling and

vector quantisation modelling are reviewed in this framework.

2.3.1 Maximum A Posteriori Rule

The task of a recogniser (classifier) is to achieve the minimum recognition error rate.

A loss function is defined to measure this performance. It is generally non-negative

with a value of zero representing correct recognition [Juang et al. 1996]. Let X be

a random observation sequence from an information source, consisting of M classes

of events. The task of a recogniser is to correctly classify each X into one of the

M classes Ci, i = 1, 2, . . . ,M . Suppose that when X truly belongs to class Cj but

the recogniser classifies X as belonging to class Ci, we incur a loss ℓ(Ci|Cj). Since

the a posteriori probability P (Cj|X) is the probability that the true class is Cj, the

expected loss or risk associated with classifying X to class Ci is [Duda and Hart 1973]

R(Ci|X) = Σ_{j=1}^{M} ℓ(Ci|Cj) P(Cj|X)    (2.3)

In speech and speaker recognition, the following zero-one loss function is usually

chosen

ℓ(Ci|Cj) = 0 if i = j, and 1 if i ≠ j,   i, j = 1, . . . , M    (2.4)

This loss function assigns no loss to correct classification and a unit loss to any error

regardless of the class. The conditional loss becomes

R(Ci|X) = Σ_{j≠i} P(Cj|X) = 1 − P(Ci|X)    (2.5)


Therefore in order to achieve the minimum error rate classification, we have to select

the decision that Ci is correct, if the a posteriori probability P (Ci|X) is maximum

[Allerhand 1987]

C(X) = Ci if P(Ci|X) = max_{1≤j≤M} P(Cj|X)    (2.6)

The decision rule of (2.6) is called the maximum a posteriori (MAP) decision rule

and the minimum error rate achieved by the MAP decision is called Bayes risk

[Duda and Hart 1973].
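As a minimal illustration, the MAP rule of (2.6) reduces to a single arg-max over the a posteriori probabilities. The following Python sketch (with assumed inputs) makes this explicit.

    import numpy as np

    def map_decision(posteriors):
        """posteriors: length-M array with P(C_i | X) for i = 1, ..., M.
        Returns the index i* of the class with maximum a posteriori probability."""
        return int(np.argmax(posteriors))

    # Example: three classes with P(C_1|X)=0.2, P(C_2|X)=0.5, P(C_3|X)=0.3.
    # The MAP rule selects class 2; under the zero-one loss of (2.4) the
    # conditional risk of (2.5) for that decision is 1 - 0.5 = 0.5.
    print(map_decision(np.array([0.2, 0.5, 0.3])))   # -> 1 (zero-based index of C_2)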

2.3.2 Distribution Estimation Problem

For the implementation of the MAP rule, the required knowledge for an optimal

classification decision is thus the set of a posteriori probabilities. However, these

probabilities are not known in advance and have to be estimated from a training set

of observations with known class labels. The Bayes decision theory thus effectively

transforms the classifier design problem into the distribution estimation problem. This

is the basis of the statistical approach to pattern recognition [Juang et al. 1996].

The a posteriori probability can be computed by using the Bayes rule

P(Cj|X) = P(X|Cj) P(Cj) / P(X)    (2.7)

It can be seen from (2.7) that decision making based on the a posteriori probability

employs both a priori knowledge from the a priori probability P (Cj) together with

present observed data from the conditional probability P (X|Cj). For the simple

case of isolated word recognition, the observations are the word utterances and the

class labels are the word identities. The conditional probability P (X|Cj) is often

referred to as the acoustic model and the a priori probability P (Cj) is known as

the language model [C.-H. Lee et al. 1996]. In order to be practically implementable,

the acoustic models are usually parameterised, and thus the distribution estimation

problem becomes a parameter estimation problem, where a reestimation algorithm and

a set of parameter estimation equations are established to find the best parameter set

λi for each class Ci, i = 1, . . . ,M based on the given optimisation criterion. The final

task is to determine the right parametric form of the distributions and to estimate

the unknown parameters defining the distribution from the training data. To obtain


reliable parameter estimates, the training set needs to be of sufficient size in relation to

the number of parameters. However, collecting and labelling data are labor intensive

and resource demanding processes. When the amount of the training data is not

sufficient, the quality of the distribution parameter estimates cannot be guaranteed.

In other words, a true MAP decision can rarely be implemented and the minimum

Bayes risk generally remains an unachievable lower bound [Juang et al. 1996].

2.3.3 Maximum Likelihood Estimation

As discussed above, the distribution estimation problem P (X|C) in the acoustic mod-

elling approach becomes the parameter estimation problem P (X|λ). If λj denotes the

parameter set used to model a particular class Cj, the likelihood function of model

λj is defined as the probability P (X|λj) treated as a function of the model λj. Max-

imising P (X|λj) over λj is referred to as the maximum likelihood (ML) estimation

problem. It has been shown that if the model is capable of representing the true

distribution and enough training data is available, the ML estimate will be the best

estimate of the true parameters [Nadas 1983]. However, if the form of the distribution

is not known or the amount of training data is insufficient, the resulting parameter

set is not guaranteed to produce a Bayes classifier.

The expectation-maximisation (EM) algorithm proposed by Dempster, Laird and

Rubin [1977] is a general approach to the iterative computation of ML estimates

when the observation sequence can be viewed as incomplete data. Each iteration

of this algorithm consists of an expectation (E) step followed by a maximisation

(M) step. Many of its extensions and variations are popular tools for modal in-

ference in a wide variety of statistical models in the physical, medical and biological

sciences [Booth and Hobert 1999, Liu et al. 1998, Freitas 1998, Ambroise et al. 1997,

Ghahramani 1995, Fessler and Hero 1994, Liu and Rubin 1994].

In unsupervised learning, information on the class and state is unavailable, there-

fore the class and state are unobservable and only the data X are observable. Ob-

servable data are called incomplete data because they are missing the unobservable

data, and data composed both of observable data and unobservable data are called

complete data [Huang et al. 1990]. The purpose of the EM algorithm is to maximise

the log-likelihood log P (X|λ) from incomplete data. Suppose a measure space Y of


unobservable data exists corresponding to a measure space X of observable (incom-

plete) data. For given X ∈ X, Y ∈ Y, and the parameter model set λ, let P (X|λ)

and P (Y |λ) be probability distribution functions defined on X and Y respectively.

To maximise the log-likelihood of the observable data X over λ, we obtain

L(X,λ) = log P (X|λ) = log P (X,Y |λ) − log P (Y |X,λ) (2.8)

For two parameter sets λ and λ̄, the expectation of the incomplete log-likelihood L(X, λ̄) over the complete data (X, Y) conditioned by X and λ is

E[L(X, λ̄)|X, λ] = E[log P(X|λ̄)|X, λ] = log P(X|λ̄) = L(X, λ̄)    (2.9)

where E[·|X, λ] is the expectation conditioned by X and λ over complete data (X, Y).

Using (2.8), we obtain

L(X, λ̄) = Q(λ, λ̄) − H(λ, λ̄)    (2.10)

where

Q(λ, λ̄) = E[log P(X, Y|λ̄)|X, λ]   and   H(λ, λ̄) = E[log P(Y|X, λ̄)|X, λ]    (2.11)

The basis of the EM algorithm lies in the fact that if Q(λ, λ̄) ≥ Q(λ, λ) then L(X, λ̄) ≥ L(X, λ), since it follows from Jensen's inequality that H(λ, λ̄) ≤ H(λ, λ). This implies that L(X, λ) increases monotonically on any iteration of parameter updates from λ to λ̄ via maximisation of the Q-function. When Y is discrete, the Q-function and the H-function are represented as

Q(λ, λ̄) = Σ_{Y ∈ Y} P(Y|X, λ) log P(X, Y|λ̄)   and   H(λ, λ̄) = Σ_{Y ∈ Y} P(Y|X, λ) log P(Y|X, λ̄)    (2.12)

The following EM algorithm permits an easy maximisation of the Q-function

instead of maximising L(X,λ) directly.

Algorithm 1 (The EM Algorithm)

1. Initialisation: Fix Y and choose an initial estimate λ

2. E-step: Compute Q(λ, λ̄) based on the given λ

3. M-step: Use a certain optimisation method to determine λ̄ for which Q(λ, λ̄) ≥ Q(λ, λ)

4. Termination: Set λ = λ̄ and repeat from the E-step until the change of Q(λ, λ̄) falls below a preset threshold.

2.3.4 Hidden Markov Modelling

The underlying assumption of the HMM is that the speech signal can be well char-

acterised as a parametric random process, and that the parameters of the stochastic

process can be estimated in a precise, well-defined manner. The HMM method pro-

vides a reliable way of recognizing speech for a wide range of applications [Juang 1998,

Ghahramani 1997, Furui 1997, Rabiner et al. 1996, Das and Picheny 1996].

There are two assumptions in the first-order HMM. The first is the Markov as-

sumption, i.e. a new state is entered at each time t based on the transition probability,

which only depends on the previous state. It is used to characterise the sequence of

the time frames of a speech pattern. The second is the output-independence assump-

tion, i.e. the output probability depends only on the state at that time regardless

of when and how the state is entered [Huang et al. 1990]. A process satisfying the

Markov assumption is called a Markov model [Kulkarni 1995]. An observable Markov

model is a process where the output is a set of states at each instant of time and

each state corresponds to an observable event. The hidden Markov model is a doubly

stochastic process with an underlying Markov process which is not directly observable

(hidden) but which can be observed through another set of stochastic processes that

produce observable events in each of the states [Rabiner and Juang 1993].

Parameters and Types of HMMs

Let O = (o1 o2 . . . oT ) be the observation sequence, S = (s1 s2 . . . sT ) the unob-

servable state sequence, X = (x1 x2 . . .xT ) the continuous vector sequence, V =

{v1, v2, . . . , vK} the discrete symbol set, and N the number of states. A compact

notation λ = {π,A,B} is proposed to indicate the complete parameter set of the

HMM [Rabiner and Juang 1993], where

• π = {πi}, πi = P (s1 = i|λ), 1 ≤ i ≤ N : the initial state distribution;

• A = {aij}, aij = P (st+1 = j|st = i, λ), 1 ≤ i, j ≤ N , and 1 ≤ t ≤ T − 1:

the state transition probability distribution, denoting the transition probability


from state i at time t to state j at time t + 1; and

• B = {bj(ot)}, bj(ot) = P (ot|st = j, λ), 1 ≤ j ≤ N , and 1 ≤ t ≤ T : the

observation probability distribution, denoting the probability of generating an

observation ot in state j at time t with probability bj(ot).

One way to classify types of HMMs is by the structure of the transition matrix A

of the Markov chain [Huang et al. 1990]:

• Ergodic or fully connected HMM: every state can be reached from every other

state in a finite number of steps. The initial state probabilities and the state transition coefficients have the properties [Rabiner and Juang 1986]

0 ≤ πi ≤ 1,   Σ_{i=1}^{N} πi = 1   and   0 ≤ aij ≤ 1,   Σ_{j=1}^{N} aij = 1    (2.13)

• Left-to-right: as time increases, the state index increases or stays the same.

The state sequence must begin in state 1 and end in state N, i.e. πi = 0 if i ≠ 1 and πi = 1 if i = 1. The state-transition coefficients satisfy the following fundamental properties

aij = 0 for j < i,   0 ≤ aij ≤ 1,   and   Σ_{j=1}^{N} aij = 1    (2.14)

The additional constraint aij = 0, j > (i + Δi), where Δi > 0 is often placed

on the state-transition coefficients to make sure that large changes in state

indices do not occur [Rabiner 1989]. Such a model is called the Bakis model

[Bakis 1976], i.e. a left-to-right model which allows some states to be skipped.

An alternative way to classify types of HMMs is based on observations and their

representations [Huang et al. 1990]

• Discrete HMM (DHMM): the observations ot, 1 ≤ t ≤ T are discrete symbols

in V = {v1, v2, . . . , vK}, which are normally codevector indices of a VQ source-

coding technique, and

B = {bj(k)},   1 ≤ j ≤ N, 1 ≤ k ≤ K,

bj(k) = P(ot = vk|st = j, λ),   Σ_{k=1}^{K} bj(k) = 1    (2.15)


• Continuous HMM (CHMM): the observations ot ∈ O are vectors xt ∈ X and

the parametric representation of the observation probabilities is a mixture of

Gaussian distributions

B = {bj(xt)},   1 ≤ j ≤ N, 1 ≤ t ≤ T

bj(xt) = P(xt|st = j, λ) = Σ_{k=1}^{K} wjk N(xt, µjk, Σjk),   ∫_X bj(xt) dxt = 1    (2.16)

where wjk is the kth mixture weight in state j satisfying Σ_{k=1}^{K} wjk = 1 and N(xt, µjk, Σjk) is the kth Gaussian component density in state j with mean vector µjk and covariance matrix Σjk (see Section 2.3.5 for detail).

Other variants have been proposed, such as factorial HMMs [Ghahramani 1997],

tied-mixture continuous HMMs [Bellegarda and Nahamoo 1990] and semi-continuous

HMMs [Huang and Jack 1989].

Three Basic Problems for HMMs

There are three basic problems to be solved for HMMs. The parameter estimation

problem is to train speech and speaker models, the evaluation problem is to compute

likelihood functions for recognition and the decoding problem is to determine the best

fitting (unobservable) state sequence [Rabiner and Juang 1993, Huang et al. 1990].

The parameter estimation problem: This problem determines the optimal

model parameters λ of the HMM according to given optimisation criterion. A vari-

ant of the EM algorithm, known as the Baum-Welch algorithm, yields an iterative

procedure to reestimate the model parameters λ using the ML criterion [Baum 1972,

Baum and Sell 1968, Baum and Eagon 1967]. In the Baum-Welch algorithm, the un-

observable data are the state sequence S and the observable data are the observation

sequence O. From (2.12), the Q-function for the HMM is as follows

Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)    (2.17)

Computing P(O, S|λ̄) [Rabiner and Juang 1993, Huang et al. 1990], we obtain

Q(λ, λ̄) = Σ_{t=0}^{T−1} Σ_{st} Σ_{st+1} P(st, st+1|O, λ) log[astst+1 bst+1(ot+1)]    (2.18)


where πs1 is denoted by as0s1 for simplicity. Regrouping (2.18) into three terms for

the π, A, B coefficients, and applying Lagrange multipliers, we obtain the HMM

parameter estimation equations

• For discrete HMM:

πi = γ1(i),   aij = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i),   bj(k) = Σ_{t=1, s.t. ot=vk}^{T} γt(j) / Σ_{t=1}^{T} γt(j)    (2.19)

where

γt(i) = Σ_{j=1}^{N} ξt(i, j),

ξt(i, j) = P(st = i, st+1 = j|O, λ) = αt(i) aij bj(ot+1) βt+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bj(ot+1) βt+1(j)    (2.20)

• For continuous HMM: estimation equations for the π and A distributions are

unchanged, but the output distribution B is estimated via Gaussian mixture

parameters as represented in (2.16)

wjk = Σ_{t=1}^{T} ηt(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{K} ηt(j, k),   µjk = Σ_{t=1}^{T} ηt(j, k) xt / Σ_{t=1}^{T} ηt(j, k),

Σjk = Σ_{t=1}^{T} ηt(j, k)(xt − µjk)(xt − µjk)′ / Σ_{t=1}^{T} ηt(j, k)    (2.21)

where

ηt(j, k) = [αt(j) βt(j) / Σ_{j=1}^{N} αt(j) βt(j)] × [wjk N(xt, µjk, Σjk) / Σ_{k=1}^{K} wjk N(xt, µjk, Σjk)]    (2.22)

Note that for practical implementation, a scaling procedure [Rabiner and Juang 1993]

is required to avoid number underflow on computers with ordinary floating-point

number representations.
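Purely for illustration, a single unscaled Baum-Welch reestimation pass for a discrete HMM on one observation sequence might be sketched in Python as follows. The forward and backward variables αt(i) and βt(i) are those defined in (2.24) and (2.25) below, the symbols are assumed to be 0-based indices, and the scaling safeguards are omitted.

    import numpy as np

    def reestimate_dhmm(A, B, alpha, beta, obs):
        """One Baum-Welch reestimation pass for a discrete HMM (single sequence).
        A: (N, N) transition matrix, B: (N, K) observation matrix,
        alpha, beta: (T, N) forward/backward variables, obs: length-T symbol indices."""
        T, N = alpha.shape
        K = B.shape[1]
        obs = np.asarray(obs)

        # xi_t(i, j) and gamma_t(i) as in (2.20)
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
            xi[t] = num / num.sum()
        gamma = xi.sum(axis=2)                      # gamma_t(i) for t = 1, ..., T-1

        # Reestimation formulas (2.19)
        pi_new = gamma[0]
        A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
        gamma_full = alpha * beta / np.sum(alpha * beta, axis=1, keepdims=True)
        B_new = np.zeros((N, K))
        for k in range(K):
            B_new[:, k] = gamma_full[obs == k].sum(axis=0)
        B_new /= gamma_full.sum(axis=0)[:, None]
        return pi_new, A_new, B_new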


The evaluation problem: How can we efficiently compute P (O|λ), the proba-

bility that the observation sequence O was produced by the model λ?

For solving this problem, we obtain

P(O|λ) = Σ_{all S} P(O, S|λ) = Σ_{s1,s2,...,sT} πs1 bs1(o1) as1s2 bs2(o2) · · · asT−1sT bsT(oT)    (2.23)

An interpretation of the computation in (2.23) is the following. At time t = 1, we are

in state s1 with probability πs1 , and generate the symbol o1 with probability bs1(o1).

A transition is made from state s1 at time t = 1 to state s2 at time t = 2 with

probability as1s2 and we generate a symbol o2 with probability bs2(o2). This process

continues in this manner until the last transition at time T from state sT−1 to state sT

is made with probability asT−1sTand we generate symbol oT with probability bsT

(oT ).

Figure 2.3 shows an N -state left-to-right HMM with Δi in (2.14) set to 1.

Figure 2.3: An N -state left-to-right HMM with Δi = 1

To reduce computations, the forward and the backward variables are used. The

forward variable αt(i) is defined as

αt(i) = P (o1o2 . . . ot, st = i|λ), which can be computed iteratively as

α1(i) = πi bi(o1),   1 ≤ i ≤ N

and   αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(ot+1),   1 ≤ j ≤ N, 1 ≤ t ≤ T − 1    (2.24)


and the backward variable βt(i) is defined as

βt(i) = P (ot+1ot+2 . . . oT |st = i, λ), which can be computed iteratively as

βT(i) = 1,   1 ≤ i ≤ N

and   βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),   1 ≤ i ≤ N, t = T − 1, . . . , 1    (2.25)

Using these variables, the probability P(O|λ) can be computed using the forward variable, the backward variable, or both the forward and backward variables, as follows

P(O|λ) = Σ_{i=1}^{N} αT(i) = Σ_{i=1}^{N} πi bi(o1) β1(i) = Σ_{i=1}^{N} αt(i) βt(i)    (2.26)
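A direct, unscaled transcription of the recursions (2.24) and (2.25) and of the evaluation (2.26) into Python might read as follows (an illustrative sketch only; in practice the scaling procedure noted above would be applied).

    import numpy as np

    def forward(pi, A, B, obs):
        """Forward variables alpha_t(i) of (2.24); returns a (T, N) array and P(O|lambda)."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                     # alpha_1(i) = pi_i b_i(o_1)
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
        return alpha, alpha[-1].sum()                    # P(O|lambda) = sum_i alpha_T(i)

    def backward(pi, A, B, obs):
        """Backward variables beta_t(i) of (2.25)."""
        T, N = len(obs), len(pi)
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                   # beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta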

The decoding problem: Given the observation sequence O and the model λ,

how do we choose a corresponding state sequence S that is optimal in some sense?

This problem attempts to uncover the hidden part of the model. There are several

possible ways to solve this problem, but the most widely used criterion is to find

the single best state sequence that can be implemented by the Viterbi algorithm. In

practice, it is preferable to base recognition on the maximum likelihood state sequence

since this generalises easily to the continuous speech case. This likelihood is computed

using the same algorithm as the forward algorithm except that the summation is

replaced by a maximum operation.
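As an illustrative sketch, this Viterbi search can be written in the log domain, which also avoids the numerical underflow mentioned earlier; zero transition probabilities simply become log-scores of minus infinity. The function and variable names below are assumed.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Single best state sequence for a discrete HMM, computed in the log domain."""
        T, N = len(obs), len(pi)
        with np.errstate(divide='ignore'):               # log(0) -> -inf is acceptable here
            log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
        delta = np.zeros((T, N))            # best log score ending in state i at time t
        psi = np.zeros((T, N), dtype=int)   # back-pointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A       # entry (i, j): best path into j via i
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Backtrack along the stored pointers
        states = np.zeros(T, dtype=int)
        states[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            states[t] = psi[t + 1][states[t + 1]]
        return states, delta[-1].max()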

2.3.5 Gaussian Mixture Modelling

Gaussian mixture models (GMMs) are effective models capable of achieving high

recognition accuracy for speaker recognition. As discussed above, HMMs can ad-

equately characterise both the temporal and spectral varying nature of the speech

signal [Rabiner and Juang 1993], however for speaker recognition, the temporal in-

formation has been used effectively only in text-dependent mode. In text-independent

mode, there are no constraints on the training and test text and this temporal informa-

tion has not been shown to be useful [Reynolds 1992]. On the other hand, the perfor-

mance of text-independent speaker identification depends mostly on the total number

of mixture components (number of states times number of mixture components as-

signed to each state) per speaker model [Matsui and Furui 1993, Matsui and Furui 1992,

Reynolds 1992]. Therefore, it can be seen that the N-state M-mixture continuous


ergodic HMM is roughly equivalent to the NM-mixture GMM in text-independent

speaker recognition applications. In this case, the number of states does not play

an important role, and hence for simplicity, the 1-state HMM, i.e. the GMM is cur-

rently used for text-independent speaker recognition [Furui 1994]. Although we can

get equations for the GMM from the continuous HMM with the number of states

N = 1, for practical applications, they are summarised as follows.

The parameter estimation problem: The Q-function for the GMM is of the

form [Huang et al. 1990]

Q(λ, λ̄) = Σ_{i=1}^{K} Σ_{t=1}^{T} P(i|xt, λ) log P(xt, i|λ̄) = Σ_{i=1}^{K} Σ_{t=1}^{T} P(i|xt, λ) log[wi N(xt, µi, Σi)]    (2.27)

where P (i|xt, λ) is the a posteriori probability for the ith mixture, i = 1, . . . , K and

satisfies

P(i|xt, λ) = P(xt, i|λ) / Σ_{k=1}^{K} P(xt, k|λ) = wi N(xt, µi, Σi) / Σ_{k=1}^{K} wk N(xt, µk, Σk)    (2.28)

λ = {w, µ, Σ} denotes a set of model parameters, where w = {wi}, µ = {µi}, Σ = {Σi}, i = 1, . . . , K, wi are mixture weights satisfying Σ_{i=1}^{K} wi = 1, and N(xt, µi, Σi) are the d-variate Gaussian component densities with mean vectors µi and covariance matrices Σi

N(xt, µi, Σi) = (2π)^{−d/2} |Σi|^{−1/2} exp{ −(1/2)(xt − µi)′ Σi^{−1} (xt − µi) }    (2.29)

where (xt − µi)′ is the transpose of (xt − µi), Σi^{−1} is the inverse of Σi, and |Σi| is the determinant of Σi.

Setting derivatives of the Q-function with respect to λ to zero, the following

reestimation formulas are found [Huang et al. 1990, Reynolds 1995b]

wi = (1/T) Σ_{t=1}^{T} P(i|xt, λ),   µi = Σ_{t=1}^{T} P(i|xt, λ) xt / Σ_{t=1}^{T} P(i|xt, λ),

Σi = Σ_{t=1}^{T} P(i|xt, λ)(xt − µi)(xt − µi)′ / Σ_{t=1}^{T} P(i|xt, λ)    (2.30)

The evaluation problem: For a training vector sequence X = (x1x2 . . .xT ),


the likelihood of the GMM is

log P(X|λ) = Σ_{t=1}^{T} log P(xt|λ) = Σ_{t=1}^{T} log Σ_{i=1}^{K} wi N(xt, µi, Σi)    (2.31)
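For illustration, one EM iteration for a GMM, combining the a posteriori probabilities of (2.28), the reestimation formulas of (2.30) and the log-likelihood of (2.31), might be sketched in Python as follows. This is a simplified sketch without the variance flooring or minimum-weight safeguards usually applied in practice.

    import numpy as np

    def gaussian_pdf(X, mu, Sigma):
        """d-variate Gaussian density (2.29) evaluated at each row of X."""
        d = X.shape[1]
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        expo = -0.5 * np.sum(diff @ inv * diff, axis=1)
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
        return norm * np.exp(expo)

    def em_step(X, w, mu, Sigma):
        """One EM iteration for a K-component GMM; X has shape (T, d)."""
        T, K = X.shape[0], len(w)
        dens = np.column_stack([w[i] * gaussian_pdf(X, mu[i], Sigma[i]) for i in range(K)])
        post = dens / dens.sum(axis=1, keepdims=True)     # P(i | x_t, lambda), eq. (2.28)
        log_lik = np.sum(np.log(dens.sum(axis=1)))        # log P(X | lambda), eq. (2.31)

        # Reestimation formulas (2.30)
        Nk = post.sum(axis=0)
        w_new = Nk / T
        mu_new = (post.T @ X) / Nk[:, None]
        Sigma_new = []
        for i in range(K):
            diff = X - mu_new[i]
            Sigma_new.append((post[:, i][:, None] * diff).T @ diff / Nk[i])
        return w_new, mu_new, np.array(Sigma_new), log_lik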

2.3.6 Vector Quantisation Modelling

Vector quantisation (VQ) is a data reduction method, which is used to convert

a feature vector set into a small set of distinct vectors using a clustering tech-

nique. Advantages of this reduction are reduced storage, reduced computation,

and efficient representation of speech sounds [Furui 1996, Bellegarda 1996]. The

distinct vectors are called codevectors and the set of codevectors that best rep-

resents the training vector set is called the codebook. The VQ codebook can be

used as a speech or speaker model and a good recognition performance can be

obtained in many cases [Rabiner et al. 1983, Soong et al. 1987, Tseng et al. 1987,

Tsuboka and Nakahashi 1994, Bellegarda 1996]. Since there is only a finite number

of code vectors, the process of choosing the best representation of a given feature

vector is equivalent to quantising the vector and leads to a certain level of quantisa-

tion error. This error decreases as the size of the codebook increases, however the

storage required for a large codebook is nontrivial. The key point of VQ modelling

is to derive an optimal codebook which is commonly achieved by using the hard C-

means (K-means) algorithm reviewed in Section 2.4.5. A variant of this algorithm is

the LBG algorithm [Linde et al. 1980], which is widely used in speech and speaker

recognition.

The difference between the GMM and VQ is the change from a “soft” mapping in

the GMM to a “hard” mapping in VQ of feature vectors into clusters [Chou et al. 1989].

In the GMM, we obtain

P(xt|λ) = Σ_{i=1}^{K} wi N(xt, µi, Σi)    (2.32)

It means that vector xt can belong to K clusters (soft mapping) represented by K

Gaussian distributions. The degree of belonging of xt to the ith cluster is represented

by the probability P (i|xt, λ) and is determined as in (2.28). For the GMM, we obtain

0 ≤ P (i|xt, λ) ≤ 1. For VQ, vector xt is or is not in the ith cluster and the probability


P (i|xt, λ) is determined as follows [Chou et al. 1989, Duda and Hart 1973]

P(i|xt, λ) = 1 if dit < djt ∀ j ≠ i, and 0 otherwise    (2.33)

where ties are broken randomly and dit denotes the distance from vector xt to the ith cluster. If a particular distance is defined and (2.33) is substituted into (2.30), variants of VQ are determined as follows (an illustrative sketch of the conventional case is given after the list)

• Conventional VQ: the Euclidean distance dit² = (xt − µi)² is used and

µi = (1/Ti) Σ_{xt ∈ cluster i} xt    (2.34)

where Ti is the number of vectors in the ith cluster, and Σ_{i=1}^{K} Ti = T.

• Extended VQ: the Mahalanobis distance dit² = (xt − µi)′ Σi^{−1} (xt − µi) is used and

µi = (1/Ti) Σ_{xt ∈ cluster i} xt,   Σi = (1/Ti) Σ_{xt ∈ cluster i} (xt − µi)(xt − µi)′    (2.35)

• Entropy-Constrained VQ: dit² = (xt − µi)′ Σ^{−1} (xt − µi) − 2 log wi, assuming that Σi = Σ ∀ i, with Σ fixed [Chou et al. 1989], and

µi = (1/Ti) Σ_{xt ∈ cluster i} xt,   wi = Ti / T    (2.36)
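An illustrative sketch of the conventional case, i.e. the hard assignment of (2.33) followed by the centroid update of (2.34), is given below; the LBG splitting strategy commonly used to grow the codebook is omitted, and the variable names are assumed.

    import numpy as np

    def train_vq_codebook(X, K, n_iter=20, seed=0):
        """Conventional VQ codebook training: hard assignment (2.33) and centroid update (2.34)."""
        rng = np.random.default_rng(seed)
        codebook = X[rng.choice(len(X), K, replace=False)].astype(float)   # initial codevectors
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            # Squared Euclidean distance d_it^2 from each vector to each codevector
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                    # hard mapping (2.33)
            for i in range(K):
                if np.any(labels == i):                   # centroid update (2.34)
                    codebook[i] = X[labels == i].mean(axis=0)
        return codebook, labels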

The relationships between the statistical modelling techniques are summarised in

Figure 2.4.

2.3.7 Summary

The statistical classifier based on the Bayes decision theory has been reviewed in this

section. The classifier design problem is to achieve the minimum recognition error

rate, which is performed by the MAP decision rule. Since the a posteriori probabilities

are not known in advance, the problem becomes a distribution estimation problem.

This is solved by determining the a priori probability and the likelihood function. As

discussed in Section 2.3.2, the former is derived from the language models in speech

recognition. In speaker identification, this is often simplified by assuming an equal


[Figure 2.4 relates the modelling techniques: the discrete HMM with λ = {π, A, B} and the continuous HMM with λ = {π, A, B = {w, µ, Σ}}; the GMM (1-state HMM) with λ = {w, µ, Σ}; and, changing from soft to hard mapping, the ECVQ with λ = {µ, w}, the extended VQ with λ = {µ, Σ} and the conventional VQ with λ = {µ}.]

Figure 2.4: Relationships between HMM, GMM, and VQ techniques

[Figure 2.5 outlines the classifier: test speech S and training speech {S(1), . . . , S(M)} undergo speech analysis to give X and {X(1), . . . , X(M)}; a modelling technique (HMM, GMM or VQ) is selected and its parameters are estimated to give the models {λ1, . . . , λM}; the likelihoods {P(X|λi)} are computed and combined with the prior knowledge {P(Ci)}, 1 ≤ i ≤ M, through the Bayes theory to give {P(Ci|X)}; the MAP rule then outputs the recognised class (word, speaker) Ci∗ = arg max1≤i≤M P(Ci|X).]

Figure 2.5: A statistical classifier for isolated word recognition and speaker identification


a priori probability for all speakers. The latter is derived from the acoustic models,

which, in order to be practically implementable, are usually parameterised. Now

we need to solve the model parameter estimation problem. Model parameters are

determined such that the likelihood function is maximised. This is performed by the

EM algorithm. The HMM is the most effective model for solving this problem. In

the continuous case, the one-state HMM is identical to the GMM. If a hard mapping

of a vector to the Gaussian distribution is applied rather than a soft mapping in the

GMM, we obtain the VQ model and its variants. The block diagram in Figure 2.5

illustrates what we have summarised.

2.4 Fuzzy Clustering Techniques

This section begins with a brief review of fuzzy set theory and the membership func-

tion. The membership estimation problem is then mentioned. The role of cluster

analysis in pattern recognition as well as hard C-means, fuzzy C-means, noise clus-

tering, and possibilistic C-means clustering techniques are then reviewed.

2.4.1 Fuzzy Sets and the Membership Function

Fuzzy set theory was introduced in 1965 by Lotfi Zadeh to represent and manipu-

late data and information that possess nonstatistical uncertainty. Fuzzy set theory

[Zadeh 1965] is a generalisation of conventional set theory that was introduced as a

new way to represent the vagueness or imprecision that is ever present in our daily

experience as well as in natural language [Bezdek 1993].

Let X be a feature vector space. A set A is called a crisp set if every feature vector x in X either is in A (x ∈ A) or is not in A (x ∉ A). A set B is called a fuzzy set in X if it is characterised by a membership function uB(x), taking values in the interval [0, 1] and representing the “degree of membership” of x in B. With the ordinary set A, the membership value can take on only two values 0 and 1, with uA(x) = 1 if x ∈ A or uA(x) = 0 if x ∉ A [Zadeh 1965]. With the fuzzy set B, the

membership function uB(x) can take any value between 0 and 1. The membership

function is the basic idea in fuzzy set theory.


2.4.2 Maximum Membership Rule

A membership function uCi(x) can represent the degree to which an observation x

belongs to a class Ci. In order to correctly classify an unknown observation x into

one of the classes Ci, i = 1, 2, . . . ,M , the following maximum membership decision

rule can be used

C(x) = Ci if uCi(x) = max_{1≤j≤M} uCj(x)    (2.37)

Therefore in order to achieve the best classification, we have to decide on class Ci, if

the membership function uCi(x) is maximum [Keller et al. 1985].

2.4.3 Membership Estimation Problem

For the implementation of the maximum membership rule, the required knowledge for

an optimal classification decision is that of the membership functions. These functions

are not known in advance and have to be estimated from a training set of observations

with known class labels. The estimation procedure in fuzzy pattern recognition is

called the abstraction and the use of estimates to compute the membership values for

unknown observations not contained in the training set is called the generalisation

procedure [Bellman et al. 1966].

An estimate of the membership function is referred to as an abstracting function.

To generate a “good” abstracting function from the knowledge of its values over

a finite set of observations, we need some a priori information about the class of

functions to which the abstracting function belongs, such that this information in

combination with observations from X is sufficient for estimating. This approach

involves choosing a family of abstracting functions and finding a member of this

family which fits “best”, in some specified sense, the given observation sequence X.

In most practical situations, the a priori information about the membership function

of a fuzzy class is insufficient to generate an abstracting function, which is “optimal”

in a meaningful sense.

2.4.4 Pattern Recognition and Cluster Analysis

Pattern recognition can be characterised as “a field concerned with machine recogni-

tion of meaningful regularities in noisy or complex environments” [Duda and Hart 1973].


A workable definition for pattern recognition is “the search for structure in data”

[Bezdek 1981]. Three main issues of the search for structure in data are: feature

selection, cluster analysis, and classification. Feature selection is the search for struc-

ture in data items, or observations xt ∈ X. The feature space X may be compressed

by eliminating redundant and unimportant features via selection or transformation.

Cluster analysis is the search for structure in data sets, or sequences X ∈ X. Since

“optimal” features are not known in advance, we often attempt to discover these by

clustering the feature variables. Finally, classification is the search for structure in

data spaces X. A pattern classifier designed for X is a device or means whereby X

itself is partitioned into “decision regions” [Bezdek 1981].

Clustering is the grouping of similar objects [Hartigan 1975]. Clustering in the

given unlabeled data X is to assign to feature vectors labels that identify “natural

subgroups” in X [Bezdek 1993]. In other words, clustering known as unsupervised

learning in X is a partitioning of X into C subsets or C clusters. The most impor-

tant requirement is to find a suitable measure of clusters, referred to as a clustering

criterion. Objective function methods allow the most precise formulation of the clus-

tering criterion. To construct an objective function, a similarity measure is required.

A standard way of expressing similarity is through a set of distances between pairs

of feature vectors. Optimising the objective function is performed to find optimal

partitions of data. The partitions generated by a clustering method define for all

data elements to which cluster they belong. The boundaries of partitions are sharp

in the hard clustering method or vague in the fuzzy clustering method. Each feature

vector of a fuzzy partition belongs to different clusters with different membership

values. Cluster validity is an important issue, which deals with the significance of the

structure imposed by a clustering method. It is required in order to determine an

optimal partition in the sense that it best explains the unknown structure in X.

2.4.5 Hard C-Means Clustering

Let U = [uit] be a matrix whose elements are memberships of xt in the ith cluster,

i = 1, . . . , C, t = 1, . . . , T . Hard C-partition space for X is the set of matrices U such


that [Bezdek 1993]

uit ∈ {0, 1} ∀ i, t,   Σ_{i=1}^{C} uit = 1 ∀ t,   0 < Σ_{t=1}^{T} uit < T ∀ i    (2.38)

where uit = ui(xt) is 1 or 0, according to whether xt is or is not in the ith cluster, Σ_{i=1}^{C} uit = 1 ∀ t means each xt is in exactly one of the C clusters, and 0 < Σ_{t=1}^{T} uit < T ∀ i means that no cluster is empty and no cluster is all of X because 2 ≤ C < T.

The HCM method [Duda and Hart 1973] is based on minimisation of the sum-of-

squared-errors function as follows

J(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit dit²    (2.39)

where U = {uit} is a hard C-partition of X, λ is a set of prototypes, in the simplest

case, it is the set of cluster centers: λ = {µ}, µ = {µi}, i = 1, . . . , C, and dit is the

distance in the A norm (A is any positive definite matrix) from xt to µi, known as a

measure of dissimilarity

dit² = ||xt − µi||²_A = (xt − µi)′ A (xt − µi)    (2.40)

Minimising the hard objective function J(U, λ; X) in (2.39) gives

uit = 1 if dit < djt for j = 1, . . . , C, j ≠ i, and 0 otherwise    (2.41)

µi = Σ_{t=1}^{T} uit xt / Σ_{t=1}^{T} uit    (2.42)

where ties are broken randomly.

2.4.6 Fuzzy C-Means Clustering

The fuzzy C-means (FCM) method is the most widely used approach in both theory

and practical applications of fuzzy clustering techniques to unsupervised classification

[Zadeh 1977]. It is an extension of the hard C-means method that was first introduced

by Dunn [1974]. A weighting exponent m on each fuzzy membership called the degree

of fuzziness was introduced in the FCM method [Bezdek 1981] and hence a general

estimation procedure for the FCM has been established and its convergence has been

shown [Bezdek 1990, Bezdek and Pal 1992].


Fuzzy C-Means Algorithm

Let U = [uit] be a matrix whose elements are memberships of xt in cluster i, i =

1, . . . , C, t = 1, . . . , T . Fuzzy C-partition space for X is the set of matrices U such

that [Bezdek 1993]

0 ≤ uit ≤ 1 ∀ i, t,   Σ_{i=1}^{C} uit = 1 ∀ t,   0 < Σ_{t=1}^{T} uit < T ∀ i    (2.43)

where 0 ≤ uit ≤ 1 ∀i, t means it is possible for each xt to have an arbitrary distribution

of membership among the C fuzzy clusters.

The FCM method is based on minimisation of the fuzzy squared-error function as

follows [Bezdek 1981]

Jm(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit^m dit²    (2.44)

where U = {uit} is a fuzzy C-partition of X, m > 1 is a weighting exponent on each fuzzy membership uit and is called the degree of fuzziness, and λ and dit are defined as in (2.39). The basic idea of the FCM method is to minimise Jm(U, λ; X) over the variables U and λ on the assumption that matrices U that are part of optimal pairs for Jm(U, λ; X) identify good partitions of the data. Minimising the fuzzy objective function Jm(U, λ; X) in (2.44) gives

uit = 1 / Σ_{k=1}^{C} (dit²/dkt²)^{1/(m−1)}    (2.45)

µi = Σ_{t=1}^{T} uit^m xt / Σ_{t=1}^{T} uit^m    (2.46)
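The resulting alternating optimisation can be sketched in Python as follows (illustrative only, with the Euclidean distance and assumed variable names; a small constant guards against division by zero when a vector coincides with a cluster centre).

    import numpy as np

    def fcm(X, C, m=2.0, n_iter=50, seed=0, eps=1e-12):
        """Fuzzy C-means: membership update (2.45) and centre update (2.46)."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), C, replace=False)].astype(float)
        U = np.zeros((len(X), C))
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) + eps
            # u_it = 1 / sum_k (d_it^2 / d_kt^2)^(1/(m-1))   -- eq. (2.45)
            ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
            U = 1.0 / ratio.sum(axis=2)
            Um = U ** m
            # mu_i = sum_t u_it^m x_t / sum_t u_it^m          -- eq. (2.46)
            centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        return U, centres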

Gustafson-Kessel Algorithm

An interesting modification of the FCM has been proposed by Gustafson and Kessel

[1979]. It attempts to recognise the fact that different clusters in the same data set

X may have differing geometric shapes. A generalisation to a metric that appears

more natural was made through the use of a fuzzy covariance matrix. Replacing the

distance in (2.40) by an inner-product-induced norm of the form

dit² = (xt − µi)′ Mi (xt − µi)    (2.47)


where the Mi, i = 1, . . . , C are symmetric and positive definite and subject to the

following constraints |Mi| = ρi, with ρi > 0 and fixed for each i. Define a fuzzy

covariance matrix Σi by

Σi = Σ_{t=1}^{T} uit^m (xt − µi)(xt − µi)′ / Σ_{t=1}^{T} uit^m    (2.48)

then we have Mi^{−1} = (|Mi| |Σi|)^{−1/d} Σi, i = 1, . . . , C, where |Mi| and |Σi| are the determinants of Mi and Σi, respectively, and d is the vector space dimension.

The parameter set in this algorithm is λ = {µ, Σ}, where µ = {µi} and Σ = {Σi}, i = 1, . . . , C, are computed by (2.46) and (2.48), respectively.

Gath-Geva Algorithm

The algorithm proposed by Gath and Geva [1989] is an extension of the Gustafson-

Kessel algorithm that also takes the size and density of the clusters into account. The

distance is chosen to be inversely proportional to the probability P(xt, i|λ)

dit² = 1 / P(xt, i|λ) = 1 / [wi N(xt, µi, Σi)]    (2.49)

where the Gaussian distribution N(xt, µi,Σi) is defined in (2.29). The parameter

set in this algorithm is λ = {w, µ, Σ}, where µ = {µi} and Σ = {Σi} are computed as in the Gustafson-Kessel algorithm, and w = {wi} are mixture weights computed as follows

wi = Σ_{t=1}^{T} uit^m / Σ_{t=1}^{T} Σ_{i=1}^{C} uit^m    (2.50)

In contrast to the FCM and the Gustafson-Kessel algorithms, the Gath-Geva al-

gorithm is not based on an objective function, but is a fuzzification of statistical estimators. If we were to apply to the Gath-Geva algorithm the same technique as

for the FCM and the Gustafson-Kessel algorithms, i.e. minimising the least-squares

function in (2.44), the resulting system of equations could not be solved analytically.

In this sense, the Gath-Geva algorithm is a good heuristic on the basis of an analogy

with probability theory [Hoppner et al. 1999].

2.4.7 Noise Clustering

Both HCM and FCM clustering methods have a common disadvantage in the problem

of sensitivity to outliers. As can be seen from (2.38) and (2.43), the memberships are


relative numbers. The sum of the memberships of a feature vector xt across classes is

always equal to one both for clean data and for noisy data, i.e. data “contaminated”

by erroneous points or “outliers”. It would be more reasonable that, if the feature

vector xt comes from noisy data or outliers, the memberships should be as small as

possible for all classes and the sum should be smaller than one. This property is

important since all parameter estimates are computed based on these memberships.

An idea of a noise cluster has been proposed by Dave [1991] to deal with noisy data

or outliers for fuzzy clustering methods.

The noise is considered to be a separate class and is represented by a prototype—

a parameter subset characterising a cluster—that has a constant distance δ from all

feature vectors. The membership u•t of a vector xt in the noise cluster is defined to

be

u•t = 1 − Σ_{i=1}^{C} uit,   t = 1, . . . , T    (2.51)

Therefore, the membership constraint for the “good” clusters is effectively relaxed to

Σ_{i=1}^{C} uit < 1,   t = 1, . . . , T    (2.52)

This allows noisy data and outliers to have arbitrarily small membership values in

good clusters. The objective function in the noise clustering (NC) approach is as

follows

Jm(U, λ; X) = Σ_{i=1}^{C} Σ_{t=1}^{T} uit^m dit² + Σ_{t=1}^{T} δ² (1 − Σ_{i=1}^{C} uit)^m    (2.53)

where U = {uit} is a noise-clustering C-partition of X and m > 1. Since the sec-

ond term in (2.53) is independent of the parameter set λ and the distance measure,

parameters are estimated by minimising the first term—the squared-errors function

in FCM clustering (see 2.44), with respect to λ. Therefore (2.46) and (2.50) still

apply to this approach for parameter estimation. Minimising the objective function

Jm(U, λ; X) in (2.53) with respect to uit gives

uit = 1 / [ Σ_{k=1}^{C} (dit²/dkt²)^{1/(m−1)} + (dit²/δ²)^{1/(m−1)} ]    (2.54)

The second term in the denominator of (2.54) becomes quite large for outliers, result-

ing in small membership values in all the good clusters for outliers. The advantage of


this approach is that it forms a more robust version of the FCM algorithm and can

be used instead of the FCM algorithm provided a suitable value for constant distance

δ can be found [Dave and Krishnapuram 1997].
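Relative to the FCM update (2.45), the only change is the additional term in the denominator of (2.54). An illustrative sketch of the modified membership computation, with δ treated as a user-chosen constant and assumed variable names, is given below.

    import numpy as np

    def nc_memberships(d2, delta, m=2.0):
        """Noise-clustering memberships u_it of (2.54).
        d2: (T, C) array of squared distances d_it^2; delta: constant noise distance."""
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        noise = (d2 / delta ** 2) ** (1.0 / (m - 1.0))
        U = 1.0 / (ratio.sum(axis=2) + noise)
        u_noise = 1.0 - U.sum(axis=1)        # membership in the noise cluster, eq. (2.51)
        return U, u_noise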

2.4.8 Summary

Fuzzy set theory, membership functions and clustering techniques have been reviewed

in this section. Figure 2.6 illustrates the clustering techniques and shows the con-

straints on memberships for each technique. Since the Gath-Geva technique is not

based on an objective function as discussed above, corresponding versions of the

Gath-Geva for the NC and the PCM techniques have not been proposed. Extended

versions of the HCM are also extensions of VQ, which have been reviewed in Section

2.3.6.

[Figure 2.6 arranges the clustering techniques by their membership constraints: all satisfy 0 ≤ uit ≤ 1 and 0 < Σ_{t=1}^{T} uit < T; with the parameter set λ = {µ}, HCM requires uit ∈ {0, 1}, FCM requires Σ_{i=1}^{C} uit = 1 and NC requires Σ_{i=1}^{C} uit < 1; the extended versions are the Gustafson-Kessel algorithm and extended NC with λ = {µ, Σ}, and the Gath-Geva algorithm with λ = {µ, Σ, w}.]

Figure 2.6: Clustering techniques and their extended versions

The fuzzy membership of a feature vector in a cluster depends not only on where

the feature vector is located with respect to the cluster, but also on how far away

it is with respect to other clusters. Therefore fuzzy memberships are spread across


the classes and depend on the number of clusters present. FCM clustering has been

shown to be advantageous over hard clustering. It has become more attractive with

the connection to neural networks [Kosko 1992]. Recent advances in fuzzy clustering

have shown spectacular ability to detect not only hypervolume clusters, but also

clusters which are actually “thin shells” such as curves and surfaces [Dave 1990,

Krishnapuram et al. 1992].

However, the FCM membership values are relative numbers and thus cannot dis-

tinguish between feature vectors and outliers. It has been shown that the NC ap-

proach is quite successful in improving the robustness of a variety of fuzzy cluster-

ing algorithms. A robust-statistical foundation for the NC method was established

by Dave and Krishnapuram [1997]. Another approach is the possibilistic C-means

method, which is presented in Section 8.2.1 as a further extension of fuzzy set theory-

based clustering techniques.

2.5 Fuzzy Approaches in the Literature

This section presents some fuzzy approaches to speech and speaker recognition in

the literature. The first approach is to apply the maximum membership decision rule

in Section 2.4.2. The second is the use of the FCM algorithm instead of the HCM

(K-means) algorithm in coding a cepstral vector sequence X for the discrete HMM.

The third approach is to apply fuzzy rules in hybrid neuro-fuzzy systems. The last

approach is not reviewed in this section since it is outside the scope of this thesis. Works relating to this approach can be found in Kasabov [1998] and Kasabov et al.

[1999].

2.5.1 Maximum Membership Rule-Based Approach

An early application based on fuzzy set theory for decision making was proposed

[Pal and Majumder 1977]. Recognition of vowels and identifying speakers using the

first three formants (F1, F2, and F3) were implemented by using the membership

function ui(x) associated with an unknown vector x = {x1, . . . , xn} for each model


λi, i = 1, . . . ,M as follows

ui(x) = 1 / (1 + [d(x, λi)/E]^F)    (2.55)

where E is an arbitrary positive constant, F is any integer, and d(x, λi) is the weighted

distance from vector x to the nearest prototype of model λi. Prototype points chosen

are the average of the coordinate values corresponding to the entire set of samples in

a particular class. Experiments were carried out on a set of Telugu (one of the major

Indian languages) words containing about 900 commonly used speech units for 10

vowels and uttered by three male informants in the age group of 28-30 years. Overall

recognition is about 82%.

An alternative application, the use of fuzzy algorithms for assigning phonetic and phonemic labels to speech segments, was presented in [De Mori and Laface 1980].

A method consisting of fuzzy restriction for extracting features, fuzzy relations for

relating these features with phonetic and phonemic interpretation, and their use for

interpretation of a speech pattern in terms of possibility theory has been described.

Experimental results showed an overall recognition of about 95% for 400 samples

pronounced by the four talkers.

2.5.2 FCM-Based Approach

This approach investigates the use of the fuzzy C-means algorithm instead of the

K-means (hard C-means) algorithm in coding a spectral vector sequence X for the

discrete HMM. This modification of the VQ is called the fuzzy VQ (FVQ).

Since the input of the discrete HMM is an observation sequence O = (o1 . . . oT )

consisting of discrete symbols, the spectral continuous vector sequence X = (x1 . . . xT )

needs to be transformed into the discrete symbol sequence O. This is normally per-

formed by a VQ source coding technique, where each vector xt is coded into a discrete

symbol vk—the index of the codevector closest to vector xt

ot = vk = arg min_{1≤i≤K} d(xt, µi)    (2.56)

and the observation probability distribution B is of the form defined in (2.15).

The FVQ uses fuzzy C-partitioning on X, each vector xt belongs to classes with

corresponding memberships, thus the FVQ maps vector xt into an observation vector


ot = (u1t, . . . , uCt), where uit is the membership of vector xt in class i and is computed

by using (2.45). For the observation probability distribution B = {bj(ot)}, authors

have proposed different computation methods. Following [Tseng et al. 1987], B is

computed as follows

\[ b_j(o_t) = \sum_{i=1}^{C} u_{it} b_{ij}, \quad 1 \le j \le N, \ 1 \le t \le T \tag{2.57} \]

where bij is reestimated by

\[ b_{ij} = \frac{\sum_{t=1}^{T-1} u_{it}\, \alpha_t(j) \beta_t(j)}{\sum_{t=1}^{T-1} \alpha_t(j) \beta_t(j)}, \quad 1 \le i \le C, \ 1 \le j \le N \tag{2.58} \]

Experiments were conducted to compare three cases: using VQ/HMM, using

FVQ/HMM for training only, and using FVQ/HMM for both training and recog-

nition, where the HMMs are 5-state left-right ones. The highest isolated-word recog-

nition rates for the three cases are 72%, 77%, and 77%, respectively, where the degree

of fuzziness is m = 1.25, 10 training utterances are used, and the vocabulary is the

E-set consisting of 9 English letters {b, c, d, e, g, p, t, v, z}.

To obtain more tractable computation, Tsuboka and Nakahashi [1994] have pro-

posed two alternative methods

1. Multiplication-type FVQ:

\[ b_j(o_t) = \prod_{i=1}^{C} b_{ij}^{u_{it}}, \quad 1 \le j \le N, \ 1 \le t \le T \tag{2.59} \]

and bij is computed as in (2.58)

2. Addition-type FVQ:

\[ b_j(o_t) = \sum_{i=1}^{C} u_{it} b_{ij}, \quad 1 \le j \le N, \ 1 \le t \le T \tag{2.60} \]

and bij is computed as

\[ b_{ij} = \sum_{t=1}^{T-1} \zeta_{ij}(t), \qquad \zeta_{ij}(t) = \frac{\alpha_t(j)\beta_t(j)}{\sum_{t=1}^{T-1} \alpha_t(j)\beta_t(j)} \times \frac{u_{it}\, b_{ij}}{\sum_{i=1}^{C} u_{it}\, b_{ij}}, \quad 1 \le i \le C, \ 1 \le j \le N \tag{2.61} \]


It was reported that for practical applications the multiplication type is more suitable than the addition type. In isolated-word recognition experiments, the number of states

of each HMM was set to be 1/5 of the average length in frames of training data. The

vocabulary is 100 city names in Japan and the degree of fuzziness is m = 2. The

highest recognition rates reached with a codebook size of 256 were 98.5% for the

multiplication type, 98.2% for the addition type, and 97.5% for the VQ/HMM.

An extension of this approach has been proposed by Chou and Oh [1996] where

a distribution normalisation dependent on the codevectors and a fuzzy contribution

based on weighting and smoothing the codevectors by distance have been applied.


Chapter 3

Fuzzy Entropy Models

This chapter proposes a new fuzzy approach to speech and speaker recognition as well

as to cluster analysis in pattern recognition. Models developed in this approach can be

called fuzzy entropy models since they are based on a basic algorithm called fuzzy entropy

clustering. The goal of this approach is not only to propose a new fuzzy method but

also to show that statistical models, such as HMMs in the maximum likelihood scheme,

can be viewed as fuzzy models, where probabilities of unobservable data given observable

data are used as fuzzy membership functions. An introduction of fuzzy entropy clustering

techniques is presented in Section 3.1. Relationships between clustering and modelling

problems are shown in Section 3.2. Section 3.3 presents an optimisation criterion proposed

as maximum fuzzy likelihood and formulates the fuzzy EM algorithm. Fuzzy entropy

models for HMMs, GMMs and VQ are presented in the next Sections 3.4, 3.5 and 3.6,

respectively. The noise clustering approach is also considered for these models. Section

3.7 presents a comparison between conventional models and fuzzy entropy models.

3.1 Fuzzy Entropy Clustering

Let us consider the following function [Tran and Wagner 2000f]

\[ H_n(U, \lambda; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} d_{it}^2 + n \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log u_{it} \tag{3.1} \]

where n > 0, λ is the model parameter set, dit is the distance between vector xt

and cluster i, and U = [uit] with uit being the membership of vector xt in cluster i.


Assuming that the matrices U satisfy the following conditions

\[ \sum_{i=1}^{C} u_{it} = 1 \ \ \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \ \ \forall i \tag{3.2} \]

which mean that each xt belongs to C clusters, no cluster is empty and no cluster is

all of X because of 2 ≤ C < T .

We wish to show that minimising the function Hn(U, λ; X) on U yields solutions

uit ∈ [0, 1], hence all constraints in (2.43) are satisfied. This means that the matrices

U determine the fuzzy C-partition space for X and Hn(U, λ; X) is a fuzzy objective

function.

The first term on the right-hand side in (3.1) is the sum-of-squared-errors function

J1(U, λ; X) defined in (2.39) for hard C-means clustering. The second term is the

negative of the following function E(U) multiplied by n

\[ E(U) = -\sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log u_{it} \tag{3.3} \]

The function E(U) is maximum if uit = 1/C ∀i, and minimum if uit = 1 or 0. On

the other hand, the function J1(U, λ; X) needs to be minimised to obtain a good

partition for X. Therefore, we can see that uit can take values in the interval [0, 1] if

the function Hn(U, λ; X) is minimised over U . Indeed, with the assumption in (3.2),

the Lagrangian H_n^*(U, λ; X) is of the form
\[ H_n^*(U, \lambda; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} d_{it}^2 + n \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log u_{it} + \sum_{t=1}^{T} k_t \Big( \sum_{i=1}^{C} u_{it} - 1 \Big) \tag{3.4} \]

H_n^*(U, λ; X) is minimised by setting its gradients with respect to U and the Lagrange multipliers {k_t} to zero
\[ d_{it}^2 + n(1 + \log u_{it}) + k_t = 0 \ \ \forall i, t, \qquad \sum_{i=1}^{C} u_{it} = 1 \ \ \forall t \tag{3.5} \]
This is equivalent to
\[ u_{it} = S_t\, e^{-d_{it}^2/n}, \qquad S_t = e^{-1 - k_t/n} \tag{3.6} \]
Using the constraint in (3.5), we can compute S_t and hence
\[ u_{it} = \frac{e^{-d_{it}^2/n}}{\sum_{k=1}^{C} e^{-d_{kt}^2/n}} \tag{3.7} \]


From (3.7), it can be seen that 0 ≤ uit ≤ 1. Therefore, the matrices U determine

a fuzzy C-partition space for X. In this case, the function E(U) is called the fuzzy

entropy function which has been considered by many authors. Clusters are considered

as fuzzy sets and the fuzzy entropy function expresses the uncertainty of determining

whether xt belongs to a given cluster or not. Measuring the degree of uncertainty

of fuzzy sets themselves was first proposed by De Luca and Termini [1972]. In other

words, the function E(U) expresses the average degree of nonmembership of members

in a fuzzy set [Li and Mukaidono 1999]. The function E(U) was also considered by

Hathaway [1986] for mixture distributions in relating the EM algorithm to clustering

techniques. For the function Hn(U, λ; X) in (3.1), the function E(U) is employed to

“pull” memberships away from values equal to 0 or 1.

Based on the above discussions, this clustering technique is called fuzzy entropy

clustering [Tran and Wagner 2000f] to distinguish it from FCM clustering that has

been reviewed in Chapter 2. In general, the task of fuzzy entropy clustering is to minimise the fuzzy objective function H_n(U, λ; X) over the variables U and λ, namely, to find a pair (Ū, λ̄) such that H_n(Ū, λ̄; X) ≤ H_n(U, λ; X). This task is implemented by an iteration of two steps: 1) Finding Ū such that H_n(Ū, λ; X) ≤ H_n(U, λ; X), and 2) Finding λ̄ such that H_n(Ū, λ̄; X) ≤ H_n(Ū, λ; X).

U is obtained by using the solution in (3.7), which can be presented in a similar

form to FCM clustering:

\[ u_{it} = \left[ \sum_{j=1}^{C} \left( e^{d_{it}^2} / e^{d_{jt}^2} \right)^{1/n} \right]^{-1} \tag{3.8} \]

Since the function E(U) in (3.1) is not dependent on dit, determining λ is per-

formed by minimising the first term, that is the function J1(U, λ; X). Thus the

parameter estimation equations are identical to those in HCM clustering.

For the Euclidean distance d_{it}^2 = (x_t − µ_i)^2, we obtain [Tran and Wagner 2000g]
\[ \mu_i = \sum_{t=1}^{T} u_{it}\, x_t \Big/ \sum_{t=1}^{T} u_{it} \tag{3.9} \]
For the Mahalanobis distance d_{it}^2 = (x_t − µ_i)' Σ_i^{-1} (x_t − µ_i), we obtain
\[ \mu_i = \sum_{t=1}^{T} u_{it}\, x_t \Big/ \sum_{t=1}^{T} u_{it}, \qquad \Sigma_i = \sum_{t=1}^{T} u_{it} (x_t - \mu_i)(x_t - \mu_i)' \Big/ \sum_{t=1}^{T} u_{it} \tag{3.10} \]
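As an illustration of how these two steps interact, the following minimal sketch (in Python with NumPy; the function name fe_clustering and all variable names are illustrative choices, not part of the cited work) alternates the FE membership update (3.7) and the Euclidean centroid update (3.9):

    import numpy as np

    def fe_clustering(X, C, n=1.0, n_iter=50, seed=0):
        """FE clustering sketch: X is a (T, d) data matrix, C the number of clusters,
        n the degree of fuzzy entropy."""
        rng = np.random.default_rng(seed)
        T, d = X.shape
        mu = X[rng.choice(T, C, replace=False)]                    # initial prototypes
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (T, C) squared Euclidean distances
            logu = -d2 / n                                         # membership update (3.7): u_it ~ exp(-d2_it / n)
            logu -= logu.max(axis=1, keepdims=True)                # subtract the row maximum for numerical stability
            U = np.exp(logu)
            U /= U.sum(axis=1, keepdims=True)
            mu = (U.T @ X) / U.sum(axis=0)[:, None]                # centroid update (3.9): membership-weighted means
        return U, mu

Decreasing n towards 0 in this sketch drives the memberships towards 0/1 values and the procedure towards HCM clustering, in line with the discussion above.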


The function Hn(U, λ; X) also has a physical interpretation. In statistical physics,

Hn(U, λ; X) is known as free energy, the first term in Hn(U, λ; X) is the expected

energy under U and the second one is the entropy of U [Jaynes 1957]. The expression

of u_it in (3.8) is of the form of the Boltzmann distribution exp(−ε/k_Bτ), a special case of the Gibbs distribution, where ε is the energy, k_B is the Boltzmann constant,

and τ is the temperature. Based on this property, we can apply a simulated annealing

method [Otten and Ginnenken 1989] to find a global minimum solution for λ (the way

that liquids freeze and crystallise in thermodynamics) by decreasing the temperature

τ , i.e. decreasing the value of n.

The degree of fuzzy entropy n determines the partition of X. As n → ∞, we

have uit → (1/C), each feature vector is equally assigned to C clusters, so we have

only a single cluster. As n → 0, uit → 0 or 1, and the function Hn(U, λ; X) ap-

proaches J1(U, λ; X), it can be said that FE clustering reduces to HCM clustering

[Tran and Wagner 2000f]. Figure 3.1 illustrates the generation of clusters with different values of n.

Figure 3.1: Generating 3 clusters with different values of n: hard clustering as n → 0, clusters increase their overlap with increasing n > 0, and are identical to a single cluster as n → ∞.

On the other hand, for data having well-separated clusters with memberships converging to the values 0 or 1, the fuzzy entropy term approaches 0 due to 1 log 1 = 0 log 0 = 0, and the FE function itself reduces to the HCM function for all n > 0.


3.2 Modelling and Clustering Problems

To apply FE clustering to statistical modelling techniques, we need to determine

relationships between the modelling and clustering problems. The first task for solving

modelling and clustering problems is to establish an optimisation criterion known as

an objective function. For modelling purposes, optimising the objective function

is to find the right parametric form of the distributions. For clustering purposes,

the optimisation is to find optimal partitions of data. Clustering is a geometric

method, where considering data involves considering shapes and locations of clusters.

In statistical modelling, data structure can be described by considering data density.

Instead of finding clusters, we find high data density areas, and thus the consideration

of data structure involves considering data distributions via the use of statistical

distribution functions. A mixture of normal distributions, or Gaussians, is effective

in the approximate description of a complicated distribution. Moreover, an advantage

of statistical modelling is that it can effectively express the temporal structure of data

through the use of a Markov process, a problem which is not addressed by clustering.

It would be useful if we could take advantage of both methods in a single ap-

proach. In order to implement this, we first define a general distance dXY for clus-

tering. It denotes a dissimilarity between observable data X and unobservable data

(cluster, state) Y as a decreasing function of the distribution of X on component Y ,

given a model λ

\[ d_{XY}^2 = -\log P(X, Y|\lambda) \tag{3.11} \]

This distance is used to relate the clustering problem to the statistical modelling

problem as well as the minimum distance rule to the maximum likelihood rule. In-

deed, since minimising this distance leads to maximising the component distribution

P (X,Y |λ), grouping similar data points into a cluster by the minimum distance rule

thus becomes grouping these into a component distribution by the maximum like-

lihood rule. Clusters are now represented by component distribution functions and

hence the characteristics of a cluster are not only its shape and location, but also the

data density in the cluster and, possibly, the temporal structure of data if the Markov

process is also applied [Tran and Wagner 2000a].


3.3 Maximum Fuzzy Likelihood Estimation

The distance defined in (3.11) is used to transform the clustering problem to a mod-

elling problem. Using this distance, we can relate the FE function in (3.1) to the

likelihood function. For example, we consider the case that feature vectors are as-

sumed to be statistically independent. Using Jensen’s inequality [Ghahramani 1995]

for the log-likelihood L(λ; X), we can show that

L(λ; X) = log P (X|λ) = logT∏

t=1

P (xt|λ) =T∑

t=1

log P (xt|λ)

=T∑

t=1

logC∑

i=1

P (xt, i|λ) =T∑

t=1

logC∑

i=1

uitP (xt, i|λ)

uit

≥T∑

t=1

C∑i=1

uit logP (xt, i|λ)

uit

≥T∑

t=1

C∑i=1

uit log P (xt, i|λ) −T∑

t=1

C∑i=1

uit log uit (3.12)

On the other hand, according to (3.11), replacing the distance

\[ d_{it}^2 = -\log P(x_t, i|\lambda) \tag{3.13} \]

into the FE function in (3.1) and into the membership in (3.8), we obtain

\[ H_n(U, \lambda; X) = -\sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log P(x_t, i|\lambda) + n \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log u_{it} \tag{3.14} \]
and
\[ u_{it} = \left\{ \sum_{j=1}^{C} \left[ P(x_t, i|\lambda) / P(x_t, j|\lambda) \right]^{1/n} \right\}^{-1} \tag{3.15} \]

From (3.12), (3.14) and (3.15) we can show that

\[ L(\lambda; X) \ge -H_1(U, \lambda; X) \tag{3.16} \]
and
\[ L(\lambda; X) = -H_1(\bar{U}, \lambda; X) \tag{3.17} \]
where according to (3.15) we obtain Ū = {ū_it}, ū_it = P(i|x_t, λ) as n = 1. The equality in (3.17) shows that, if we find λ̄ such that H_1(Ū, λ̄; X) ≤ H_1(Ū, λ; X), then we will have L(λ̄; X) ≥ L(λ; X). It means that, as n = 1, minimising the FE function in (3.1) using the distance in (3.13) leads to maximising the likelihood function. Therefore,


we can define a function Ln(U, λ; X) as follows [Tran and Wagner 2000a]

\[ L_n(U, \lambda; X) = -H_n(U, \lambda; X) = \sum_{t=1}^{T} \sum_{i=1}^{C} u_{it} \log P(x_t, i|\lambda) - n \sum_{t=1}^{T} \sum_{i=1}^{C} u_{it} \log u_{it} \tag{3.18} \]

From the above consideration, Ln(U, λ; X) can be called the fuzzy likelihood function.

Maximising this function (also minimising the FE function) is implemented by the

fuzzy EM algorithm, which is different from the standard EM algorithm in the E-step.

The fuzzy EM algorithm can be formulated as follows

Algorithm 2 (The Fuzzy EM Algorithm)

1. Initialisation: Fix n and choose an initial estimate λ
2. Fuzzy E-step: Compute Ū and L_n(Ū, λ; X)
3. M-step: Use a certain optimisation method to determine λ̄, for which L_n(Ū, λ̄; X) is maximised
4. Termination: Set λ = λ̄ and U = Ū; repeat from the E-step until the change of L_n(U, λ; X) falls below a preset threshold.

3.4 Fuzzy Entropy Hidden Markov Models

This section applies the proposed fuzzy methods to the parameter estimation

problem for the fuzzy entropy HMM (FE-HMM). The fuzzy EM algorithm can be

viewed as a generalised Baum-Welch algorithm. FE-HMMs reduce to conventional

HMMs as the degree of fuzzy entropy n = 1.

3.4.1 Fuzzy Membership Functions

Fuzzy sets in the fuzzy HMM are determined in this section to compute the matrices U

for the fuzzy EM algorithm in Section 3.3. In the conventional HMM, each observation

ot is in each of N possible states at time t with a corresponding probability. In the

fuzzy HMM, each observation ot is regarded as being in N possible states at time t

with a corresponding degree of belonging known as the fuzzy membership function.

A state at time t is thus considered as a time-dependent fuzzy set or fuzzy state st.


Fuzzy states s1 s2 . . . sT are also considered as a fuzzy state sequence S. There are

N fuzzy states at each time t = 1, . . . , T, and a total of N^T possible fuzzy state sequences in the fuzzy HMM. Figure 3.2 illustrates all fuzzy states as well as fuzzy

state sequences in the HMM.

Figure 3.2: States at each time t = 1, . . . , T are regarded as time-dependent fuzzy

sets. There are N × T fuzzy states, connected by arrows into N^T fuzzy state sequences

in the fuzzy HMM.

On the other hand, the observations are always considered in the sequence O and

related to the state sequence S. Therefore we define the fuzzy membership function of

the sequence O in fuzzy state sequence S based on the fuzzy membership function of

the observation ot in the fuzzy state st. For example, the fuzzy membership ust=i(O)

denotes the degree of belonging of the observation sequence O to fuzzy state sequences

being in fuzzy state st = i at time t, where i = 1, . . . , N . For computing the state

transition matrix A, we consider 2N fuzzy states at time t and time t + 1 included in

2N fuzzy state sequences and define the fuzzy membership function ust=i st+1=j(O).

This membership denotes the degree of belonging of the observation sequence O to

fuzzy state sequences being in fuzzy state st = i at time t and fuzzy state st+1 = j at

time t+1, where i, j = 1, . . . , N . For simplicity, this membership can be rewritten as


uijt(O) or uijt. Figure 3.3 illustrates such fuzzy state sequences in the fuzzy HMM.

Figure 3.3: The observation sequence O belongs to fuzzy state sequences being in

fuzzy state i at time t and fuzzy state j at time t + 1.

Similarly, we can determine fuzzy sets for computing the parameters {w, µ, Σ} in

the fuzzy continuous HMM, where the observation sequence O is the vector sequence

X. Fuzzy sets are fuzzy states and fuzzy mixtures at time t. The fuzzy membership

function ust=j mt=k(X), or ujkt for simplicity, denotes the degree of belonging of the

observation sequence X to fuzzy state j and fuzzy mixture k at time t as illustrated

in Figure 3.4.

3.4.2 Fuzzy Entropy Discrete HMM

From (3.18), the fuzzy likelihood function for the fuzzy entropy discrete HMM (FE-

DHMM) is proposed as follows [Tran and Wagner 2000a]

\[ L_n(U, \lambda; O) = -\sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} u_{s_t s_{t+1}} d_{s_t s_{t+1}}^2 - n \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} u_{s_t s_{t+1}} \log u_{s_t s_{t+1}} \tag{3.19} \]
where n > 0, u_{s_t s_{t+1}} = u_{s_t s_{t+1}}(O) and d²_{s_t s_{t+1}} = − log P(O, s_t, s_{t+1}|λ). Note that π_{s_1} is denoted by a_{s_0 s_1} in (3.19) for simplicity. Assuming that we are in state i at time t


and state j at time t + 1, the function Ln(U, λ; O) can be rewritten as follows

\[ L_n(U, \lambda; O) = -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} d_{ijt}^2 - n \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} \log u_{ijt} \tag{3.20} \]
where
\[ d_{ijt}^2 = -\log P(O, s_t = i, s_{t+1} = j|\lambda) = -\log \left[ \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \right] \tag{3.21} \]

uijt = uijt(O) is the fuzzy membership function denoting the degree to which the

observation sequence O belongs to the fuzzy state sequences being in state i at time

t and state j at time t + 1. From the definition of the fuzzy membership (2.43), we

obtain
\[ 0 \le u_{ijt} \le 1 \ \ \forall i, j, t, \qquad \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} = 1 \ \ \forall t, \qquad 0 < \sum_{t=1}^{T} u_{iit} < T \ \ \forall i \tag{3.22} \]
The inequalities 0 < \sum_{t=1}^{T} u_{iit} < T mean that state sequences consisting of only one state do not occur.

Figure 3.4: The observation sequence X belongs to fuzzy state j and fuzzy mixture k at time t in the fuzzy continuous HMM.

Fuzzy E-Step: Since maximising the function L_n(U, λ; O) over U is also minimising the corresponding function H_n(U, λ; O), the solution in (3.8) is used, with the distance defined in (3.21). We obtain
\[ u_{ijt} = \frac{e^{-d_{ijt}^2/n}}{\sum_{k=1}^{N} \sum_{l=1}^{N} e^{-d_{klt}^2/n}} = \frac{\left[ P(O, s_t = i, s_{t+1} = j|\lambda) \right]^{1/n}}{\sum_{k=1}^{N} \sum_{l=1}^{N} \left[ P(O, s_t = k, s_{t+1} = l|\lambda) \right]^{1/n}} \tag{3.23} \]

M-step: Note that the second term of the function L_n(U, λ; O) is not dependent on λ, therefore maximising L_n(U, λ; O) over λ is equivalent to maximising the following function
\[ L_n^*(U, \lambda; O) = -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} d_{ijt}^2 \tag{3.24} \]

Replacing the distance (3.21) into (3.24), we can regroup the function in (3.24) into four terms as follows
\[
\begin{aligned}
L_n^*(U, \lambda; O) = {} & \sum_{j=1}^{N} \Big( \sum_{i=1}^{N} u_{ij0} \Big) \log \pi_j + \sum_{i=1}^{N} \sum_{j=1}^{N} \Big( \sum_{t=1}^{T-1} u_{ijt} \Big) \log a_{ij} \\
& + \sum_{j=1}^{N} \sum_{k=1}^{K} \Big( \sum_{\substack{t=1 \\ \mathrm{s.t.}\ o_t = v_k}}^{T} \sum_{i=1}^{N} u_{ijt} \Big) \log b_j(k) + \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} u_{ijt} \log \left[ \alpha_t(i) \beta_{t+1}(j) \right]
\end{aligned}
\tag{3.25}
\]

where the last term including α_t(i)β_{t+1}(j) can be ignored, since the forward-backward variables can be computed from π, A, B by the forward-backward algorithm (see Section 2.3.4). Maximising the function L_n^*(U, λ; O) on π, A, B is performed by using Lagrange multipliers and the following constraints
\[ \sum_{j=1}^{N} \pi_j = 1, \qquad \sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{K} b_j(k) = 1 \tag{3.26} \]

We obtain the parameter reestimation equations as follows [Tran 1999]
\[ \pi_j = \sum_{i=1}^{N} u_{ij0}, \qquad a_{ij} = \frac{\sum_{t=1}^{T-1} u_{ijt}}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} u_{ijt}}, \qquad b_j(k) = \frac{\sum_{\substack{t=1 \\ \mathrm{s.t.}\ o_t = v_k}}^{T} \sum_{i=1}^{N} u_{ijt}}{\sum_{t=1}^{T} \sum_{i=1}^{N} u_{ijt}} \tag{3.27} \]

In the case of n = 1, the membership function in (3.23) becomes

\[ u_{ijt} = \frac{P(O, s_t = i, s_{t+1} = j|\lambda)}{\sum_{k=1}^{N} \sum_{l=1}^{N} P(O, s_t = k, s_{t+1} = l|\lambda)} = P(s_t = i, s_{t+1} = j|O, \lambda) = \xi_t(i, j) \tag{3.28} \]


where ξt(i, j) is defined in (2.20). The parameter reestimation equations in (3.27) are

now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4.
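To indicate how (3.23) can be computed in practice, the sketch below (Python with NumPy; the function name and array layout are illustrative assumptions) evaluates the joint probabilities in (3.21) from precomputed forward-backward variables and normalises their 1/n-th powers over all state pairs; the initial-state term involving π is omitted for brevity:

    import numpy as np

    def fe_dhmm_memberships(alpha, beta, A, B, O, n=1.0):
        """alpha, beta: (T, N) forward/backward variables; A: (N, N) transitions;
        B: (N, K) output probabilities b_j(k); O: length-T sequence of symbol indices.
        Returns U with U[t, i, j] = u_ijt for t = 0, ..., T-2."""
        T, N = alpha.shape
        U = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            # joint probability P(O, s_t = i, s_{t+1} = j | lambda) from (3.21)
            P = alpha[t][:, None] * A * B[:, O[t + 1]][None, :] * beta[t + 1][None, :]
            W = P ** (1.0 / n)                  # weighting exponent 1/n
            U[t] = W / W.sum()                  # normalise over all state pairs (i, j), as in (3.23)
        return U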

3.4.3 Fuzzy Entropy Continuous HMM

Similarly, u_{jkt} = u_{jkt}(X) is defined as the fuzzy membership function denoting the degree to which the vector sequence X belongs to fuzzy state s_t = j and fuzzy Gaussian mixture m_t = k at time t, satisfying
\[ 0 \le u_{jkt} \le 1 \ \ \forall j, k, t, \qquad \sum_{j=1}^{N} \sum_{k=1}^{K} u_{jkt} = 1 \ \ \forall t, \qquad 0 < \sum_{t=1}^{T} u_{jkt} < T \ \ \forall j, k \tag{3.29} \]
and the distance d_{jkt} is of the form
\[ d_{jkt}^2 = -\log P(X, s_t = j, m_t = k|\lambda) = -\log \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, w_{jk}\, N(x_t, \mu_{jk}, \Sigma_{jk})\, \beta_t(j) \Big] \tag{3.30} \]

We obtain the fuzzy EM algorithm for the fuzzy entropy continuous HMM (FE-

CHMM) as follows [Tran and Wagner 2000a]

Fuzzy E-Step:

\[ u_{jkt} = \frac{e^{-d_{jkt}^2/n}}{\sum_{i=1}^{N} \sum_{l=1}^{K} e^{-d_{ilt}^2/n}} = \frac{\left[ P(X, s_t = j, m_t = k|\lambda) \right]^{1/n}}{\sum_{i=1}^{N} \sum_{l=1}^{K} \left[ P(X, s_t = i, m_t = l|\lambda) \right]^{1/n}} \tag{3.31} \]

M-step: Similar to the continuous HMM, the parameter estimation equations for

the π and A distributions are unchanged, but the output distribution B is estimated

via Gaussian mixture parameters (w, µ, Σ) as follows

\[ w_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}}{\sum_{t=1}^{T} \sum_{k=1}^{K} u_{jkt}}, \qquad \mu_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}\, x_t}{\sum_{t=1}^{T} u_{jkt}}, \qquad \Sigma_{jk} = \frac{\sum_{t=1}^{T} u_{jkt} (x_t - \mu_{jk})(x_t - \mu_{jk})'}{\sum_{t=1}^{T} u_{jkt}} \tag{3.32} \]

In the case of n = 1, the membership function in (3.31) becomes

\[ u_{jkt} = \frac{P(X, s_t = j, m_t = k|\lambda)}{\sum_{i=1}^{N} \sum_{l=1}^{K} P(X, s_t = i, m_t = l|\lambda)} = P(s_t = j, m_t = k|X, \lambda) = \eta_t(j, k) \tag{3.33} \]


where ηt(j, k) is defined in (2.22). The parameter reestimation equations in (3.32)

are now identical to those obtained by the Baum-Welch algorithm in Section 2.3.4.

3.4.4 Noise Clustering Approach

The speech signal is influenced by the speaking environment, the transmission chan-

nel, and the transducer used to capture the signal. So there exist some bad observa-

tions regarded as outliers, which influence speech recognition performance. For the

fuzzy entropy HMM in the noise clustering approach (NC-FE-HMM), a separate state

is used to represent outliers and is termed the garbage state [Tran and Wagner 1999a].

This state has a constant distance δ from all observation sequences. The membership

u•t of an observation sequence O at time t in the garbage state is defined to be

\[ u_{\bullet t} = 1 - \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt}, \qquad 1 \le t \le T \tag{3.34} \]

Therefore, the membership constraint for the “good” states is effectively relaxed to

\[ \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} < 1, \qquad 1 \le t \le T \tag{3.35} \]

This allows noisy data and outliers to have arbitrarily small membership values in

good states. The fuzzy likelihood function for the FE-DHMM in the NC approach

(NC-FE-DHMM) is as follows

\[
\begin{aligned}
L_n(U, \lambda; O) = {} & -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} d_{ijt}^2 - n \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} \log u_{ijt} \\
& - \sum_{t=0}^{T-1} u_{\bullet t}\, \delta^2 - n \sum_{t=0}^{T-1} u_{\bullet t} \log u_{\bullet t}
\end{aligned}
\tag{3.36}
\]

Replacing u•t in (3.34) into (3.36) and maximising the fuzzy likelihood function over

U , we obtain [Tran and Wagner 2000a]

Fuzzy E-Step:

\[ u_{ijt} = \frac{1}{\sum_{k=1}^{N} \sum_{l=1}^{N} \left( e^{d_{ijt}^2} / e^{d_{klt}^2} \right)^{1/n} + \left( e^{d_{ijt}^2} / e^{\delta^2} \right)^{1/n}} = \frac{\left[ P(O, s_t = i, s_{t+1} = j|\lambda) \right]^{1/n}}{\sum_{k=1}^{N} \sum_{l=1}^{N} \left[ P(O, s_t = k, s_{t+1} = l|\lambda) \right]^{1/n} + e^{-\delta^2/n}} \tag{3.37} \]

where the distance d_{ijt} is computed as in (3.21). The second term in the denominator of (3.37) becomes quite large for outliers, resulting in small membership values in all the good states for outliers.

M-Step: The M-step is identical to the M-step of the FE-DHMM in Section 3.4.2.

The advantage of this approach is that it forms a more robust version of the fuzzy entropy algorithm and can be used instead of the fuzzy entropy algorithm provided a suitable value for the constant distance δ can be found.

Similarly, the FE-CHMM in the NC approach (NC-FE-CHMM) is as follows

Fuzzy E-Step:

\[ u_{jkt} = \frac{1}{\sum_{i=1}^{N} \sum_{l=1}^{K} \left( e^{d_{jkt}^2} / e^{d_{ilt}^2} \right)^{1/n} + \left( e^{d_{jkt}^2} / e^{\delta^2} \right)^{1/n}} = \frac{\left[ P(X, s_t = j, m_t = k|\lambda) \right]^{1/n}}{\sum_{i=1}^{N} \sum_{l=1}^{K} \left[ P(X, s_t = i, m_t = l|\lambda) \right]^{1/n} + e^{-\delta^2/n}} \tag{3.38} \]

M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3.

3.5 Fuzzy Entropy Gaussian Mixture Models

Although we can obtain equations for the fuzzy entropy GMM (FE-GMM) from the

FE-CHMM with the number of states set to N = 1, for practical applications, they

are summarised as follows.

3.5.1 Fuzzy Entropy GMM

For a training vector sequence X = (x1x2 . . .xT ), the fuzzy likelihood of the FE-GMM

is

\[ L_n(U, \lambda; X) = -\sum_{i=1}^{K} \sum_{t=1}^{T} u_{it} d_{it}^2 - n \sum_{i=1}^{K} \sum_{t=1}^{T} u_{it} \log u_{it} \tag{3.39} \]

where

\[ d_{it}^2 = -\log P(x_t, i|\lambda) = -\log \left[ w_i N(x_t, \mu_i, \Sigma_i) \right] \tag{3.40} \]


and uit = ui(xt) is the fuzzy membership function denoting the degree to which

feature vector xt belongs to fuzzy Gaussian mixture i, satisfying

\[ 0 \le u_{it} \le 1 \ \ \forall i, t, \qquad \sum_{i=1}^{K} u_{it} = 1 \ \ \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \ \ \forall i \tag{3.41} \]

Fuzzy E-Step: Maximising the fuzzy likelihood function on U gives

\[ u_{it} = \frac{e^{-d_{it}^2/n}}{\sum_{k=1}^{K} e^{-d_{kt}^2/n}} = \frac{\left[ P(x_t, i|\lambda) \right]^{1/n}}{\sum_{k=1}^{K} \left[ P(x_t, k|\lambda) \right]^{1/n}} \tag{3.42} \]

M-Step: Maximising the fuzzy likelihood function on λ gives

\[ w_i = \frac{\sum_{t=1}^{T} u_{it}}{\sum_{t=1}^{T} \sum_{k=1}^{K} u_{kt}}, \qquad \mu_i = \frac{\sum_{t=1}^{T} u_{it}\, x_t}{\sum_{t=1}^{T} u_{it}}, \qquad \Sigma_i = \frac{\sum_{t=1}^{T} u_{it} (x_t - \mu_i)(x_t - \mu_i)'}{\sum_{t=1}^{T} u_{it}} \tag{3.43} \]

Again if n = 1 we obtain

\[ u_{it} = \frac{P(x_t, i|\lambda)}{\sum_{k=1}^{K} P(x_t, k|\lambda)} = P(i|x_t, \lambda) \tag{3.44} \]

and the FE-GMM reduces to the conventional GMM.
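A minimal sketch of one fuzzy EM pass for the FE-GMM is given below (Python, assuming NumPy and SciPy are available; the function name fe_gmm_step and the variable names are illustrative only): the fuzzy E-step (3.42) followed by the M-step (3.43).

    import numpy as np
    from scipy.stats import multivariate_normal

    def fe_gmm_step(X, w, mu, Sigma, n=1.0):
        """X: (T, d) data; w: (K,) weights; mu: (K, d) means; Sigma: (K, d, d) covariances."""
        K = w.shape[0]
        # Fuzzy E-step (3.42): u_it proportional to [w_i N(x_t; mu_i, Sigma_i)]^(1/n)
        P = np.stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i]) for i in range(K)], axis=1)
        U = P ** (1.0 / n)
        U /= U.sum(axis=1, keepdims=True)
        # M-step (3.43): membership-weighted reestimation of w, mu, Sigma
        s = U.sum(axis=0)
        w_new = s / s.sum()
        mu_new = (U.T @ X) / s[:, None]
        Sigma_new = np.stack([(U[:, i, None] * (X - mu_new[i])).T @ (X - mu_new[i]) / s[i] for i in range(K)])
        return U, w_new, mu_new, Sigma_new

With n = 1 this sketch reduces to an ordinary EM pass for the GMM, which is the point made above.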

3.5.2 Noise Clustering Approach

The fuzzy likelihood function for the FE-GMM in the NC approach (NC-FE-GMM)

is as follows

\[
\begin{aligned}
L_n(U, \lambda; X) = {} & -\sum_{t=1}^{T} \sum_{i=1}^{K} u_{it} d_{it}^2 - n \sum_{t=1}^{T} \sum_{i=1}^{K} u_{it} \log u_{it} \\
& - \sum_{t=1}^{T} u_{\bullet t}\, \delta^2 - n \sum_{t=1}^{T} u_{\bullet t} \log u_{\bullet t}
\end{aligned}
\tag{3.45}
\]
where u_{\bullet t} = 1 - \sum_{i=1}^{K} u_{it}, 1 ≤ t ≤ T. Maximising the fuzzy likelihood function on U and λ gives

Fuzzy E-Step:

\[ u_{it} = \frac{e^{-d_{it}^2/n}}{\sum_{k=1}^{K} e^{-d_{kt}^2/n} + e^{-\delta^2/n}} = \frac{\left[ P(x_t, i|\lambda) \right]^{1/n}}{\sum_{k=1}^{K} \left[ P(x_t, k|\lambda) \right]^{1/n} + e^{-\delta^2/n}} \tag{3.46} \]

M-Step: the M-step is identical to the M-step of the FE-GMM in Section 3.5.1.
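The only change to the FE-GMM sketch above that the NC-FE-GMM requires is the extra term e^{-δ²/n} in the denominator of (3.46), which absorbs membership mass from outlying vectors. A hedged one-function sketch (illustrative names; P is the (T, K) matrix of joint densities w_i N(x_t; µ_i, Σ_i), and δ² is a design parameter):

    import numpy as np

    def nc_fe_memberships(P, n=1.0, delta2=50.0):
        """NC-FE-GMM fuzzy E-step (3.46); delta2 is the squared noise distance."""
        W = P ** (1.0 / n)
        return W / (W.sum(axis=1, keepdims=True) + np.exp(-delta2 / n))   # rows now sum to less than 1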


3.6 Fuzzy Entropy Vector Quantisation

Reestimation algorithms for fuzzy entropy VQ (FE-VQ) are derived from the al-

gorithms for FE-GMMs by using the Euclidean or the Mahalanobis distances. We

obtain the following [Tran and Wagner 2000g, Tran and Wagner 2000h]:

Fuzzy E-Step: Choose one of the following cases

• FE-VQ:

\[ u_{it} = \frac{1}{\sum_{k=1}^{K} \exp\!\left( \dfrac{d_{it}^2 - d_{kt}^2}{n} \right)} \tag{3.47} \]

• FE-VQ with noise clusters (NC-FE-VQ):

\[ u_{it} = \frac{1}{\sum_{k=1}^{K} \exp\!\left( \dfrac{d_{it}^2 - d_{kt}^2}{n} \right) + \exp\!\left( \dfrac{d_{it}^2 - \delta^2}{n} \right)} \tag{3.48} \]

M-Step: Choose one of the following cases

• Using the Euclidean distance d_{it}^2 = (x_t − µ_i)^2:
\[ \mu_i = \frac{\sum_{t=1}^{T} u_{it}\, x_t}{\sum_{t=1}^{T} u_{it}} \tag{3.49} \]

• Using the Mahalanobis distance d_{it}^2 = (x_t − µ_i)' Σ_i^{-1} (x_t − µ_i):
\[ \mu_i = \frac{\sum_{t=1}^{T} u_{it}\, x_t}{\sum_{t=1}^{T} u_{it}}, \qquad \Sigma_i = \frac{\sum_{t=1}^{T} u_{it} (x_t - \mu_i)(x_t - \mu_i)'}{\sum_{t=1}^{T} u_{it}} \tag{3.50} \]

3.7 A Comparison Between Conventional and Fuzzy

Entropy Models

It may be useful for applications to consider the differences between the conventional

models and FE models. In general, the difference is mainly the conventional E-

step and the fuzzy E-step in the parameter reestimation procedure. A weighting


exponent 1/n on each joint probability P (X,Y |λ) between observable data X and

unobservable data Y is introduced in the FE models. If n = 1, the FE models reduce

to the conventional models, as shown in Figure 3.5

Figure 3.5: From fuzzy entropy models to conventional models — the FE membership u_Y(X) = [P(X, Y|λ)]^{1/n} / \sum_Y [P(X, Y|λ)]^{1/n} reduces, for n = 1, to u_Y(X) = P(X, Y|λ) / \sum_Y P(X, Y|λ) = P(Y|X, λ).

The role of the degree of fuzzy entropy n can be considered in depth via its

influence on the parameter reestimation equations. Without loss of generality, let

us consider a problem of GMMs. Given a vector xt, let us assume that the cluster

i among K clusters has the highest component density P (xt, i|λ), i.e. the shortest

distance d2it = − log P (xt, i|λ). Consider the membership uit of the fuzzy entropy

GMM with n > 1. From (3.42), it can be rewritten as
\[ u_{it} = \frac{e^{-d_{it}^2/n}}{\sum_{k=1}^{K} e^{-d_{kt}^2/n}} = \frac{\left[ P(x_t, i|\lambda) \right]^{1/n}}{\sum_{k=1}^{K} \left[ P(x_t, k|\lambda) \right]^{1/n}} = \frac{\left[ P(x_t, i|\lambda) \right]^{1/n}}{\left[ P(x_t, i|\lambda) \right]^{1/n} + \sum_{\substack{k=1 \\ k \ne i}}^{K} \left[ P(x_t, k|\lambda) \right]^{1/n}} = \frac{1}{1 + \sum_{\substack{k=1 \\ k \ne i}}^{K} \left[ \dfrac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)} \right]^{1/n}} \tag{3.51} \]

For all k = 1, . . . , K and k ≠ i, we obtain the following equivalent inequalities:
\[
\begin{aligned}
& P(x_t, k|\lambda) < P(x_t, i|\lambda) \\
\Leftrightarrow\ & \frac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)} < 1 \\
\Leftrightarrow\ & \frac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)} < \left[ \frac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)} \right]^{1/n} \quad \text{since } n > 1 \\
\Leftrightarrow\ & \frac{1}{1 + \sum_{\substack{k=1 \\ k \ne i}}^{K} \left[ \dfrac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)} \right]^{1/n}} < \frac{1}{1 + \sum_{\substack{k=1 \\ k \ne i}}^{K} \dfrac{P(x_t, k|\lambda)}{P(x_t, i|\lambda)}} \\
\Leftrightarrow\ & u_{it} < P(i|x_t, \lambda)
\end{aligned}
\tag{3.52}
\]

The same can be shown easily for the remaining cases. In general, we obtain

• P(x_t, i|λ) ≥ P(x_t, k|λ) ∀k ≠ i : x_t is closest to cluster i
\[ u_{it} \begin{cases} > P(i|x_t, \lambda) & 0 < n < 1 \\ = P(i|x_t, \lambda) & n = 1 \\ < P(i|x_t, \lambda) & n > 1 \end{cases} \tag{3.53} \]
• P(x_t, i|λ) ≤ P(x_t, k|λ) ∀k ≠ i : x_t is furthest from cluster i
\[ u_{it} \begin{cases} < P(i|x_t, \lambda) & 0 < n < 1 \\ = P(i|x_t, \lambda) & n = 1 \\ > P(i|x_t, \lambda) & n > 1 \end{cases} \tag{3.54} \]

As discussed above, the distance between vector xt and cluster i is a monotonically

decreasing function of the joint probability P (xt, i|λ) (see (3.40)), therefore we can

have an interpretation for the parameter n. If 0 < n < 1, comparing with the a

posteriori probability P (i|xt, λ) in the GMM, the degree of belonging uit of vector

xt to cluster i is higher than P (i|xt, λ) if xt is close to cluster i and is lower than

P (i|xt, λ) if xt is far from cluster i. The reverse result is obtained for n > 1. Since

model parameters λ = {w, µ, Σ} are determined by memberships (see (3.43) in the

M-step), we can expect that the use of the parameter n will yield a better parametric

form for the GMMs. As discussed in Section 2.3.2, when the amount of the training

data is insufficient, the quality of the distribution parameter estimates cannot be

guaranteed. In reality, this problem often occurs. Therefore fuzzy entropy models

may enhance the above quality by their adjustable parameter n. If we wish to decrease

the influence of vectors far from cluster center, we reduce the value of n to less than 1.

Inversely, values of n greater than 1 increase the influence of those vectors. In general,

there does not exist a best value of n in all cases. For different applications and data,

suitable values of n may be different. The membership function with different values

of n versus the distance between vector xt and cluster i is shown in Figure 3.6. The

limit of n = 0 represents hard models, which will be presented in Chapter 5.


Figure 3.6: The membership function uit with different values of the degree of fuzzy

entropy n versus the distance dit between vector xt and cluster i

For example, consider the case of 4 clusters. Given P (xt, i|λ), i = 1, 2, 3, 4, we

compute uit with n = 1 for the GMM, and with n = 0.5 and n = 2 for the FE-GMM.

Table 3.1 shows these values for comparison.

Cluster i i = 1 i = 2 i = 3 i = 4

Given P (xt, i|λ) 0.0016 0.0025 0.0036 0.0049

uit = P (i|xt, λ) for GMM (n = 1.0) 0.13 0.20 0.28 0.39

uit for fuzzy entropy GMM (n = 0.5) 0.06 0.14 0.28 0.52

uit for fuzzy entropy GMM (n = 2.0) 0.18 0.23 0.27 0.32

Table 3.1: An example of memberships for the GMM and the FE-GMM
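The memberships in Table 3.1 follow directly from (3.42). The short check below (Python with NumPy) reproduces them, up to rounding, from the given joint probabilities:

    import numpy as np

    P = np.array([0.0016, 0.0025, 0.0036, 0.0049])
    for n in (1.0, 0.5, 2.0):
        u = P ** (1.0 / n)          # weighting exponent 1/n as in (3.42)
        print("n =", n, ":", np.round(u / u.sum(), 2))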

3.8 Summary and Conclusion

Fuzzy entropy models have been presented in this chapter. Relationships between

fuzzy entropy models are summarised in Figure 3.7 below. A parameter is introduced

for the degree of fuzzy entropy n > 0. With n → 0, we obtain hard models, which

will be presented in Chapter 5. With n → ∞, we obtain maximally fuzzy entropy


models, equivalent to only a single state or a single cluster. With n = 1, fuzzy entropy

models reduce to conventional models in the maximum likelihood scheme. This result

shows that the statistical models can be viewed as special cases of fuzzy models. An

advantage obtained from this viewpoint is that we can get ideas from fuzzy methods

and apply them to statistical models. For example, by letting n = 1 in the noise

clustering approach, we obtain new models for HMMs, GMMs and VQ without any

fuzzy parameters. Moreover, the adjustibility of the degree of fuzzy entropy n in FE

models is also an advantage. When conventional models do not work well in some

cases due to the insufficiency of the training data or the complexity of the speech data,

such as the nine English E-set words, a suitable value of n can be found to obtain

better models. Experimental results for these models will be reported in Chapter 7.

Figure 3.7: Fuzzy entropy models for speech and speaker recognition — the FE models divide into discrete models (FE-DHMM, NC-FE-DHMM) and continuous models (FE-CHMM, NC-FE-CHMM, FE-GMM, NC-FE-GMM, FE-VQ, NC-FE-VQ).


Chapter 4

Fuzzy C-Means Models

This chapter proposes a fuzzy approach based on fuzzy C-means (FCM) clustering to

speech and speaker recognition. Models in this approach can be called FCM models

and are estimated by the minimum fuzzy squared-error criterion used in FCM clustering.

This criterion is different from the maximum likelihood criterion, hence FCM models

cannot reduce to statistical models. However, the parameter estimation procedure of

FCM models is similar to that of FE models, where distances are defined in the same way.

In this chapter, the fuzzy EM algorithm is reformulated for the minimum fuzzy squared-

error criterion in Section 4.1. FCM models for HMMs, GMMs and VQ are presented in

the next Sections 4.2, 4.3 and 4.4, respectively. The noise clustering approach is also

considered for these models. A discussion on the role of fuzzy memberships of FCM

models and a comparison between FCM models and FE models are presented in Section

4.5.

4.1 Minimum Fuzzy Squared-Error Estimation

The fuzzy squared-error function in (2.44) page 38 is used as an optimisation criterion

to estimate FCM models. For convenience, we reintroduce this function as follows

\[ J_m(U, \lambda; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it}^m d_{it}^2 \tag{4.1} \]

where U = {u_it} is a fuzzy C-partition of X, m > 1 is a weighting exponent on each fuzzy membership u_it and is called the degree of fuzziness, and λ and d_it are defined for particular models. Minimising the function J_m(U, λ; X) over the variables U and λ is implemented by an iteration of two steps: 1) Finding Ū such that J_m(Ū, λ; X) ≤ J_m(U, λ; X), and 2) Finding λ̄ such that J_m(Ū, λ̄; X) ≤ J_m(Ū, λ; X).

The fuzzy EM algorithm on page 52 is reformulated for FCM models as follows

[Tran and Wagner 1999b]

Algorithm 3 (The Fuzzy EM Algorithm)
1. Initialisation: Fix m and choose an initial estimate λ
2. Fuzzy E-step: Compute Ū and J_m(Ū, λ; X)
3. M-step: Use a certain minimisation method to determine λ̄, for which J_m(Ū, λ̄; X) is minimised
4. Termination: Set λ = λ̄ and U = Ū; repeat from the E-step until the change of J_m(U, λ; X) falls below a preset threshold.

4.2 Fuzzy C-Means Hidden Markov Models

This section proposes a parameter estimation procedure for the HMM using the min-

imum fuzzy squared-error estimation. Noise clustering and possibilistic C-means

approaches are also considered for both discrete and continuous HMMs.

4.2.1 FCM Discrete HMM

From (4.1), the fuzzy squared-error function for the FCM discrete HMM (FCM-

DHMM) is proposed as follows [Tran and Wagner 1999c]

\[ J_m(U, \lambda; O) = \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} u_{s_t s_{t+1}}^m d_{s_t s_{t+1}}^2 \tag{4.2} \]
where m > 1, u_{s_t s_{t+1}} = u_{s_t s_{t+1}}(O) and d²_{s_t s_{t+1}} = − log P(O, s_t, s_{t+1}|λ). Note that π_{s_1} is denoted by a_{s_0 s_1} in (4.2) for simplicity. Assuming that we are in state i at time t

and state j at time t + 1, the function Jm(U, λ; O) can be rewritten as follows

\[ J_m(U, \lambda; O) = \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt}^m d_{ijt}^2 \tag{4.3} \]


where uijt and dijt are defined in (3.22) and (3.21), respectively. Minimising the

function Jm(U, λ; O) over U and λ gives the following fuzzy EM algorithm for the

FCM-DHMM

Fuzzy E-Step: Minimising the function Jm(U, λ; O) in (4.2) over U gives

\[ u_{ijt} = \left[ \sum_{k=1}^{N} \sum_{l=1}^{N} \left( d_{ijt}^2 / d_{klt}^2 \right)^{1/(m-1)} \right]^{-1} \tag{4.4} \]

M-step: Replacing the distance (3.21) into (4.3), we can regroup the function in

(4.3) into four terms as follows

\[
\begin{aligned}
J_m(U, \lambda; O) = {} & -\sum_{j=1}^{N} \Big( \sum_{i=1}^{N} u_{ij0}^m \Big) \log \pi_j - \sum_{i=1}^{N} \sum_{j=1}^{N} \Big( \sum_{t=1}^{T-1} u_{ijt}^m \Big) \log a_{ij} \\
& - \sum_{j=1}^{N} \sum_{k=1}^{K} \Big( \sum_{\substack{t=1 \\ \mathrm{s.t.}\ o_t = v_k}}^{T} \sum_{i=1}^{N} u_{ijt}^m \Big) \log b_j(k) - \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} u_{ijt}^m \log \left[ \alpha_t(i) \beta_{t+1}(j) \right]
\end{aligned}
\tag{4.5}
\]

where, similar to the FE-DHMM, the last term including αt(i)βt+1(j) is ignored,

since the forward-backward variables can be computed from π,A,B by the forward-

backward algorithm (see Section 2.3.4). Minimising the function Jm(U, λ; O) on

π,A,B is performed by using Lagrange multipliers and the same constraints as in

(3.26) [Tran and Wagner 1999c]. We obtain the parameter reestimation equations as

follows

\[ \pi_j = \sum_{i=1}^{N} u_{ij0}^m, \qquad a_{ij} = \frac{\sum_{t=1}^{T-1} u_{ijt}^m}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} u_{ijt}^m}, \qquad b_j(k) = \frac{\sum_{\substack{t=1 \\ \mathrm{s.t.}\ o_t = v_k}}^{T} \sum_{i=1}^{N} u_{ijt}^m}{\sum_{t=1}^{T} \sum_{i=1}^{N} u_{ijt}^m} \tag{4.6} \]

4.2.2 FCM Continuous HMM

Similarly, using the fuzzy membership function ujkt and the distance djkt in (3.29)

and (3.30), respectively, we obtain the fuzzy EM algorithm for the FCM continuous

HMM (FCM-CHMM) as follows [Tran and Wagner 1999a]

Fuzzy E-Step:

\[ u_{jkt} = \left[ \sum_{i=1}^{N} \sum_{l=1}^{K} \left( d_{jkt}^2 / d_{ilt}^2 \right)^{1/(m-1)} \right]^{-1} \tag{4.7} \]


M-step: Similar to the FE-CHMM, the parameter estimation equations for the

π and A distributions are unchanged, but the output distribution B is estimated via

Gaussian mixture parameters (w, µ, Σ) as follows

\[ w_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}^m}{\sum_{t=1}^{T} \sum_{k=1}^{K} u_{jkt}^m}, \qquad \mu_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}^m x_t}{\sum_{t=1}^{T} u_{jkt}^m}, \qquad \Sigma_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}^m (x_t - \mu_{jk})(x_t - \mu_{jk})'}{\sum_{t=1}^{T} u_{jkt}^m} \tag{4.8} \]

4.2.3 Noise Clustering Approach

The concept of the garbage state in Section 3.4.4, page 58 is applied to the FCM-

DHMM. The fuzzy objective function for the FCM-DHMM in the NC approach (NC-

FCM-DHMM) is as follows [Tran and Wagner 1999a]

\[ J_m(U, \lambda; O) = \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt}^m d_{ijt}^2 + \sum_{t=0}^{T-1} \Big( 1 - \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} \Big)^m \delta^2 \tag{4.9} \]

The fuzzy EM algorithm for the NC-FCM-DHMM is as follows

Fuzzy E-Step:

\[ u_{ijt} = \frac{1}{\sum_{k=1}^{N} \sum_{l=1}^{N} \left( d_{ijt}^2 / d_{klt}^2 \right)^{1/(m-1)} + \left( d_{ijt}^2 / \delta^2 \right)^{1/(m-1)}} \tag{4.10} \]

where dijt is defined in (3.21). The second term in the denominator of (4.10) becomes

quite large for outliers, resulting in small membership values in all the good states

for outliers.

M-Step: identical to the M-step of the FCM-DHMM in Section 4.2.1.

Similarly, the FCM-CHMM in the NC approach (NC-FCM-CHMM) is as follows

Fuzzy E-Step:

\[ u_{jkt} = \frac{1}{\sum_{i=1}^{N} \sum_{l=1}^{K} \left( d_{jkt}^2 / d_{ilt}^2 \right)^{1/(m-1)} + \left( d_{jkt}^2 / \delta^2 \right)^{1/(m-1)}} \tag{4.11} \]

where djkt is defined in (3.30).

M-Step: identical to the M-step of the FCM-CHMM in Section 4.2.2.


4.3 Fuzzy C-Means Gaussian Mixture Models

Similar to FE-GMMs, FCM Gaussian mixture models (FCM-GMMs) are summarised

as follows.

4.3.1 Fuzzy C-Means GMM

For a training vector sequence X = (x1x2 . . .xT ), the fuzzy objective function of the

fuzzy C-means GMM (FCM-GMM) is [Tran et al. 1998a]

\[ J_m(U, \lambda; X) = \sum_{i=1}^{K} \sum_{t=1}^{T} u_{it}^m d_{it}^2 \tag{4.12} \]

where

\[ d_{it}^2 = -\log P(x_t, i|\lambda) = -\log \left[ w_i N(x_t, \mu_i, \Sigma_i) \right] \tag{4.13} \]

and uit = ui(xt) is the fuzzy membership function denoting the degree to which

feature vector xt belongs to Gaussian distribution i, satisfying

\[ 0 \le u_{it} \le 1 \ \ \forall i, t, \qquad \sum_{i=1}^{K} u_{it} = 1 \ \ \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \ \ \forall i \tag{4.14} \]

The fuzzy EM algorithm for the FCM-GMM is as follows [Tran and Wagner 1998]

Fuzzy E-Step: Minimising the fuzzy objective function over U gives

\[ u_{it} = \left[ \sum_{k=1}^{K} \left( d_{it}^2 / d_{kt}^2 \right)^{1/(m-1)} \right]^{-1} \tag{4.15} \]

M-Step: Minimising the fuzzy objective function over λ gives

\[ w_i = \frac{\sum_{t=1}^{T} u_{it}^m}{\sum_{t=1}^{T} \sum_{k=1}^{K} u_{kt}^m}, \qquad \mu_i = \frac{\sum_{t=1}^{T} u_{it}^m x_t}{\sum_{t=1}^{T} u_{it}^m}, \qquad \Sigma_i = \frac{\sum_{t=1}^{T} u_{it}^m (x_t - \mu_i)(x_t - \mu_i)'}{\sum_{t=1}^{T} u_{it}^m} \tag{4.16} \]
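The following sketch (Python with NumPy; illustrative names, and the distances are assumed positive as in the thesis) carries out one FCM-GMM pass: the fuzzy E-step (4.15) followed by the M-step (4.16), where the weights u_it^m are used to reestimate (w, µ, Σ).

    import numpy as np

    def fcm_gmm_step(X, D2, m=2.0):
        """X: (T, d) data; D2: (T, K) distances d2_it = -log[w_i N(x_t; mu_i, Sigma_i)]."""
        T, K = D2.shape
        ratio = (D2[:, :, None] / D2[:, None, :]) ** (1.0 / (m - 1.0))   # (T, K, K): (d2_it / d2_kt)^(1/(m-1))
        U = 1.0 / ratio.sum(axis=2)                                      # fuzzy E-step (4.15)
        Um = U ** m                                                      # weighting exponent m for the M-step
        s = Um.sum(axis=0)
        w = s / s.sum()                                                  # mixture weights, (4.16)
        mu = (Um.T @ X) / s[:, None]                                     # means, (4.16)
        Sigma = np.stack([(Um[:, i, None] * (X - mu[i])).T @ (X - mu[i]) / s[i] for i in range(K)])
        return U, w, mu, Sigma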

4.3.2 Noise Clustering Approach

The fuzzy objective function for the FCM-GMM in the NC approach (NC-FCM-

GMM) is as follows [Tran and Wagner 1999e]

\[ J_m(U, \lambda; X) = \sum_{t=1}^{T} \sum_{i=1}^{K} u_{it}^m d_{it}^2 + \sum_{t=1}^{T} \Big( 1 - \sum_{i=1}^{K} u_{it} \Big)^m \delta^2 \tag{4.17} \]


Minimising the fuzzy objective function on U and λ gives

Fuzzy E-Step:

\[ u_{it} = \frac{1}{\sum_{k=1}^{K} \left( d_{it}^2 / d_{kt}^2 \right)^{1/(m-1)} + \left( d_{it}^2 / \delta^2 \right)^{1/(m-1)}} \tag{4.18} \]

M-Step: identical to the M-step of the FCM-GMM in Section 4.3.1.

4.4 Fuzzy C-Means Vector Quantisation

The reestimation algorithms for fuzzy C-means VQ (FCM-VQ) are identical to the

FCM algorithms reviewed in Section 2.4.6, page 37.

4.5 Comparison Between FCM and FE Models

We have considered three kinds of models: 1) Conventional models in Chapter 2,

2) FE models in Chapter 3, and 3) FCM models in this chapter. The relationship

between conventional and FE models has been discussed in Section 3.7 page 61 and

can be summarised in Figure 4.1. Conventional models (HMMs, GMMs and VQ) are

considered as a special group with a value of the degree of fuzzy entropy n = 1 within

the infinite family of FE model groups with values of n in the range (0,∞).

Figure 4.1: The relationship between FE model groups versus the degree of fuzzy entropy n — hard models at n → 0, conventional models at n = 1, and FE model groups over the whole range n ∈ (0, ∞).

Such a relationship is not available for conventional models and FCM models. This

means that no suitable value of the degree of fuzziness m in (1,∞) can be set for

FCM models to reduce to the conventional models. However, a similar relationship

can be established between the group well known in pattern recognition with m = 2 and other FCM model groups with m > 1 and m ≠ 2. Figure 4.2 shows a relationship between FCM model groups similar to that in Figure 4.1.

Figure 4.2: The relationship between FCM model groups versus the degree of fuzziness m — hard models at m → 1 and the typical models at m = 2.

Therefore we discuss this relationship before comparing FCM and FE models.

Without loss of generality, let us consider the following problem for GMMs. Given

a vector xt, we assume that xt is closest to the cluster i among K clusters. Consider

the membership uit of the FCM-GMM. From (4.15), it can be rewritten as

\[ u_{it} = \frac{1}{\sum_{k=1}^{K} \left( \dfrac{d_{it}^2}{d_{kt}^2} \right)^{1/(m-1)}} = \frac{1}{1 + \sum_{\substack{k=1 \\ k \ne i}}^{K} \left( \dfrac{d_{it}^2}{d_{kt}^2} \right)^{1/(m-1)}} \tag{4.19} \]

Writing u∗it for the membership if m = 2, we obtain

\[ u_{it}^* = 1 \Big/ \Big[ 1 + \sum_{\substack{k=1 \\ k \ne i}}^{K} \frac{d_{it}^2}{d_{kt}^2} \Big] \tag{4.20} \]

For m > 2, we have 1/(m − 1) < 1. From the above assumption for x_t, we obtain d_{it}^2 / d_{kt}^2 < 1 ∀k ≠ i. Since the function a^x decreases with x for a < 1, we can show that
\[ u_{it} < u_{it}^* \tag{4.21} \]

It can be easily shown for the remaining cases. In general, we obtain

• d_{it}^2 ≤ d_{kt}^2 ∀k ≠ i : x_t is closest to cluster i
\[ u_{it} \begin{cases} > u_{it}^* & 1 < m < 2 \\ = u_{it}^* & m = 2 \\ < u_{it}^* & m > 2 \end{cases} \tag{4.22} \]
• d_{it}^2 ≥ d_{kt}^2 ∀k ≠ i : x_t is furthest from cluster i
\[ u_{it} \begin{cases} < u_{it}^* & 1 < m < 2 \\ = u_{it}^* & m = 2 \\ > u_{it}^* & m > 2 \end{cases} \tag{4.23} \]


Comparing with the typical FCM model with m = 2, if we wish to decrease the

influence of vectors far from the cluster center, we reduce the value of m to less than

2. Inversely, values of m > 2 increase the influence of those vectors. The membership

function with different values of m versus the distance between vector xt and cluster

i is demonstrated in Figure 4.3, which is quite similar to that shown in Figure 3.6 on page 64 for FE models. The limit of m = 1 represents hard models, which will be

presented in Chapter 5.

Figure 4.3: The FCM membership function uit with different values of the degree of

fuzziness m versus the distance dit between vector xt and cluster i

It is known that model parameters are estimated by the memberships in the M-

step of reestimation algorithms, therefore selecting a suitable value of the degree of

fuzziness m is necessary to obtain optimum model parameter estimates. This can be

a solution for the insufficient training data problem. In this case, the quality of the

model parameter estimates trained by conventional methods cannot be guaranteed

and hence FCM models with the adjustable parameter m can be employed to find

better estimates. Although FE models also have the advantage of the adjustable

parameter n, the membership functions of FE and FCM models are different because

of the employed different optimisation criteria. We compare the expressions of FE


and FCM memberships taken from (3.42) and (4.15) as follows

\[ \mathrm{FE:}\quad u_{it} = \frac{e^{-d_{it}^2/n}}{\sum_{k=1}^{K} e^{-d_{kt}^2/n}}, \qquad \mathrm{FCM:}\quad u_{it} = \frac{\left( 1/d_{it}^2 \right)^{1/(m-1)}}{\sum_{k=1}^{K} \left( 1/d_{kt}^2 \right)^{1/(m-1)}} \tag{4.24} \]

For simplicity, we consider the typical cases where n = 1 and m = 2. It can be

seen that the FE membership employs the function f(x) = e−x whereas the FCM

membership employs the function f(x) = 1/x, where x = d2it. Figure 4.4 shows

curves representing these functions.

Figure 4.4: Curves representing the functions used in the FE and FCM memberships,

where x = d2it, m = 2 and n = 1

We can see that the change of the FCM membership is more rapid than the one of

the FE membership for short distances (0 < x < 1). Applied to cluster analysis, this

means that the FCM memberships of feature vectors close to the cluster center can

have very different values even if these vectors are close together, whereas their FE

memberships are not very different. On the other hand, in the M-step, FE and FCM

model parameters are estimated by uit and umit , respectively. For long distances, i.e.

for feature vectors very far from the cluster center, the difference between uit for the

FE membership and umit for the FCM membership is not significant. Indeed, although

Figure 4.4 shows the FCM membership value is much greater than the FE membership


value for long distances, for estimating the model parameters in the M-step, the FCM

membership value is reduced due to the weighting exponent m (u_{it}^m < u_{it} as m > 1 and 0 ≤ u_{it} ≤ 1).
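A small numerical illustration of this point (Python with NumPy; the two-cluster setting and the distance values are illustrative only) compares the FE membership with n = 1 and the FCM membership with m = 2 when the second cluster is held at squared distance 1:

    import numpy as np

    for x in (0.05, 0.2, 0.5, 1.0, 2.0):
        d2 = np.array([x, 1.0])                      # squared distances to the two clusters
        u_fe = np.exp(-d2) / np.exp(-d2).sum()       # FE membership with n = 1, from (4.24)
        u_fcm = (1.0 / d2) / (1.0 / d2).sum()        # FCM membership with m = 2, from (4.24)
        print("d2 = %.2f:  FE u_1t = %.3f   FCM u_1t = %.3f" % (x, u_fe[0], u_fcm[0]))

For short distances the FCM membership approaches 1 much faster than the FE membership, which is the behaviour described above.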

4.6 Summary and Conclusion

Fuzzy C-means models have been proposed in this chapter. A parameter is introduced

as the degree of fuzziness m > 1. With m → 1, we obtain hard models, which will

be presented in Chapter 5. With m → ∞, we obtain maximally fuzzy models with

only a single state or a single cluster. Typical models with m = 2 are well known

in pattern recognition. The main differences between FE models and FCM models

are the optimisation criteria and the reestimation of the fuzzy membership functions.

The advantage of the FE and FCM models is that they have adjustable parameters

m and n, which may be useful for finding optimum models for solving the insufficient

training data problem. Relationships between the FCM models are summarised in

Figure 4.5 below. Experimental results for these models will be reported in Chapter

6.

Figure 4.5: Fuzzy C-means models for speech and speaker recognition — the FCM models divide into discrete models (FCM-DHMM, NC-FCM-DHMM) and continuous models (FCM-CHMM, NC-FCM-CHMM, FCM-GMM, NC-FCM-GMM, FCM-VQ, NC-FCM-VQ).


Chapter 5

Hard Models

As the degrees of fuzzy entropy and fuzziness tend to their minimum values, both the

fuzzy entropy and the fuzzy C-means models approach the hard model. For fuzzy and

conventional models, the model structures are not fundamentally different, except for the

use of fuzzy optimisation criteria and the fuzzy membership. However, a different model

structure applies to the hard models because of the binary (zero-one) membership function.

For example, the hard HMM employs only the best path for estimating model parameters

and for recognition, and the hard GMM employs only the most likely Gaussian distribution

among the mixture of Gaussians to represent a feature vector. Although the smoothed

(fuzzy) membership is more successful than the binary (hard) membership in describing

the model structures, hard models can also be used because they are simple yet effective.

The simplest hard model is the VQ model, which is effective for speaker recognition.

This chapter proposes new hard models—hard HMMs and hard GMMs. These models

emerge as interesting consequences of investigating fuzzy approaches. Sections 5.2 and

5.3 present hard models for HMMs and GMMs, respectively. The last section gives a

summary and a conclusion for hard models.

5.1 From Fuzzy To Hard Models

Fuzzy and hard models have a mutual relation: fuzzy models are obtained by fuzzi-

fying hard models, or inversely we can derive hard models by defuzzification of fuzzy

models. We consider this relation for the simplest models: VQ (hard C-means) and


fuzzy VQ (FE-VQ and FCM-VQ). As presented in Chapters 3 and 4, the fuzzifica-

tion of the hard objective function (the sum-of-squared-error function) to the fuzzy

objective function is achieved by adding a fuzzy entropy term to the hard objective

function for FE-VQ or applying a weighting exponent m > 1 to each uit for FCM-VQ.

Figure 5.1 shows this fuzzification. Inversely, the defuzzification of FE-VQ by letting

n → 0 or FCM-VQ by letting m → 1 results in the same VQ. To obtain a simpler

calculation, a more convenient way is to use the minimum distance rule mentioned

in Equation (2.41) on page 37 to implement the defuzzification. Figure 5.2 shows the

defuzzification methods.

Figure 5.1: From hard VQ to fuzzy VQ — the hard objective function J(U, λ; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} d_{it}^2 with hard memberships u_{it} = 0 or 1 becomes, for fuzzy memberships u_{it} ∈ [0, 1], either the FE function H_n(U, λ; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} d_{it}^2 + n \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it} \log u_{it} (an additional fuzzy entropy term) or the FCM function J_m(U, λ; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it}^m d_{it}^2 (a weighting exponent m > 1 on each u_{it}).

The mutual relation between fuzzy and hard VQ models has already been reported in the literature. In this thesis, we show that this relation can be applied to more general models, such as the GMM and the HMM.

Figure 5.2: From fuzzy VQ to (hard) VQ: n → 0 for FE-VQ or m → 1 for FCM-VQ, or using the minimum distance rule u_{it} = 1 if d_{it} < d_{kt} ∀k and u_{it} = 0 otherwise, to compute u_{it} directly.

Indeed, the fuzzy HMMs and


the fuzzy GMMs presented in Chapters 3 and 4 are generalised from fuzzy VQ by

using the distance-probability relation d2XY = − log P (X,Y ) that relates clustering

to modelling. In this chapter, the distance-probability relation is used to derive hard

HMMs and hard GMMs from fuzzy HMMs and fuzzy GMMs, respectively. Applying

this relation to the minimum distance rule, we obtain a probabilistic rule to determine

the hard membership value uY (X) as follows

\[ u_Y(X) = \begin{cases} 1 & \text{if } P(X, Y) > P(X, Z) \ \ \forall Z \ne Y \\ 0 & \text{otherwise} \end{cases} \tag{5.1} \]

where X denotes observable data and Y , Z denote unobservable data. This rule can

be called the maximum joint probability rule. Figure 5.3 shows the mutual relations

between fuzzy and hard models used in this thesis.

Figure 5.3: Mutual relations between fuzzy and hard models — VQ and fuzzy VQ are linked by fuzzification and defuzzification; the distance-probability relation d²_XY = − log P(X, Y) generalises fuzzy VQ to the fuzzy HMMs and fuzzy GMMs, and the minimum distance rule takes the fuzzy HMMs and GMMs to the hard HMMs and GMMs.

5.2 Hard Hidden Markov Models

As discussed in Section 3.4.1, the membership uijt denotes the belonging of an obser-

vation sequence O to fuzzy state sequences being in state i at time t and state j at

time t + 1. There are N possible fuzzy states at each time t and they are concate-

nated into fuzzy state sequences. Figure 5.4 illustrates possible fuzzy state sequences

starting from state 1 at time t = 1 and ending at state 3 at time t = T in a 3-state

left-to-right HMM with Δi = 1 (the Bakis HMM) and also in the corresponding fuzzy

HMM. As discussed above, the hard membership function takes only two values 0 and

1. This means that in hard HMMs, each observation ot belongs only to the most likely

state at each time t. In other words, the sequence O is in the most likely single state sequence.

Figure 5.4: Possible state sequences in a 3-state Bakis HMM and a 3-state fuzzy Bakis HMM.

This is quite similar to the state sequence in HMMs using the Viterbi

algorithm, where if we round maximum probability values to 1 and others to 0, we

obtain hard HMMs. Therefore, conventional HMMs using the Viterbi algorithm can

be regarded as “pretty” hard HMMs. This remark gives an alternative approach to

hard HMMs from conventional HMMs based on the Viterbi algorithm. Figure 5.5

illustrates a possible single state sequence in the hard HMM [Tran et al. 2000a].

Figure 5.5: A possible single state sequence in a 3-state hard HMM


5.2.1 Hard Discrete HMM

This section shows the reestimation algorithm for the hard discrete HMM (H-DHMM),

where the maximum joint probability rule in (5.1) is formulated as the “hard” E-step.

In previous chapters, we have used the fuzzy membership uijt for FE-DHMMs and

FCM-DHMMs. However, for reducing calculation, the hard membership uijt can be

computed by the product of the memberships uit and uj(t+1). Indeed, uijt = 1 means

that the observation sequence O is in state i at time t (uit = 1) and state j at time

t + 1 (uj(t+1) = 1), thus uijt = uituj(t+1) = 1. It is quite similar for the remaining

cases. Therefore we use the following distance for the membership uit

\[ d_{it}^2 = -\log P(O, s_t = i|\lambda) \tag{5.2} \]

From (2.26) page 29, we can show that

\[ P(O, s_t = i|\lambda) = \alpha_t(i)\beta_t(i) \tag{5.3} \]

where αt(i) and βt(i) are computed in (2.24) and (2.25) page 28. Using the maximum

joint probability rule in (5.1), the reestimation algorithm for H-DHMM is as follows

Hard E-Step:

\[ u_{it} = \begin{cases} 1 & \text{if } \alpha_t(i)\beta_t(i) > \alpha_t(k)\beta_t(k) \ \ \forall k \ne i \\ 0 & \text{otherwise} \end{cases} \tag{5.4} \]

Ties are broken randomly.

M-Step:

\[ \pi_j = u_{j1}, \qquad a_{ij} = \frac{\sum_{t=1}^{T-1} u_{it}\, u_{j(t+1)}}{\sum_{t=1}^{T-1} u_{it}}, \qquad b_j(k) = \frac{\sum_{\substack{t=1 \\ \mathrm{s.t.}\ o_t = v_k}}^{T} u_{jt}}{\sum_{t=1}^{T} u_{jt}} \tag{5.5} \]
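A hedged sketch of the hard E-step (5.4) is given below (Python with NumPy; illustrative names): given the forward and backward variables, each frame is assigned to its single most likely state, with ties resolved here by the first index returned by argmax rather than randomly.

    import numpy as np

    def hard_state_memberships(alpha, beta):
        """alpha, beta: (T, N) forward/backward variables. Returns (T, N) zero-one memberships u_it."""
        gamma = alpha * beta                         # proportional to P(O, s_t = i | lambda), see (5.3)
        U = np.zeros_like(gamma)
        U[np.arange(gamma.shape[0]), gamma.argmax(axis=1)] = 1.0
        return U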

5.2.2 Hard Continuous HMM

The distance in (3.30) page 57 defined for the FE-CHMM and the FCM-CHMM is

used for the hard continuous HMM (H-CHMM). Using the maximum joint probability

rule in (5.1), the reestimation algorithm for H-CHMM is as follows


Hard E-Step:

\[ u_{jkt} = \begin{cases} 1 & \text{if } P(X, s_t = j, m_t = k|\lambda) > P(X, s_t = h, m_t = l|\lambda) \ \ \forall (h, l) \ne (j, k) \\ 0 & \text{otherwise} \end{cases} \tag{5.6} \]

where ties are broken randomly and P (X, st = j,mt = k|λ) is computed in (3.30).

M-Step:

\[ w_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}}{\sum_{t=1}^{T} \sum_{k=1}^{K} u_{jkt}}, \qquad \mu_{jk} = \frac{\sum_{t=1}^{T} u_{jkt}\, x_t}{\sum_{t=1}^{T} u_{jkt}}, \qquad \Sigma_{jk} = \frac{\sum_{t=1}^{T} u_{jkt} (x_t - \mu_{jk})(x_t - \mu_{jk})'}{\sum_{t=1}^{T} u_{jkt}} \tag{5.7} \]

5.3 Hard Gaussian Mixture Models

In fuzzy Gaussian mixture models, the membership uit denotes the belonging of vector

xt to cluster i represented by a Gaussian distribution. Since 0 ≤ uit ≤ 1, vector xt

is regarded as belonging to a mixture of Gaussian distributions. In hard Gaussian

mixture models (H-GMMs), the membership uit takes only two values 0 and 1. This

means that vector xt belongs to only one Gaussian distribution, or in other words,

Gaussian distributions are not mixed; they are separated by boundaries as in VQ.

Figure 5.6 illustrates a mixture of three Gaussian distributions in fuzzy GMMs and

Figure 5.7 illustrates three separate Gaussian distributions in the H-GMM. With

this interpretation, H-GMMs should be termed hard C-Gaussians models (from the

terminology “hard C-means”) or K-Gaussians models (from the terminology “K-

means”) [Tran et al. 2000b].

The distance in (3.40) page 59 defined for the FE-GMM is used for the H-GMM.

From the maximum joint probability rule in (5.1), the reestimation algorithm for the

H-GMM is formulated as follows

Hard E-Step:

u_it = { 1   if P(x_t, i | λ) > P(x_t, k | λ) for all k ≠ i
         0   otherwise                                          (5.8)

where ties are broken randomly and

P(x_t, i | λ) = w_i N(x_t, µ_i, Σ_i)     (5.9)


Figure 5.6: A mixture of three Gaussian distributions in the GMM or the fuzzy GMM.

Figure 5.7: A set of three non-overlapping Gaussian distributions in the hard GMM.


wi and N(xt, µi,Σi) are defined in Section 2.3.5, page 29.

M-Step:

w_i = [ Σ_{t=1}^{T} u_it ] / [ Σ_{t=1}^{T} Σ_{k=1}^{K} u_kt ],    µ_i = [ Σ_{t=1}^{T} u_it x_t ] / [ Σ_{t=1}^{T} u_it ],    Σ_i = [ Σ_{t=1}^{T} u_it (x_t − µ_i)(x_t − µ_i)′ ] / [ Σ_{t=1}^{T} u_it ]     (5.10)

It can be seen that the above reestimation algorithm is the most general of the VQ algorithms mentioned in Section 2.3.6, page 31. Indeed, let T_i be the number of vectors in cluster i; from (5.8) we obtain

Σ_{t=1}^{T} u_it = T_i     (5.11)

Substituting (5.11) into (5.10) gives an alternative form for the M-step as follows

M-Step:

w_i = T_i / T,    µ_i = (1/T_i) Σ_{x_t ∈ C_i} x_t,    Σ_i = (1/T_i) Σ_{x_t ∈ C_i} (x_t − µ_i)(x_t − µ_i)′     (5.12)

Since covariance matrices are also reestimated, this algorithm is more generalised

than the ECVQ algorithm reviewed in (2.36), page 32.
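For illustration, the following Python fragment carries out one hard-GMM (K-Gaussians) reestimation pass following (5.8) and (5.12). It is a minimal sketch assuming full covariance matrices and SciPy's multivariate normal density; the function name, the empty-cluster guard and the in-place updates are illustrative choices, not the implementation used for the experiments reported later.

    import numpy as np
    from scipy.stats import multivariate_normal

    def hard_gmm_reestimate(X, w, mu, sigma):
        """One H-GMM pass: hard E-step (5.8) followed by M-step (5.12).

        X     : data, shape (T, d)
        w     : mixture weights, shape (C,)
        mu    : means, shape (C, d)
        sigma : covariance matrices, shape (C, d, d)
        """
        T, _ = X.shape
        C = len(w)
        # Hard E-step: assign x_t to the cluster maximising P(x_t, i | lambda) = w_i N(x_t; mu_i, Sigma_i)
        joint = np.column_stack([w[i] * multivariate_normal.pdf(X, mu[i], sigma[i]) for i in range(C)])
        labels = joint.argmax(axis=1)

        # M-step (5.12): w_i = T_i / T, mean and covariance taken over the vectors assigned to cluster i
        for i in range(C):
            Xi = X[labels == i]
            Ti = max(len(Xi), 1)                      # guard against empty clusters
            w[i] = len(Xi) / T
            if len(Xi):
                mu[i] = Xi.mean(axis=0)
                diff = Xi - mu[i]
                sigma[i] = diff.T @ diff / Ti
        return w, mu, sigma, labels

Dropping the covariance update and keeping only the means reduces this pass to the ordinary K-means (VQ) update, which is the sense in which the hard GMM generalises the VQ algorithms above.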

5.4 Summary and Conclusion

Hard models have been presented in this chapter. There are three ways to obtain hard

models: 1) Using the fuzzy entropy algorithm with n ≈ 0; 2) Using the fuzzy C-means

algorithm with m ≈ 1; and 3) using the nearest prototype rule. The third way involves the simplest calculations. We have proposed hard HMMs and hard GMMs, where only the best state sequence is employed in hard HMMs and a non-overlapping Gaussian

distribution set is employed in hard GMMs. Conventional HMMs using the Viterbi

algorithm can be regarded as “pretty” hard HMMs. Hard GMMs are regarded as the

most generalised VQ models from which VQ, extended VQ and entropy-constrained

VQ can be derived. Conventional HMMs using the Viterbi algorithm and VQ models

are widely used, which means that hard models play an important role in speech

and speaker recognition. Relationships between the hard models are summarised in


Figure 5.8 and experimental results for hard models will be reported in Chapter 7.

[Figure 5.8 is a diagram of the relationships between the hard models: the hard models comprise the hard DHMM with λ = {π, A, B} and the hard CHMM with λ = {π, A, B = {w, µ, Σ}}; the hard CHMM leads to the hard GMM with λ = {w, µ, Σ}, which in turn specialises to ECVQ with λ = {µ, w}, extended VQ with λ = {µ, Σ} and VQ with λ = {µ}.]

Figure 5.8: Relationships between hard models


Chapter 6

A Fuzzy Approach to Speaker Verification

Fuzzy approaches have been presented in Chapters 3 and 4 to train speech and speaker models.

This chapter proposes an alternative fuzzy approach to speaker verification. For an input

utterance and a claimed identity, most of the current methods compute a claimed speaker’s

score, which is the ratio of the claimed speaker’s and the impostors’ likelihood functions,

and compare this score with a given threshold to accept or reject this speaker. Considering

the speaker verification problem based on fuzzy set theory, the claimed speaker’s score is

viewed as the fuzzy membership function of the input utterance in the claimed speaker’s

fuzzy set of utterances. Fuzzy entropy and fuzzy C-means membership functions are

proposed as fuzzy membership scores, which are the ratios of functions of the claimed

speaker’s and impostors’ likelihood functions. So a likelihood transformation is considered

to relate current likelihood and fuzzy membership scores. Based on this consideration,

more fuzzy scores are proposed to compare with current methods. Furthermore, the

noise clustering method supplies a very effective modification to all methods, which can

overcome some of the problems of ratio-type scores and greatly reduce the false acceptance

rate.

Some basic concepts of speaker verification relevant to this chapter have been reviewed

in Section 2.2.2 page 18. A more detailed analysis of a speaker verification system and of

current normalisation methods are provided in this chapter.


6.1 A Speaker Verification System

Let λ0 be the claimed speaker model and λ be a model representing all other possible speakers, i.e. impostors¹. For a given input utterance X and a claimed identity,

the choice is between the hypothesis H0: X is from the claimed speaker λ0, and

the alternative hypothesis H1: X is from the impostors λ. A claimed speaker’s score

S(X) is computed to reject or accept the speaker claim. Depending on the meaning of

the score, we can distinguish between similarity scores L(X) and dissimilarity scores

D(X) between X and λ0. Likelihood scores are included in L(X) and VQ distortion

scores are included in D(X). These scores satisfy the following rules

L(X): accept if L(X) > θ_L, reject if L(X) ≤ θ_L     (6.1)

and

D(X): accept if D(X) < θ_D, reject if D(X) ≥ θ_D     (6.2)

where θL and θD are the decision thresholds. Figure 6.1 presents a typical speaker

verification system.

[Figure 6.1 is a block diagram: the input speech passes through speech processing, score determination and hypothesis testing, producing an accept or reject decision; the claimed identity selects the speaker models used for score determination, and the threshold is applied during hypothesis testing.]

Figure 6.1: A Typical Speaker Verification System

Assuming that speaker models are given, this chapter solves the problem of finding

the effective scores of the claimed speaker such that the equal error rate (EER)

mentioned in Section 2.2.2 page 18 is minimised. We define equivalent scores as

scores giving the same EER even though they may use different thresholds. For example, Sa(X) = P(X|λ0) and Sb(X) = log P(X|λ0) are equivalent scores, but use thresholds θ and log θ, respectively.

¹ In this context, we use the term impostor for any speaker other than the claimed speaker, without any implication of fraudulent intent or active voice manipulation.


6.2 Current Normalisation Methods

The simplest method of scoring is to use the absolute likelihood score (unnormalised

score) of an utterance. In the log domain, that is

L0(X) = log P (X|λ0) (6.3)

This score is strongly influenced by variations in the test utterance such as the

speaker’s vocal characteristics, the linguistic content and the speech quality. It is

very difficult to set a common decision threshold to be used over different tests. This

drawback is overcome to some extent by using normalisation. According to the Bayes

decision rule for minimum risk, a likelihood ratio

L_1(X) = P(X|λ0) / P(X|λ)     (6.4)

is used. This ratio produces a relative score which is less volatile to non-speaker

utterance variations [Reynolds 1995a]. In the log domain, (6.4) is equivalent to the

following normalisation technique, proposed by Higgins et al., 1991

L1(X) = log P (X|λ0) − log P (X|λ) (6.5)

The term log P (X|λ) in (6.5) is called the normalisation term and requires calculation

of all impostors’ likelihood functions. An approximation of this method is to use only the closest impostor model for calculating the normalisation term [Liu et al. 1996]

L_2(X) = log P(X|λ0) − max_{λ ≠ λ0} log P(X|λ)     (6.6)

However when the size of the population increases, both of these normalisation meth-

ods L1(X) and L2(X) are unrealistic since all impostors’ likelihood functions must

be calculated for determining the value of the normalisation term. Therefore a sub-

set of the impostor models is used. This subset consists of B “background” speaker

models λi, i = 1, . . . , B and is representative of the population close to the claimed

speaker, i.e. the “cohort speaker” set [Rosenberg et al. 1992]. Depending on the ap-

proximation of P (X|λ) in (6.4) by the likelihood functions of the background model

set P (X|λi), i = 1, . . . , B, we obtain different normalisation methods. An approxi-

mation [Reynolds 1995a] has been applied that is the arithmetic mean (average) of


the likelihood functions of B background speaker models. The corresponding score

for this approximation is

L_3(X) = log P(X|λ0) − log{ (1/B) Σ_{i=1}^{B} P(X|λ_i) }     (6.7)

If the claimed speaker’s likelihood function is also included in the above arithmetic

mean, we obtain the normalisation method based on the a posteriori probability

[Matsui and Furui 1993]

L_4(X) = log P(X|λ0) − log Σ_{i=0}^{B} P(X|λ_i)     (6.8)

Note that i = 0 in (6.8) denotes the claimed speaker model and the constant term

1/B is accounted for in the decision threshold. If the geometric mean is used instead

of the arithmetic mean to approximate P (X|λ), we obtain the normalisation method

[Liu et al. 1996] as follows

L_5(X) = log P(X|λ0) − (1/B) Σ_{i=1}^{B} log P(X|λ_i)     (6.9)

Normalisation methods can also be applied to the likelihood function of each vector

xt, t = 1, . . . , T in X, and such methods are called frame level normalisation methods.

Such a method has been proposed as follows [Markov and Nakagawa 1998b]

L_6(X) = Σ_{t=1}^{T} [ log P(x_t|λ0) − log Σ_{i=1}^{B} P(x_t|λ_i) ]     (6.10)

For VQ-based speaker verification systems, the following score is widely used

D_1(X) = D(X, λ0) − (1/B) Σ_{i=1}^{B} D(X, λ_i)     (6.11)

where

D(X, λ_i) = Σ_{t=1}^{T} d^(i)_kt     (6.12)

where d^(i)_kt is the VQ distance between vector x_t ∈ X and the nearest codevector k in the codebook λ_i. It can be seen that this score is equivalent to L_5(X) if we replace D(X, λ_i) in (6.11) by [− log P(X|λ_i)], i = 0, 1, . . . , B.
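To make the relationship between these scores explicit, the sketch below computes L2(X), L3(X) and L5(X) from a claimed speaker's log-likelihood and the background speakers' log-likelihoods, together with the VQ score D1(X). It is a minimal Python illustration; the function and argument names are placeholders, and working in the log domain throughout is an implementation choice rather than part of the definitions.

    import numpy as np

    def normalisation_scores(loglik_claimed, logliks_background,
                             dist_claimed=None, dists_background=None):
        """Current normalisation methods of Section 6.2, in the log domain.

        loglik_claimed     : log P(X | lambda_0)
        logliks_background : array of log P(X | lambda_i), i = 1..B
        dist_claimed, dists_background : optional VQ distortions for D1(X)
        """
        lb = np.asarray(logliks_background, dtype=float)
        scores = {
            # L2 (6.6): closest background model as the normalisation term
            "L2": loglik_claimed - lb.max(),
            # L3 (6.7): arithmetic mean of background likelihoods, evaluated stably via log-sum-exp
            "L3": loglik_claimed - (np.logaddexp.reduce(lb) - np.log(len(lb))),
            # L5 (6.9): geometric mean of background likelihoods
            "L5": loglik_claimed - lb.mean(),
        }
        if dist_claimed is not None:
            # D1 (6.11): VQ-distortion counterpart of L5
            scores["D1"] = dist_claimed - np.mean(dists_background)
        return scores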


6.3 Proposed Normalisation Methods

Consider the speaker verification problem in fuzzy set theory. To accept or reject

the claimed speaker, the task is to make a decision whether the input utterance X is

either from the claimed speaker λ0 or from the set of impostors λ, based on comparing

the score for X and a decision threshold θ. Thus the space of input utterances can

be considered as consisting of two fuzzy sets: C for the claimed speaker and I for

impostors. Degrees of belonging of X to these fuzzy sets are denoted by the fuzzy

membership functions of X, where the fuzzy membership of X in C can be regarded

as a claimed speaker’s score satisfying the rule in (6.1). Making a (hard) decision is

thus a defuzzification process where X is completely in C if the fuzzy membership of

X in C is sufficiently high, i.e. greater than the threshold θ.

In theory, there are many ways to define the fuzzy membership function, therefore

it can be said that this fuzzy approach proposes more general scores than the current

likelihood ratio scores for speaker verification. These are termed fuzzy membership

scores, which can denote the belonging of X to the claimed speaker. Their values

need not be scaled into the interval [0, 1] because of the above-mentioned equivalence

of scores. Based on this discussion, all of the above-mentioned likelihood-based scores

can also be viewed as fuzzy membership scores.

The next task is to find effective fuzzy membership scores. To do this, we need

to know what the shortcoming of likelihood ratio scores is. The main problem comes

from the relative nature of a ratio. Indeed, assuming L3(X) in (6.7) is used, consider

the two equal likelihood ratios in the following example

L_3(X_1) = 0.07 / 0.03 = L_3(X_2) = 0.0000007 / 0.0000003     (6.13)

where both X_1 and X_2 are accepted if the threshold is assumed to be 2. The first ratio

can lead to a correct decision that the input utterance X1 is from the claimed speaker

(true acceptance). However it is improbable that X2 is from the claimed speaker or

from any of background speakers since both likelihood values in the second ratio are

very low. X2 is probably from an impostor and thus a false acceptance can occur on

the basis of the likelihood ratio. This is a similar problem to that addressed by Chen

et al. [1994].

Two fuzzy membership scores proposed in previous chapters can overcome this


Utterance    P(X|λ0)         P(X|λ1)         P(X|λ2)         P(X|λ3)
X^c_1        3.02 × 10^−4    2.23 × 10^−5    4.16 × 10^−5    1.52 × 10^−6
X^c_2        1.11 × 10^−3    1.4 × 10^−4     1.42 × 10^−4    2.03 × 10^−5
X^i_3        7.91 × 10^−6    1.17 × 10^−7    1.24 × 10^−6    7.23 × 10^−8
X^i_4        1.2 × 10^−4     7.16 × 10^−6    6.64 × 10^−8    6.51 × 10^−8

Table 6.1: The likelihood values for 4 input utterances X_1 − X_4 against the claimed speaker λ0 and 3 impostors λ1 − λ3, where X^c_1, X^c_2 are from the claimed speaker and X^i_3, X^i_4 are from impostors

problem. The fuzzy entropy (FE) score L7(X) and the fuzzy C-means (FCM) score

L8(X) are rewritten as follows

L_7(X) = [P(X|λ0)]^{1/n} / Σ_{i=0}^{B} [P(X|λ_i)]^{1/n}     (6.14)

L_8(X) = [− log P(X|λ0)]^{1/(1−m)} / Σ_{i=0}^{B} [− log P(X|λ_i)]^{1/(1−m)}     (6.15)

where m > 1, n > 0, and the impostors’ fuzzy set I is approximately represented by

B background speakers’ fuzzy subsets. Note that in the log domain, as n = 1, the

score L7(X) after taking the logarithm reduces to the score based on the a posteriori

probability L4(X) in (6.8). Experimental results in the next chapter show low EERs

for these effective scores.
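A minimal Python sketch of these two membership scores is given below. It takes natural-log likelihoods as input, with index 0 for the claimed speaker and indices 1..B for the background speakers; the function names are placeholders, and computing (6.14) in the log domain is an implementation choice for numerical stability, not part of the definition.

    import numpy as np

    def fe_score(logliks, n=1.0):
        """FE membership score (6.14), computed in the log domain."""
        logliks = np.asarray(logliks, dtype=float)
        log_num = logliks[0] / n
        log_den = np.logaddexp.reduce(logliks / n)
        return np.exp(log_num - log_den)

    def fcm_score(logliks, m=2.0):
        """FCM membership score (6.15): ratio of (-log P)^(1/(1-m)) terms.

        Assumes all likelihoods are below 1 so that -log P > 0.
        """
        d2 = -np.asarray(logliks, dtype=float)     # distances d^2 = -log P(X | lambda_i)
        f = d2 ** (1.0 / (1.0 - m))
        return f[0] / f.sum()

With m = 2 and the natural-log likelihoods of X^c_1 from Table 6.1, fcm_score returns approximately 0.316, which agrees with the first entry of Table 6.2.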

To illustrate the effectiveness of these scores, for simplicity, we compare L3(X)

with L8(X) in a numerical example. Table 6.1 presents the likelihood values for 4

input utterances X1 − X4 against the claimed speaker λ0 and 3 impostors λ1 − λ3,

where X^c_1, X^c_2 are from the claimed speaker and X^i_3, X^i_4 are from impostors (these are real values from experiments on the TI46 database). Given a score L(X), the EER = 0 in the case that all the scores for X^c_1 and X^c_2 are greater than all those for X^i_3 and X^i_4.

Table 6.2 shows the scores in (6.7) and (6.15) computed using these likelihood values, where m = 2 is applied to (6.15). It can be seen that with the score L_3(X), we always have the EER ≠ 0 since the scores for X^i_3 and X^i_4 are higher than those for X^c_1 and X^c_2.


Score     X^c_1   X^c_2   X^i_3   X^i_4
L3(X)     2.628   2.399   2.810   3.899
L8(X)     0.316   0.316   0.302   0.350

Table 6.2: Scores of 4 utterances using L3(X) and L8(X)

However, using L_8(X), the EER is reduced since the score for X^i_3 is lower than those for X^c_1 and X^c_2.

A more robust method proposed in this chapter is to use fuzzy membership scores

based on the noise clustering (NC) method. Indeed, this fuzzy approach can reduce

the false acceptance error by forcing the membership value of the input utterance X

to become as small as possible if X is really from impostors, not from the claimed

speaker or background speakers. This fuzzy approach is simple but very effective: a suitable constant value ε > 0 (similar to the constant distance δ in the NC method) is simply added to the denominators of the ratios, i.e. to the normalisation terms, as follows

Change Σ . . . in the normalisation term to Σ . . . + ε     (6.16)

Note that the NC method can be applied not only to fuzzy membership scores but also to likelihood ratio scores. For example, NC-based versions of L3(X), L7(X) and L8(X)

are as follows

L_3nc(X) = log P(X|λ0) − log[ (1/B) Σ_{i=1}^{B} P(X|λ_i) + ε_3 ]     (6.17)

L_7nc(X) = [P(X|λ0)]^{1/n} / ( Σ_{i=0}^{B} [P(X|λ_i)]^{1/n} + ε_7 )     (6.18)

L_8nc(X) = [− log P(X|λ0)]^{1/(1−m)} / ( Σ_{i=0}^{B} [− log P(X|λ_i)]^{1/(1−m)} + ε_8 )     (6.19)

where the index “nc” means “noise clustering”. For illustration, applying the NC-based score to the first example in (6.13) with ε_3 = 0.01 gives

L_3nc(X_1) = 0.07 / (0.03 + 0.01) = 1.75  >  L_3nc(X_2) = 0.0000007 / (0.0000003 + 0.01) = 0.000069     (6.20)

For the second example, with ε_3 = 10^−4 and ε_8 = 0.5, the NC-based scores are shown in Table 6.3. We can see that with thresholds θ_3 = 1.0 and θ_8 = 0.137, the EER is 0 for both of these NC-based scores.

Score      X^c_1   X^c_2   X^i_3    X^i_4
L3nc(X)    2.251   1.710   -2.547   0.158
L8nc(X)    0.139   0.152   0.109    0.136

Table 6.3: Scores of 4 utterances using L3nc(X) and L8nc(X)
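The NC modification of the FCM score (6.19) amounts to a one-line change in the denominator, as the following sketch (a variant of the illustrative fcm_score above, with the same assumptions about natural-log likelihood inputs) shows.

    import numpy as np

    def fcm_nc_score(logliks, m=2.0, eps=0.5):
        """NC-based FCM score (6.19): the constant eps acts like the noise-cluster
        distance and pulls the score down when all likelihoods are small."""
        d2 = -np.asarray(logliks, dtype=float)
        f = d2 ** (1.0 / (1.0 - m))
        return f[0] / (f.sum() + eps)

With ε_8 = 0.5 and the likelihoods of X^c_1 from Table 6.1, this returns approximately 0.139, in agreement with Table 6.3.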

We have illustrated the effectiveness of fuzzy membership scores by way of some

numerical examples. To be more rigorous, this approach should be considered in

theoretical terms. Let us take into account expressions of likelihood ratio scores and

fuzzy membership scores. The former is the ratio of likelihood functions whereas the

latter is the ratio of functions of likelihood functions. Indeed, denoting P as the

likelihood function, we can see that the FE score employs the function f(P) = P^{1/n} and the FCM score employs f(P) = (− log P)^{1/(1−m)}. In other words, likelihood ratio scores are transformed to FE and FCM scores by using these functions. Such a

transformation is considered in the next section.

6.4 The Likelihood Transformation

Consider a transformation T : P → T (P ), where P is the likelihood function and

T (P ) is a certain continuous function of P . For example, T [P (X|λ0)] = log P (X|λ0).

Applying this transformation to the likelihood ratio score L1(X) gives an alternative

score S(X)

S(X) = T[P(X|λ0)] / T[P(X|λ)]     (6.21)


The difference between S(X) and L1(X) is

S(X) − L_1(X) = [ P(X|λ0) / T[P(X|λ)] ] · [ T[P(X|λ0)] / P(X|λ0) − T[P(X|λ)] / P(X|λ) ]     (6.22)

Assume that T(P)/P in (6.22) is an increasing function, i.e. if P_1/P_2 > 1, or P_1 > P_2, then T(P_1)/P_1 > T(P_2)/P_2, and that T(P) is negative for 0 ≤ P ≤ 1. The following two cases can occur:

• P(X|λ0)/P(X|λ) > 1: the expression on the right-hand side of (6.22) is negative, thus

T[P(X|λ0)] / T[P(X|λ)] < P(X|λ0) / P(X|λ)  ⇔  S(X) < L_1(X)     (6.23)

• P(X|λ0)/P(X|λ) ≤ 1: the expression on the right-hand side of (6.22) is non-negative, thus

T[P(X|λ0)] / T[P(X|λ)] ≥ P(X|λ0) / P(X|λ)  ⇔  S(X) ≥ L_1(X)     (6.24)

In the first case, the transformation moves the likelihood ratios greater than 1 to transformed likelihood ratios less than 1, and vice versa: in the second case, the ratios less than 1 are moved to transformed ratios greater than 1. Figure 6.2 illustrates this transformation. The transformation is a nonlinear mapping since the distances between different ratios and their transformations are different; for example, the distances AA’, BB’, CC’ and DD’ in Figure 6.2 are different.

Figure 6.2: The transformation T, where T(P)/P increases and T(P) is non-positive for 0 ≤ P ≤ 1: values of 4 ratios at A, B, C and D are moved to those at A’, B’, C’ and D’

If T(P) = log P, a numerical

example for these ratios is as follows

A : 0.011 / 0.009 ≈ 1.222   →   A′ : log 0.011 / log 0.009 ≈ 0.957
B : 0.0009 / 0.0008 ≈ 1.125   →   B′ : log 0.0009 / log 0.0008 ≈ 0.9834
C : 0.01 / 0.009 ≈ 1.111   →   C′ : log 0.01 / log 0.009 ≈ 0.978
D : 0.044 / 0.051 ≈ 0.863   →   D′ : log 0.044 / log 0.051 ≈ 1.049     (6.25)

Comparisons of the order of the 4 points A, B, C, D with the order of A’, B’, C’, D’, and of the likelihood values at point B (0.0009 and 0.0008) with the other values, show

that this transformation T can “recognise” ratios of small likelihood values and thus

leads to a reduction of the false acceptance error rate (and the EER also) as shown

in the previous section’s examples. Figure 6.3 shows histograms of a female speaker

labelled f7 in the TI46 corpus using 16-mixture GMMs, where the score L_3(X) = log[ P(X|λ0) / { (1/B) Σ_{i=1}^{B} P(X|λ_i) } ] in Figure 6.3a is transformed by T(P) = log P to the score D_3(X) = log P(X|λ0) / log{ (1/B) Σ_{i=1}^{B} P(X|λ_i) } in Figure 6.3b.

Figure 6.3: Histograms of speaker f7 in the TI46 using 16-mixture GMMs. The EER

is 6.67% for Fig. 6.3a and is 5.90% for Fig. 6.3b.

The cases where T(P)/P is a decreasing function or T(P) is a positive function can be analysed similarly. In general, there may not exist a function T(P) for


the transformation T such that the EER is 0. However, a function T (P ) for reduc-

ing the EER may exist. For convenience in calculating products of probabilities, the

function T (P ) should be related to the logarithm function. This is also convenient

for applying these methods to VQ-based speaker verification since the distance in VQ

can be defined as the negative logarithm of the corresponding likelihood function.

Based on this transformation and to compare with current likelihood ratio scores,

we also propose more fuzzy scores for the decision rule in (6.2) as follows

D_2(X) = log P(X|λ0) / max_{λ ≠ λ0} log P(X|λ)     (6.26)

D_3(X) = log P(X|λ0) / log{ (1/B) Σ_{i=1}^{B} P(X|λ_i) }     (6.27)

D_4(X) = log P(X|λ0) / log{ (1/B) Σ_{i=0}^{B} P(X|λ_i) }     (6.28)

D_5(X) = log P(X|λ0) / [ (1/B) Σ_{i=1}^{B} log P(X|λ_i) ]     (6.29)

D_6(X) = arctan[log P(X|λ0)] / [ (1/B) Σ_{i=1}^{B} arctan[log P(X|λ_i)] ]     (6.30)

where T[P(X|λ)] of the impostors is approximated by the transformed likelihood functions of the background speakers in D_2(X), . . . , D_5(X), in the same manner as was applied to L_2(X), . . . , L_5(X). Note that the factor 1/B in D_4(X) is not accounted for in the decision threshold as it is for L_4(X) in (6.8). The effectiveness of the score in (6.30)

using the arctan function will be shown in the next chapter. It should again be noted

that the NC-based versions of these scores are derived following the method in (6.16).

To apply these scores to VQ-based speaker verification systems, likelihood func-

tions should be changed to VQ distortions as shown for the score D1(X) in (6.11)

and (6.12).
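For completeness, the geometric-mean-style score D5(X) in (6.29) and the arctan-based score in (6.30) can be sketched as follows. This is an illustrative Python fragment operating on natural-log likelihoods, with placeholder function names; it is not the implementation used for the experiments.

    import numpy as np

    def d5_score(loglik_claimed, logliks_background):
        """D5 (6.29): ratio of the claimed log-likelihood to the mean background log-likelihood."""
        return loglik_claimed / np.mean(logliks_background)

    def d6_score(loglik_claimed, logliks_background):
        """D6 (6.30): the same ratio after the transformation T(P) = arctan(log P)."""
        return np.arctan(loglik_claimed) / np.mean(np.arctan(logliks_background))

Both are dissimilarity scores, so the decision rule (6.2) applies: the claim is accepted when the score falls below the threshold.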


6.5 Summary and Conclusion

A fuzzy approach to speaker verification has been proposed in this chapter. Using

fuzzy set theory, fuzzy membership scores are proposed as scores more general than

current likelihood scores. The likelihood transformation and 7 new scores have been

proposed. This fuzzy approach also leads to a noise clustering-based version for all

scores, which improves speaker verification performance markedly. Using the arctan

function in computing the score illustrates a theoretical extension for normalisation

methods, where not only the logarithm function but also other functions can be used.


Chapter 7

Evaluation Experiments and Results

This chapter presents experiments performed to evaluate proposed models for speech and

speaker recognition as well as proposed normalisation methods for speaker verification.

The three speech corpora used in the evaluation experiments are TI46, ANDOSL and YOHO. Isolated word recognition experiments were performed on the word sets—

E set, 10-digit set, 10-command set and 46-word set—taken from the TI46 corpus using

conventional, fuzzy and hard hidden Markov models. Speaker identification and verifica-

tion experiments were performed on 16 speakers of TI46, 108 speakers of ANDOSL and

138 speakers of YOHO using conventional, fuzzy and hard Gaussian mixture models as

well as vector quantisation. Experiments demonstrate that fuzzy models and their noise

clustering versions outperform conventional models. Hard hidden Markov models also obtained good results.

7.1 Database Description

7.1.1 The TI46 Database

The TI46 corpus was designed and collected at Texas Instruments (TI). The speech

was produced by 16 speakers, 8 females and 8 males, labelled f1-f8 and m1-m8 respec-

tively, consisting of two vocabularies—TI-20 and TI-alphabet. The TI-20 vocabulary

contains the ten digits from 0 to 9 and ten command words: enter, erase, go, help,


no, rubout, repeat, stop, start, and yes. The TI-alphabet vocabulary contains the

names of the 26 letters of the alphabet from a to z. For each vocabulary item, each

speaker produced 10 tokens in a single training session and another two tokens in each

of 8 testing sessions. The words in the TI-20 vocabulary are highly discriminable,

with the majority of confusions occurring between go and no. By comparison, the

TI-alphabet is a much more difficult vocabulary since it contains several confusable

subsets of letters, such as the E-set {b, c, d, e, g, p, t, v, z} and the A-set {a, j, k}. The

TI-20 vocabulary is a good choice because it has been used for other tests and there-

fore can serve as a standard benchmark of the performance [Syrdal et al. 1995]. The

corpus was sampled at 12500 samples per second and 12 bits per sample.

7.1.2 The ANDOSL Database

The Australian National Database of Spoken Language (ANDOSL) [Millar et al. 1994]

corpus comprises carefully balanced material for Australian speakers, both Australian-

born and overseas-born migrants. The aim was to represent as many significant

speaker groups within the Australian population as possible. Current holdings are

divided into those from native speakers of Australian English (born and fully edu-

cated in Australia) and those from non-native speakers of Australian English (first

generation migrants having a non-English native language). A subset used for speaker

recognition experiments in this thesis consists of 108 native speakers, divided into 36

speakers of General Australian English, 36 speakers of Broad Australian English, and

36 speakers of Cultivated Australian English comprising 6 speakers of each gender

in each of three age ranges (18-30, 31-45 and 46+). So there is a total of 18 groups of 6 speakers labelled “ijk”, where i denotes f (female) or m (male), j denotes y

(young) or m (medium) or e (elder), and k denotes g (general) or b (broad) or c

(cultivated). For example, the group fyg contains 6 female young general Australian

English speakers. Each speaker contributed 200 phonetically rich sentences in a single session. The average duration of each sentence is approximately 4 seconds. The

speech was recorded in an anechoic chamber using a B&K 4155 microphone and a

DSC-2230 VU-meter used as a preamplifier, and was digitised directly to computer disk

using a DSP32C analog-to-digital converter mounted in a PC. All waveforms were

sampled at 20 kHz and 16 bits per sample. For the processing as telephone speech,


all waveforms were converted from 20 kHz to 8 kHz bandwidth. The high-pass and low-pass cut-offs were set to 300 Hz and 3400 Hz, respectively.

7.1.3 The YOHO Database

The YOHO corpus was collected by ITT under a US government contract and was

designed for speaker verification systems in office environments with limited vocab-

ulary. There are 138 speakers, 108 males and 30 females. The vocabulary consists

of 56 two-digit numbers ranging from 21 to 97 pronounced as “twenty-one”, “ninety-

seven”, and spoken continuously in sets of three, for example “36-45-89”, in each

utterance. There are four enrolment sessions per speaker, numbered 1 through 4,

and each session contains 24 utterances. There are also ten verification sessions,

numbered 1 through 10, and each session contains 4 utterances. All waveforms are

low-pass filtered at 3.8 kHz and sampled at 8 kHz.

7.2 Speech Processing

For the TI46 corpus, the data were processed in 20.48 ms frames (256 samples) at

a frame rate of 10 ms. Frames were Hamming windowed and preemphasised with

m = 0.9. For each frame, 46 mel-spectral bands of a width of 110 mel and 20 mel-

frequency cepstral coefficients (MFCC) were determined [Wagner 1996] and a feature

vector with a dimension of 20 was generated for individual frames.

For the ANDOSL and YOHO corpora, speech processing was performed using

HTK V2.0 [Woodland 1997], a toolkit for building HMMs. The data were processed

in 32 ms frames at a frame rate of 10 ms. Frames were Hamming windowed and

preemphasised with m = 0.97. The basic feature set consisted of 12th-order MFCCs

and the normalised short-time energy, augmented by the corresponding delta MFCCs

to form a final feature vector of dimension 26 for each frame.
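The framing and pre-emphasis steps described above can be sketched as follows. This is a minimal NumPy illustration with the ANDOSL/YOHO parameter values (32 ms frames, 10 ms shift, pre-emphasis coefficient 0.97); the function name is a placeholder, and the mel filterbank and cepstral steps, which were handled by the tools cited above, are omitted.

    import numpy as np

    def frame_signal(x, sample_rate=8000, frame_ms=32.0, shift_ms=10.0, preemph=0.97):
        """Split a waveform into pre-emphasised, Hamming-windowed frames.

        x : 1-D float array holding the sampled waveform.
        Returns an array of shape (num_frames, frame_length).
        """
        x = np.append(x[0], x[1:] - preemph * x[:-1])     # pre-emphasis y[n] = x[n] - 0.97 x[n-1]
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        window = np.hamming(frame_len)
        starts = range(0, len(x) - frame_len + 1, shift)
        return np.array([x[s:s + frame_len] * window for s in starts])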


7.3 Algorithmic Issues

7.3.1 Initialisation

It was shown in the literature that no significant difference in isolated-word recog-

nition and speaker identification was found by using different initialisation methods

[Rabiner et al. 1983, Reynolds 1995b]. HMMs, GMMs and VQ were therefore initialised as follows:

• HMMs and fuzzy HMMs: The widely used HMMs for training are left-to-right HMMs as defined in (2.14), i.e. state sequences begin in state 1 and end in state N. The discrete HMM parameter set in our experiments was therefore initialised as follows (a small sketch of this initialisation is given after this list):

π_1 = 1,   π_i = 0 for 2 ≤ i ≤ N
a_ij = 0 for 1 ≤ i ≤ N, j < i or j > i + 1
a_ij = 0.5 for 1 ≤ i ≤ N, i ≤ j ≤ i + 1
b_j(k) = 1/K for 1 ≤ j ≤ N, 1 ≤ k ≤ K     (7.1)

Fuzzy membership functions were initialised with essentially random choices.

Gaussian parameters in continuous HMMs were initialised as those in GMMs.

• GMMs and fuzzy GMMs: mixture weights, mean vectors, covariance matrices

and fuzzy membership functions were initialised with essentially random choices.

Covariance matrices are diagonal, i.e. [Σ_k]_ii = σ²_k and [Σ_k]_ij = 0 if i ≠ j, where σ²_k, 1 ≤ k ≤ K, are variances.

• VQ and fuzzy VQ: only fuzzy membership functions in fuzzy VQ models were

randomly initialised. VQ models were trained by the LBG algorithm—a widely

used version of the K-means algorithm [Linde et al. 1980], in which, starting from a codebook size of 1, a binary split procedure is performed to double the codebook size at the end of every few iterations.
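The following NumPy fragment is a small sketch of the left-to-right initialisation in (7.1). The handling of the final state, which (7.1) leaves implicit, is normalised here to a self-loop of probability 1; this detail and the function name are assumptions for illustration.

    import numpy as np

    def init_left_to_right_dhmm(N=6, K=16):
        """Initial DHMM parameters following (7.1): Bakis (left-to-right) topology."""
        pi = np.zeros(N)
        pi[0] = 1.0                          # state sequences start in state 1
        A = np.zeros((N, N))
        for i in range(N):
            A[i, i] = 0.5                    # self-loop
            if i + 1 < N:
                A[i, i + 1] = 0.5            # transition to the next state
            else:
                A[i, i] = 1.0                # final state absorbs the remaining mass
        B = np.full((N, K), 1.0 / K)         # uniform emission probabilities
        return pi, A, B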

7.3.2 Constraints on parameters during training

• HMMs and fuzzy HMMs: If the B matrix is left completely unconstrained, a finite training sequence may result in b_j(k) = 0 for some j and k; the probability of any sequence containing that event is then equal to 0, and hence a recognition error must occur. This problem was handled by using post-estimation constraints on the b_j(k)'s of the form b_j(k) ≥ ε, where ε is a suitably chosen threshold value [Rabiner et al. 1983].

The number of states N = 6 was chosen for discrete HMMs based on an experiment considering the recognition error for different numbers of states, as shown in Figure 7.1.

Figure 7.1: Isolated word recognition error (%) versus the number of states N for the digit-set vocabulary, using left-to-right DHMMs, codebook size K = 16, TI46

database

• GMMs and fuzzy GMMs: Similarly, a variance limiting constraint was applied to all GMMs using diagonal covariance matrices. This constraint places a minimum variance value σ²_min on the elements of all variance vectors in the GMM, that is, σ²_i = σ²_min if σ²_i ≤ σ²_min [Reynolds 1995b]. In our experiments, ε = 10^−5 and σ²_min = 10^−2. The chosen numbers of mixtures were 16, 32, 64, and 128, to allow comparison with VQ models using the LBG algorithm.

• VQ and fuzzy VQ: Codebook sizes of 16, 32, 64, and 128 were chosen for all

experiments.

• Fuzzy parameters: Choosing the appropriate values of the degree of fuzziness

m and the degree of fuzzy entropy n was based on considering recognition error

for each database and each model. For example, in speaker identification using


FCM-VQ models and the TI46 database, an experiment was carried out to examine the identification error for different values of the degree of fuzziness m, with the codebook size fixed. Results are shown in Figure 7.2.

Figure 7.2: Speaker identification error (%) versus the degree of fuzziness m using FCM-VQ speaker models, codebook size K = 16, TI46 corpus

Therefore the value m = 1.2 was chosen for FCM-VQ models using the TI46 database.

In general, our experiments showed that the degree of fuzziness m depends on

models and weakly on speech databases. Suitable values were m = 1.2 for

FCM-VQ, m = 1.05 for FCM-GMMs and m = 1.2 for FCM-HMMs.

Similarly, suitable values for the degree of fuzzy entropy n were n = 1 for FE-

VQ, n = 1.2 for FE-GMMs and n = 2.5 for FE-HMMs. Figure 7.3 shows the word recognition error versus the degree of fuzzy entropy n.

7.4 Isolated Word Recognition

The first step in building an isolated word recognition system is to train an HMM for each word. The speech signals for training HMMs were converted into

feature vector sequences as in (2.1) using the LPC analysis mentioned in Section

2.1.3. For training discrete models, these vector sequences were used to generate a

VQ codebook by means of the LBG algorithm and were encoded into observation


Figure 7.3: Isolated word recognition error (%) versus the degree of fuzzy entropy n

for the E-set vocabulary, using 6-state left-to-right FE-DHMMs, codebook size of 16,

TI46 corpus

sequences by this VQ codebook. Observation sequences were used to train conven-

tional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs

and H-DHMMs following reestimation algorithms in Section 2.3.4 on page 26, Section

3.4.2 on page 54, Section 4.2.1 on page 67 and Section 5.2.1 on page 80. For train-

ing continuous models, vector sequences were directly used as observation sequences

to train conventional CHMMs, FE-CHMMs, NC-FE-CHMMs, FCM-CHMMs, NC-

FCM-CHMMs and H-CHMMs following reestimation algorithms in Section 2.3.4 on

page 26, Section 3.4.3 on page 57, Section 4.2.2 on page 68 and Section 5.2.2 on page

80. The second step is recognising unknown words. Assuming that we have a vocabulary of M words to be recognised and that an HMM was trained for each word, the speech signal of an unknown word is converted into an observation sequence O

using the above codebook. The probabilities P (O|λi), i = 1, . . . ,M were calculated

using (2.26) and the recognised word is the word whose probability is highest. The

three data sets of the TI46 corpus used for training discrete models are the E set,

the 10-digit set, and the 10-command set. The whole 46 words were used to train

continuous models.


7.4.1 E set Results

Table 7.1 presents the experimental results for the recognition of the E set using 6-

state left-to-right HMMs in speaker-dependent mode. In the training phase, 10 train-

ing tokens (1 training session x 10 repetitions) of each word were used to train conven-

tional DHMMs, FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs

and H-DHMMs using VQ codebook sizes of 16, 32, 64, and 128. In the recogni-

tion phase, isolated word recognition was carried out by testing all 160 test tokens

(10 utterances x 8 testing sessions x 2 repetitions) against conventional DHMMs,

FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, NC-FCM-DHMMs and H-DHMMs of

each of 16 speakers. For continuous models, 736 test tokens (46 utterances x 8 testing

sessions x 2 repetitions) were tested against conventional CHMMs, FE-CHMMs and

FCM-CHMMs.

Codebook Conventional FE NC-FE FCM NC-FCM Hard

Size DHMM DHMM DHMM DHMM DHMM DHMM

16 54.54 41.51 41.48 51.97 42.74 44.51

32 39.41 34.38 34.36 37.46 33.46 36.72

64 33.84 29.92 29.88 30.54 27.87 30.89

128 33.98 31.28 31.25 32.27 31.85 33.07

Table 7.1: Isolated word recognition error rates (%) for the E set

In general, the errors differ widely for the codebook size of 16, where the highest error is for the conventional models. This can be interpreted as the information in the training data being lost by the use of a small codebook, and therefore being insufficient for training conventional DHMMs. As discussed before, with a suitable degree of fuzzy entropy or fuzziness, fuzzy models can reduce the errors due to this problem. The average recognition error reductions by FE-DHMMs, NC-FE-DHMMs, FCM-DHMMs, and NC-FCM-DHMMs in comparison with conventional DHMMs are (54.54 − 41.51)% = 13.03%, (54.54 − 41.48)% = 13.06%, (54.54 − 51.97)% = 2.57%, and (54.54 − 42.74)% = 11.80%, respectively. As the codebook size increases, the

errors for all models are reduced as shown with the codebook sizes of 32 and 64. The

error difference between these models also decreases as the codebook size increases


since the information obtained from the training data is sufficient for conventional

DHMMs. However, the number of HMM parameters is proportional to the codebook

size because of the matrix B, therefore with the large codebook size, e.g. 128, the

training data is now insufficient for these HMMs and hence the errors for the codebook

size of 128 become larger than those for the codebook size of 64. The results of the hard DHMMs also deserve attention: they are surprisingly good, since hard models often perform worse than other models.

The results also show that FE models performed better than FCM models for

the E set. However, for their noise clustering versions, NC-FE-DHMMs are only slightly better than FE-DHMMs, whereas NC-FCM-DHMMs achieved a marked improvement over FCM-DHMMs. The lowest recognition error, 27.87%, was obtained by NC-FCM-DHMMs using a VQ codebook size of 64. This can be interpreted as follows: for long distances, the exponential function in FE memberships decreases faster than the inverse function in FCM memberships (see Figure 4.5 on page

74), so FCM models are more sensitive to outliers than FE models.

The experiments were in speaker-dependent mode, so recognition performance should also be examined per speaker. Table 7.2 shows these speaker-dependent recognition results using the codebook size of 16. Recognition error rates were sig-

nificantly reduced by using fuzzy models for female speakers labelled f1, f3, f7, and

male speakers labelled m3, m6, m7, m8. Hard models were also more effective than

conventional models for speakers labelled f3, f6, f7, m3, m7 and m8, but for speakers

labelled f2, m1 and m5, they were worse than conventional models. In general, the results of the noise clustering-based fuzzy models depend less on the individual speaker.

7.4.2 10-Digit and 10-Command Set Results

Tables 7.3 and 7.4 present the experimental results for the recognition of the 10-digit set and the 10-command set using 6-state left-to-right models in speaker-dependent mode. Unlike the E set, these sets do not contain a highly confusable vocabulary, therefore the recognition error rates are low for all models. In these two experiments the results for FCM-DHMMs and NC-FCM-DHMMs, and for FE-DHMMs and NC-FE-DHMMs, are very similar. This is probably due

to the fact that clusters in the 10-digit and 10-command sets are better separated


than those in the E set.

7.4.3 46-Word Set Results

The whole vocabulary of 46 words in the TI46 corpus was used to train continuous

models. Based on the results for the 10-digit set and the 10-command set, recognition

performance on the 46-word set using noise-clustering-based fuzzy models is possibly

not much improved in comparison with the fuzzy models. So in this experiment, only 3-state 2-mixture and 5-state 2-mixture models, comprising conventional CHMMs, FE-CHMMs and FCM-CHMMs, were trained for each word. Table 7.5 presents the recognition performance of these models.

Conventional FE NC-FE FCM NC-FCM Hard

Speaker DHMM DHMM DHMM DHMM DHMM DHMM

f1 68.06 54.17 54.11 62.76 43.75 65.28

f2 43.75 47.22 47.15 47.25 47.22 44.44

f3 63.19 43.75 43.75 59.88 54.86 51.39

f4 50.00 31.25 31.24 48.35 39.58 34.72

f5 55.56 54.86 54.81 54.83 54.17 53.47

f6 54.17 29.86 29.86 47.97 28.47 33.33

f7 69.44 44.44 44.41 59.23 40.97 43.06

f8 49.31 31.94 31.92 44.25 31.94 40.28

m1 27.78 26.39 26.39 28.89 28.47 36.11

m2 52.08 29.17 29.15 49.86 38.19 31.25

m3 62.50 44.44 44.42 57.68 49.31 49.31

m4 44.29 37.14 37.05 47.67 46.43 35.71

m5 33.33 39.01 38.89 33.21 31.21 39.01

m6 63.57 50.71 50.71 61.59 52.14 48.57

m7 63.38 40.14 40.12 59.45 45.77 43.66

m8 72.22 59.72 59.69 68.64 51.39 62.50

Female 56.69 42.19 42.16 53.07 42.62 45.75

Male 52.39 40.84 40.80 50.87 42.86 43.27

Average 54.54 41.51 41.48 51.97 42.74 44.51

Table 7.2: Speaker-dependent recognition error rates (%) for the E set


Codebook Conventional FE NC-FE FCM NC-FCM Hard

Size DHMM DHMM DHMM DHMM DHMM DHMM

16 6.21 4.84 4.32 5.83 4.74 5.03

32 2.25 1.81 1.72 2.16 2.06 1.81

64 0.43 0.37 0.34 0.39 0.38 0.43

128 0.39 0.35 0.35 0.38 0.38 0.43

Table 7.3: Isolated word recognition error rates (%) for the 10-digit set

Codebook Conventional FE NC-FE FCM NC-FCM Hard

Size DHMM DHMM DHMM DHMM DHMM DHMM

16 15.74 6.45 6.32 13.78 13.74 9.07

32 4.36 3.64 3.52 3.88 3.76 3.50

64 2.43 2.27 2.24 2.28 2.28 2.18

128 1.65 1.45 1.43 1.60 1.58 1.73

Table 7.4: Isolated word recognition error rates (%) for the 10-command set


Model Conventional CHMM FE-CHMM FCM-CHMM

3 states 2 mixtures 9.59 9.42 8.96

5 states 2 mixtures 7.19 7.15 6.57

Table 7.5: Isolated word recognition error rates (%) for the 46-word set


In this experiment, FCM-CHMMs performed better than FE-CHMMs and conventional CHMMs. Speaker-dependent results for the conventional CHMMs and FCM-CHMMs are presented in Table 7.6. In comparison with conventional CHMMs, recognition error rates for speakers labelled f2, f3, f4, m2, m3 and m6 were improved in both cases, using 3-state 2-mixture and 5-state 2-mixture models. The only degraded result was for the speaker labelled f8.

7.5 Speaker Identification

The TI46, the ANDOSL, and the YOHO corpora were used for speaker identification.

GMM-based speaker identification was performed by using conventional GMMs, FE-

GMMs, NC-FE-GMMs, FCM-GMMs and NC-FCM-GMMs. For VQ-based speaker

identification, FE-VQ, FCM-VQ and (hard) VQ codebooks were also used. The vector

sequences obtained after the LPC analysis of speech signals were used for training

speaker models. Reestimation formulas have been presented in previous sections such

as (2.30) page 30 for GMMs, (3.42) and (3.43) page 60 for FE-GMMs, (4.15) and

(4.16) page 70 for FCM-GMMs. For GMM-based speaker identification, testing was

performed by calculating the probabilities P (X|λi), i = 1, . . . ,M for the unknown X

with M speaker models using (2.31) and the recognised speaker is the speaker whose

probability is highest. For VQ-based speaker identification, testing was performed by

calculating the average distortion D(X,λi), i = 1, . . . ,M for the unknown X with

M speaker models using (6.12) on page 88 and the recognised speaker is the speaker

whose average distortion is lowest.
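The identification decision itself is simply an arg-max over the likelihoods (for GMMs) or an arg-min over the average distortions (for VQ) across the M speaker models, as the following small sketch illustrates; the function name and arguments are placeholders.

    import numpy as np

    def identify_speaker(scores, dissimilarity=False):
        """Pick the identified speaker index from per-model scores.

        scores        : log-likelihoods log P(X | lambda_i) for GMMs,
                        or average VQ distortions D(X, lambda_i)
        dissimilarity : False -> choose the highest score (GMM),
                        True  -> choose the lowest (VQ)
        """
        scores = np.asarray(scores, dtype=float)
        return int(scores.argmin() if dissimilarity else scores.argmax())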


7.5.1 TI46 Results

The vocabulary was the 10-command set. In the training phase, 100 training tokens

(10 utterances x 1 training session x 10 repetitions) of each speaker were used to train

conventional GMMs, FCM-GMMs and NC-FCM-GMMs of 16, 32, 64, and 128 mix-

tures for GMM-based speaker identification, and VQ, FE-VQ, NC-FE-VQ codebooks

of 16, 32, 64, 128 codevectors for VQ-based speaker identification. The degrees of

fuzzy entropy and fuzziness were chosen as n = 1 and m = 1.06, respectively. Speaker

identification was carried out in text-independent mode by testing all 2560 test tokens

(16 speakers x 10 utterances x 8 testing sessions x 2 repetitions) against conventional

GMMs, FCM-GMMs and NC-FCM-GMMs of all 16 speakers for GMM-based speaker

identification, and against VQ, FE-VQ, NC-FE-VQ codebooks of all 16 speakers for

VQ-based speaker identification.

Figure 7.4: Speaker identification error rate (%) versus the number of mixtures for

16 speakers, using conventional GMMs, FCM-GMMs and NC-FCM-GMMs

The experimental results for GMM-based speaker identification are plotted in Figure 7.4. The identification error rate was improved by using 64-mixture and 128-mixture FCM-GMMs and NC-FCM-GMMs. The identification error rates of FCM-GMMs and NC-FCM-GMMs were about 2% and 3% lower than the error rate of conventional GMMs using 64 mixtures [Tran and Wagner 1998]. Figure 7.5 shows the experimental results for VQ-based speaker identification. FE-VQ and NC-FE-VQ obtained good results, similar to those of FCM-GMMs and NC-FCM-GMMs.


Figure 7.5: Speaker identification error rate (%) versus the codebook size for 16

speakers, using VQ, FE-VQ and NC-FE-VQ codebooks

7.5.2 ANDOSL Results

For the ANDOSL corpus, the subset of 108 native speakers was used. For each

speaker, the set of 200 long sentences was divided into two subsets. The training

set, consisting of the sentences numbered from 001 to 020, was used to train conventional GMMs, FE-GMMs and FCM-GMMs of 32 mixtures. Very low error rates were obtained for all models, so noise clustering-based fuzzy models were not considered in this experiment. The test set, consisting of 180 sentences numbered from 021 to 200, was used to test the above models. Speaker identification was carried out by testing all 19440 tokens (180 utterances x 108 speakers) against conventional GMMs, FE-GMMs and FCM-GMMs of all 108 speakers. The degrees of fuzzy entropy and fuzziness were set to n = 1.01 for FE-GMMs and m = 1.026 for FCM-GMMs, respectively. The experimental results are presented in Table 7.7 for the speaker groups mentioned in Section 7.1.2. Results for the ANDOSL corpus are better than

those for the TI46 corpus since a larger training data set was used. Female speakers

have higher error rates than male speakers. Elder speakers have lower error rates

than young and medium speakers. Broad speakers have the lowest error rates in

comparison with cultivated and general speakers.


7.5.3 YOHO Results

For the experiments performed on the YOHO corpus, each speaker was modelled by

using 48 training tokens in enrolment sessions 1 and 2 only. Using all four enrolment

sessions, as done for example by [Reynolds 1995a], resulted in error rates that were

too low to allow meaningful comparisons between the different normalisation meth-

ods for speaker verification. Conventional GMMs, hard GMMs and VQ codebooks

were trained in the training phase. Speaker identification was carried out in text-

independent mode by testing all 8280 test tokens (138 speakers x 10 utterances x

4 sessions) against conventional GMMs, hard GMMs and VQ codebooks of all 138

speakers. Table 7.5.3 shows identification error rates for these models. As expected,

results show that hard GMMs performed better than VQ but worse than conventional

GMMs.

7.6 Speaker Verification

As discussed in Chapter 6, speaker verification performance depends on speaker mod-

els and on the normalisation method used to verify speakers. So experiments were

performed in three cases: 1) Using proposed speaker models and current normali-

sation methods, 2) Using conventional speaker models and proposed normalisation

methods, and 3) Using proposed speaker models and proposed normalisation meth-

ods. Three speech corpora—TI46, ANDOSL, and YOHO—were used and all exper-

iments were in text-independent mode. The speaker models used for speaker verification are the same as those used for speaker identification. Verification rules in (6.1) and (6.2) on page

86 were used for similarity-based scores Li(X), i = 3, . . . , 6 and dissimilarity-based

scores Dj(X), j = 3, . . . , 9 in this chapter.

7.6.1 TI46 Results

The speaker verification experiments were performed on the TI46 corpus to evaluate

proposed speaker models. Conventional GMMs, FCM-GMMs and NC-FCM-GMMs

of 16, 32, 64, 128 mixtures and VQ, FE-VQ, NC-FE-VQ codebooks of 16, 32, 64, 128

codevectors in speaker identification were used. The normalisation method is L3(X) in (6.7) page 88 for GMM-based speaker verification, and D1(X) in (6.11) page 88 for VQ-based speaker verification. Since the TI46 corpus has a small speaker set consisting of 8 female and 8 male speakers, each speaker was used in turn as a claimed speaker, with the remaining speakers (including 7 same-gender background speakers) acting as impostors.

So the total number of claimed test utterances and impostor test utterances are 2560

(16 claimed speakers x 10 test utterances x 8 sessions x 2 repetitions) and 38400 ((16

x 15) impostors x 10 test utterances x 8 sessions x 2 repetitions), respectively.

Figure 7.6: EERs (%) for GMM-based speaker verification performed on 16 speakers,

using GMMs, FCM-GMMs and NC-FCM-GMMs.

Experimental results for GMM-based speaker verification are plotted in Figure 7.6. The lowest EER, about 3%, was obtained for 128-mixture NC-FCM-GMMs. Both FCM-GMMs and NC-FCM-GMMs show better results than conventional GMMs, with the largest improvement, about 1%, found for 64-mixture NC-FCM-GMMs. Figure 7.7 shows EERs for VQ-based speaker verification using VQ, FE-VQ and NC-FE-VQ codebooks. Similarly, with codebook sizes of 16 and 32, FE-VQ and NC-FE-VQ show large improvements over the VQ codebooks.

Figure 7.7: EERs (%) for VQ-based speaker verification performed on 16 speakers, using VQ, FE-VQ and NC-FE-VQ codebooks

7.6.2 ANDOSL Results

Experiments were performed on 108 speakers using each speaker as a claimed speaker

with the 5 closest background speakers or 5 same-subgroup background speakers, as indicated in Section 7.1.2, and 102 mixed-gender impostors (excluding the 5 background

speakers) and rotating through all speakers. The total number of claimed test ut-

terances and impostor test utterances are 20520 (108 claimed speakers x 190 test

utterances) and 2093040 ((108 x 102) impostors x 190 test utterances), respectively.

In general, EERs obtained using the top 5 background speaker set are lower

than those obtained using the 5 subgroup background speaker set. For both 16

and 32 mixture GMMs, proposed normalisation methods D3(X), D4(X), . . . , D9(X),

especially D8(X) (FCM membership) and D9(X) (using the arctan function), produce lower EERs than the current normalisation methods L3(X), . . . , L6(X). More interestingly, noise clustering-based normalisation methods D3nc(X), D4nc(X), . . . , D9nc(X)


and L3nc(X), . . . , L6nc(X) produce better results compared to current and proposed

methods. As discussed in the previous chapter, noise clustering-based normalisation methods reduce false acceptances and hence the EER is also reduced. For 16-mixture GMMs, the geometric mean normalisation method L5(X) using the 5 subgroup background speaker set produces the highest EER of 3.52%, whereas the proposed method D8nc(X) using the top-5 background speaker set produces the lowest EER of 1.66%. Similarly, for 32-mixture GMMs, L5(X) using the 5 subgroup background speaker set produces the highest EER of 3.26% and D4nc(X) using the top-5 background speaker set produces the lowest EER of 1.26%.

7.6.3 YOHO Results

Experiments were performed on 138 speakers using each speaker as a claimed speaker

with 5 closest background speakers and 132 mixed-gender impostors (excluding 5

background speakers) and rotating through all speakers. The total number of claimed

test utterances and impostor test utterances are 5520 (138 claimed speakers x 40 test

utterances) and 728640 ((138 x 132) impostors x 40 test utterances), respectively.

Results are shown in two tables as follows.

The first table shows results for conventional GMMs and VQ codebooks, comparing EERs obtained using current and proposed normalisation methods. Similar to the ANDOSL results, for 16, 32 and 64 mixture GMMs, the proposed normalisation methods D3(X), D4(X), . . . , D9(X), especially D9(X) (using the arctan function), produced lower EERs than the current normalisation methods L3(X), . . . , L6(X).

Noise clustering-based normalisation methods D3nc(X), D4nc(X), . . . , D9nc(X) and

L3nc(X), . . . , L6nc(X) also produced better results compared to current and proposed

methods. The current normalisation method L5(X) produced the highest EER of

4.47% for 16-mixture GMMs and the proposed method D9nc(X) produced the lowest

EER of 1.83% for 64-mixture GMMs. Better performance was also obtained with the proposed methods using VQ codebooks. The highest EER of 6.80% was obtained

using the current method L5(X) for VQ codebook size of 16 and the lowest EER of

2.74% was obtained using proposed methods D5nc(X) and D9nc(X).

The second table shows results for conventional GMMs, hard GMMs and VQ codebooks, comparing EERs obtained using conventional and proposed models as well as cur-


rent and proposed normalisation methods. Similar to speaker identification, conven-

tional GMMs produced better results than hard GMMs, and hard GMMs produced

better results than VQ codebooks. The results again favour the proposed normalisation methods. For hard GMMs, the current normalisation method L5(X)

produced the highest EER of 5.90% for 16-component hard GMMs and the proposed

method L5nc(X) produced the lowest EER of 3.10% for 32-component hard GMMs.

7.7 Summary and Conclusion

Isolated word recognition, speaker identification and speaker verification experiments

have been performed on the TI46, ANDOSL and YOHO corpora to evaluate proposed

models and proposed normalisation methods. Fuzzy entropy and fuzzy C-means

methods as well as their noise clustering-based versions and hard methods have been

applied to train discrete and continuous hidden Markov models, Gaussian mixture

models and vector quantisation codebooks.

In isolated word recognition, experiments have shown significant results for fuzzy

entropy and fuzzy C-means hidden Markov models compared to conventional hidden

Markov models performed on the highly confusable vocabulary consisting of the nine

English E-set letters: b, c, d, e, g, p, t, v and z. The lowest recognition error of 27.87% was obtained for noise clustering-based fuzzy C-means 6-state left-to-right discrete hidden Markov models using a VQ codebook size of 64. The highest recognition error of 54.54% was obtained for conventional 6-state left-to-right discrete hidden

Markov models using VQ codebook size of 16. For the 10-digit and 10-command sets,

the lowest errors of 0.34% and 1.43% were obtained for noise clustering-based fuzzy

entropy 6-state left-to-right discrete hidden Markov models using VQ codebook size

of 64 and 128, respectively. Fuzzy C-means continuous hidden Markov models have

shown good results in 46-word recognition with the lowest error of 6.57% obtained

for 5-state 2-mixture fuzzy C-means continuous hidden Markov models.

In speaker identification, experiments have shown good results for fuzzy entropy vector quantisation codebooks and fuzzy C-means Gaussian mixture models. For 16 speakers of the TI46 corpus, noise clustering-based fuzzy C-means Gaussian mixture models achieved the lowest error of 12% using 128 mixtures. Noise clustering-based fuzzy entropy vector quantisation codebooks also achieved the lowest error of 10.13% using a codebook size of 128. For 108 speakers of the ANDOSL corpus, the lowest error of 0.65% was obtained for noise clustering-based fuzzy C-means Gaussian mixture models using 32 mixtures. Results for 138 speakers of the YOHO corpus evaluated hard Gaussian mixture models as intermediate models in the reduction of
Gaussian mixture models to vector quantisation codebooks.

In speaker verification, proposed models and proposed normalisation methods

have been evaluated. For proposed models using current normalisation methods,

the lowest EERs obtained for 16 speakers of the TI46 corpus were about 3% us-

ing 128-mixture noise clustering-based fuzzy C-means Gaussian mixture models and

about 3.2% using 128-codevector noise clustering-based fuzzy entropy vector quanti-

sation codebooks. For proposed normalisation methods using conventional models,

the lowest EER obtained for 108 speakers of the ANDOSL corpus was 1.22% for the

noise clustering-based normalisation method using the fuzzy C-means membership

function as a similarity score for 32-mixture Gaussian mixture speaker models. The

lowest EER for 138 speakers of the YOHO corpus was 1.83% for the noise clustering-

based normalisation method using the arctan function as a dissimilarity score for

64-mixture Gaussian mixture speaker models.


Speaker    Conventional CHMM                               FCM-CHMM
           3 states, 2 mixtures   5 states, 2 mixtures     3 states, 2 mixtures   5 states, 2 mixtures

f1 12.26 7.22 11.31 9.40

f2 8.03 6.53 5.03 5.85

f3 12.24 9.93 10.61 7.62

f4 7.07 7.48 5.99 3.81

f5 15.22 9.92 14.40 10.60

f6 10.19 7.07 8.97 7.47

f7 7.47 5.43 6.52 5.30

f8 6.39 4.49 7.89 4.90

m1 5.77 4.12 4.81 4.40

m2 3.94 2.17 2.72 1.09

m3 12.96 10.09 10.23 8.87

m4 8.90 5.98 9.04 4.45

m5 5.69 3.61 6.24 3.19

m6 13.91 11.85 13.36 11.02

m7 9.74 6.86 10.70 6.31

m8 13.59 12.23 15.49 10.73

Female 9.86 7.26 8.84 6.87

Male 9.32 7.12 9.08 6.27

Average 9.59 7.19 8.96 6.57

Table 7.6: Speaker-dependent recognition error rates (%) for the 46-word set


Speaker    GMM                           FE-GMM                        FCM-GMM
Groups     16 mixtures   32 mixtures     16 mixtures   32 mixtures     16 mixtures   32 mixtures

fyb 0.83 0.54 0.83 0.52 0.50 0.58

fyc 3.95 2.55 3.91 2.51 3.92 2.50

fyg 3.38 2.78 3.35 2.75 3.08 2.17

fmb 2.03 0.98 2.01 0.89 1.75 0.42

fmc 2.25 1.95 2.25 1.89 2.17 1.50

fmg 2.08 1.87 2.05 1.85 1.83 1.50

feb 0.75 0.08 0.71 0.05 0.00 0.00

fec 3.92 1.74 3.90 1.70 3.92 1.58

feg 1.17 0.38 1.13 0.35 0.42 0.33

myb 1.17 0.44 1.06 0.40 0.92 0.25

myc 0.42 0.29 0.37 0.28 0.25 0.17

myg 0.42 0.30 0.38 0.28 0.58 0.25

mmb 0.92 0.17 0.86 0.16 0.50 0.17

mmc 0.42 0.09 0.47 0.09 0.33 0.08

mmg 0.33 0.07 0.30 0.05 0.08 0.00

meb 0.25 0.00 0.21 0.00 0.17 0.00

mec 0.08 0.04 0.06 0.04 0.08 0.00

meg 0.08 0.08 0.08 0.06 0.08 0.17

female 2.26 1.43 2.24 1.39 1.95 1.18

male 0.45 0.16 0.42 0.15 0.33 0.12

young 1.70 1.15 1.65 1.12 1.54 0.99

medium 1.34 0.86 1.32 0.82 1.11 0.61

elder 1.04 0.39 1.02 0.37 0.78 0.35

broad 0.99 0.37 0.95 0.34 0.64 0.24

cultivated 1.84 1.11 1.83 1.09 1.78 0.97

general 1.24 0.91 1.22 0.89 1.01 0.74

Average 1.36 0.80 1.33 0.77 1.14 0.65

Table 7.7: Speaker identification error rates (%) for the ANDOSL corpus using con-

ventional GMMs, FE-GMMs and FCM-GMMs.


Speaker    GMM                           Hard GMM                           VQ
           16 mixtures   32 mixtures     16 components   32 components      16 vectors   32 vectors

Female 15.08 7.27 14.14 6.88 17.11 8.91

Male 11.88 5.09 13.94 9.25 15.90 7.36

Average 13.48 6.18 14.04 8.06 16.5 8.13

Table 7.8: Speaker identification error rates (%) for the YOHO corpus using conven-

tional GMMs, hard GMMs and VQ codebooks.


Normalisation   Top 5 Background Speaker Set         5 Subgroup Background Speaker Set
Methods         16-mixture GMM   32-mixture GMM      16-mixture GMM   32-mixture GMM

L3(X) 2.16 1.79 2.68 2.11

D3(X) 1.84 1.51 2.24 1.61

L3nc(X) 1.81 1.40 2.00 1.41

D3nc(X) 1.79 1.35 1.89 1.33

L4(X) 2.16 1.79 2.68 2.10

D4(X) 1.90 1.47 2.15 1.46

L4nc(X) 1.81 1.40 2.00 1.41

D4nc(X) 1.81 1.36 1.88 1.32

L5(X) 3.16 3.03 3.51 3.26

D5(X) 2.61 2.20 2.77 2.14

L5nc(X) 2.02 1.97 2.32 2.10

D5nc(X) 1.99 1.71 2.27 1.68

D7(X) 2.11 1.84 2.63 2.07

D7nc(X) 1.97 1.51 2.12 1.59

D8(X) 1.91 1.53 2.34 1.69

D8nc(X) 1.66 1.26 1.86 1.32

D9(X) 2.08 1.66 2.33 1.69

D9nc(X) 1.89 1.54 2.23 1.57

Table 7.9: EER Results (%) for the ANDOSL corpus using GMMs with different

background speaker sets. Rows in bold are the current normalisation methods,

others are the proposed methods. The index “nc” denotes noise clustering-based

methods.


Normalisation   GMM                                          VQ
Methods         16 mixtures   32 mixtures   64 mixtures      16 vectors   32 vectors   64 vectors

L3(X) 4.42 3.12 2.43 5.87 4.60 3.60

D3(X) 4.12 2.94 1.98 5.41 4.06 3.14

L3nc(X) 4.20 2.89 1.96 4.96 4.25 3.50

D3nc(X) 4.25 2.90 1.96 4.76 3.84 3.09

L4(X) 4.41 3.13 2.43 5.85 4.58 3.58

D4(X) 4.20 2.97 1.99 5.21 3.90 3.06

L4nc(X) 4.20 2.89 1.96 4.96 4.23 3.49

D4nc(X) 4.21 2.87 1.97 4.72 3.76 3.04

L5(X) 4.47 3.30 2.44 6.80 5.06 4.22

D5(X) 4.10 2.98 2.05 5.32 3.81 3.10

L5nc(X) 3.87 2.74 1.87 4.48 3.36 2.91

D5nc(X) 3.75 2.76 1.88 4.49 3.33 2.74

D7(X) 4.41 3.20 2.44 6.09 4.75 3.62

D7nc(X) 4.30 3.10 2.27 5.10 4.39 3.54

D8(X) 4.17 2.97 2.05 5.29 3.99 3.11

D8nc(X) 4.16 2.95 2.04 4.50 3.41 2.97

D9(X) 3.89 2.84 1.85 4.51 3.31 2.76

D9nc(X) 3.86 2.81 1.83 4.34 3.27 2.74

Table 7.10: Equal Error Rate (EER) Results (%) for the YOHO corpus. Rows in

bold are the current normalisation methods, others are the proposed methods.


Normalisation   GMM                          Hard GMM                           VQ
Methods         16 mixtures   32 mixtures    16 components   32 components      16 vectors   32 vectors

L3(X) 4.42 3.12 5.22 3.99 5.87 4.60

D3(X) 4.12 2.94 4.60 3.40 5.41 4.06

L3nc(X) 4.20 2.89 5.07 3.88 4.96 4.25

D3nc(X) 4.25 2.90 4.38 3.33 4.76 3.84

L4(X) 4.41 3.13 5.20 4.00 5.85 4.58

D4(X) 4.20 2.97 4.66 3.48 5.21 3.90

L4nc(X) 4.20 2.89 4.57 3.89 4.96 4.23

D4nc(X) 4.21 2.87 4.57 3.45 4.72 3.76

L5(X) 4.47 3.30 5.90 4.42 6.80 5.06

D5(X) 4.10 2.98 4.81 3.41 5.32 3.81

L5nc(X) 3.87 2.74 4.26 3.10 4.48 3.36

D5nc(X) 3.75 2.76 4.30 3.19 4.49 3.33

D7(X) 4.41 3.20 5.22 3.97 6.09 4.75

D7nc(X) 4.30 3.10 5.03 3.92 5.10 4.39

D8(X) 4.17 2.97 4.86 3.62 5.29 3.99

D8nc(X) 4.16 2.95 4.45 3.21 4.50 3.41

D9(X) 3.89 2.84 4.36 3.18 4.51 3.31

D9nc(X) 3.86 2.81 4.32 3.13 4.34 3.27

Table 7.11: Comparisons of EER Results (%) for the YOHO corpus using GMMs,

hard GMMs and VQ codebooks. Rows in bold are the current normalisation methods,

others are the proposed methods.


Chapter 8

Extensions of the Thesis

The fuzzy membership function in fuzzy set theory and the a posteriori probability in the Bayes decision theory have similar roles. However, the minimum recognition error rate for a recogniser is obtained by the maximum a posteriori probability rule, whereas the maximum membership rule does not lead to such a minimum error rate. This problem can be solved by using a recently developed branch of fuzzy set theory, namely possibility theory. The development of this theory has led to a theoretical framework similar to that of probability theory. Therefore, a possibilistic pattern recognition approach to speech and speaker recognition may be developed which is as powerful as the statistical pattern recognition approach. In this chapter, possibility theory is introduced briefly and a small application of this theory, namely a possibilistic C-means approach to speech and speaker recognition, is proposed. It is suggested that future research into the possibilistic pattern
recognition approach to speech and speaker recognition may be very promising.

8.1 Possibility Theory-Based Approach

8.1.1 Possibility Theory

Let us review the basis of the statistical pattern recognition approach in Section 2.3

page 20. The goal of a recogniser is to achieve the minimum recognition error rate.

The maximum a posteriori probability (MAP) decision rule is selected to implement

this goal. However, these probabilities are not known in advance and have to be

estimated from a training set of observations with known class labels. The Bayes


decision theory thus effectively transforms the recogniser design problem into the

distribution estimation problem and the MAP rule is transformed to the maximum

likelihood rule. The distributions are usually parameterised in order to be practically

implementable. The main task is to determine the right parametric form of the

distributions from the training data.

For the fuzzy pattern recognition approach in Section 2.4 page 34, the maximum

membership rule is selected to solve the recogniser design problem. This rule is trans-

formed to the minimum distance rule in fuzzy cluster analysis. However, it has not

been shown that the implementation of these rules leads to the goal of the recog-

niser, i.e. the minimum recognition error rate as in the statistical pattern recognition

approach.

The fuzzy approaches proposed in this thesis have shown a solution for this prob-

lem. By defining the general distance as a decreasing function of the component

distribution, the minimum distance rule becomes the maximum likelihood rule. This

means that the minimum recognition error rate can be achieved in the proposed

approaches.
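
For example, choosing the general distance as the negative logarithm of the component density (the same choice that underlies fuzzy E-steps such as (8.15) below),

d^2(\mathbf{x}, \lambda_i) = -\log P(\mathbf{x} \mid \lambda_i),

gives

\arg\min_i d^2(\mathbf{x}, \lambda_i) = \arg\max_i P(\mathbf{x} \mid \lambda_i),

so the minimum distance rule and the maximum likelihood rule select the same class.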

An alternative approach that is also based on fuzzy set theory is the use of pos-

sibility theory introduced by Zadeh [1978], where fuzzy variables are associated with

possibility distributions in a similar way to that in which random variables are as-

sociated with probability distributions. A possibility distribution is a representation

of knowledge and information. It can be said that probability theory is used to

deal with randomness and possibility theory is used to deal with vagueness and ambiguity

[Tanaka and Guo 1999]. Possibility theory is more concerned with the modelling of

partial belief which is due to incomplete data rather than that which is due to the

presence of random phenomena [Dubois and Prade 1988].

In possibilistic data analysis, the total error possibility is defined and the maximum

possibility rule is formulated. The possibility distributions are also parameterised and

the right parametric form can be determined from the training data. In a new ap-

plication of the possibilistic approach to operations research [Tanaka and Guo 1999],

an exponential possibility distribution that is similar to a Gaussian distribution has

been proposed. Similarly, applying the possibilistic approach to speech and speaker

recognition would be worth investigating.


8.1.2 Possibility Distributions

A possibility distribution on a one-dimensional space is a fuzzy membership function of a point x in a fuzzy set A and is denoted as Π_A(x). For a set of n numbers h_1, . . . , h_n, let the h-level sets A_{h_i} = {x | Π_A(x) ≥ h_i} be conventional sets (intervals) such that if h_1 ≤ h_2 ≤ . . . ≤ h_n then A_{h_1} ⊇ A_{h_2} ⊇ . . . ⊇ A_{h_n}. The distribution Π_A(x)
should satisfy the following conditions

• There exists an x such that ΠA(x) = 1 (normality),

• h-level sets of fuzzy numbers are convex (convexity),

• ΠA(x) is piecewise continuous (continuity).

The possibility function is a unimodal function. The possibility distribution on the

d-dimensional space is similarly defined. Let A be a fuzzy vector defined as

A = \{(x_1, \ldots, x_d) \mid x_1 \in A_1, \ldots, x_d \in A_d\} \qquad (8.1)

where A1, . . . , Ad are fuzzy sets. Denoting x = (x1, . . . , xd)′, the possibility distribu-

tion of A can be defined by

\Pi_A(\mathbf{x}) = \Pi_{A_1}(x_1) \wedge \ldots \wedge \Pi_{A_d}(x_d) \qquad (8.2)

For example, an exponential possibility distribution on the d-dimensional space can

be described as

\Pi_A(\mathbf{x}) = \exp\{ -(\mathbf{x} - \mathbf{m})' S_A^{-1} (\mathbf{x} - \mathbf{m}) \} \qquad (8.3)

where m is a centre vector and S_A is a symmetric positive definite matrix. The parametric representation of the exponential possibility distribution is λ = (m, S_A).
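
As a small illustration, the following Python sketch evaluates the exponential possibility distribution (8.3); the centre vector and matrix values are arbitrary example parameters, not taken from the thesis.

import numpy as np

def exponential_possibility(x, m, S_A):
    # Pi_A(x) = exp{ -(x - m)' S_A^{-1} (x - m) }, equation (8.3)
    diff = np.asarray(x, dtype=float) - np.asarray(m, dtype=float)
    return float(np.exp(-diff @ np.linalg.inv(S_A) @ diff))

m = np.array([0.0, 0.0])
S_A = np.array([[2.0, 0.5],
                [0.5, 1.0]])
print(exponential_possibility([0.0, 0.0], m, S_A))   # 1.0 at the centre (normality)
print(exponential_possibility([1.0, 1.0], m, S_A))   # < 1 away from the centre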

8.1.3 Maximum Possibility Rule

Consider the simplest case where two classes A and B are characterised by two possi-

bility distributions ΠA(x) and ΠB(x), respectively. The task is to classify the vector

x into A or B. Let uA(x) and uB(x) be the degrees of possibility to which x belongs

to A and B, respectively. The error possibilities that x belonging to A is assigned to

B and vice versa are denoted as

E(A \to B) = \max_{\mathbf{x}} u_B(\mathbf{x}) \Pi_A(\mathbf{x}), \qquad E(B \to A) = \max_{\mathbf{x}} u_A(\mathbf{x}) \Pi_B(\mathbf{x}) \qquad (8.4)


The total error possibility E can be defined as

E = E(A → B) + E(B → A) (8.5)

It can further be shown that [Tanaka and Guo 1999]

E \ge \max_{\mathbf{x}} \left[ u_{B^*}(\mathbf{x}) \Pi_A(\mathbf{x}) + u_{A^*}(\mathbf{x}) \Pi_B(\mathbf{x}) \right] \qquad (8.6)

where

u_{A^*}(\mathbf{x}) = \begin{cases} 1 & \Pi_A(\mathbf{x}) \ge \Pi_B(\mathbf{x}) \\ 0 & \text{otherwise} \end{cases} \qquad u_{B^*}(\mathbf{x}) = \begin{cases} 1 & \Pi_B(\mathbf{x}) > \Pi_A(\mathbf{x}) \\ 0 & \text{otherwise} \end{cases} \qquad (8.7)

Then we obtain the maximum possibility rule written as an if-then rule as follows

If ΠA(x) ≥ ΠB(x) then x belongs to A,

If ΠA(x) < ΠB(x) then x belongs to B. (8.8)
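
A minimal Python sketch of this decision rule, assuming two illustrative one-dimensional exponential possibility distributions (the centres are arbitrary examples):

import numpy as np

def classify(x, pi_A, pi_B):
    # Maximum possibility rule (8.8): assign x to A if Pi_A(x) >= Pi_B(x).
    return 'A' if pi_A(x) >= pi_B(x) else 'B'

pi_A = lambda x: np.exp(-(x - 0.0) ** 2)   # class A centred at 0
pi_B = lambda x: np.exp(-(x - 3.0) ** 2)   # class B centred at 3
print(classify(1.0, pi_A, pi_B))           # 'A'
print(classify(2.0, pi_A, pi_B))           # 'B'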

8.2 Possibilistic C-Means Approach

8.2.1 Possibilistic C-Means Clustering

As shown in the noise clustering method (see Section 2.4.7 on page 39), the FCM

method uses the probabilistic constraint that the memberships of a feature vector

xt across clusters must sum to one. It is meant to avoid the trivial solution of all

memberships being equal to zero. However, since the memberships generated by this

constraint are relative numbers, they are not suitable for applications in which the

memberships are supposed to represent typicality or compatibility with an elastic

constraint. A possibilistic C-means (PCM) clustering has been proposed to generate

memberships that have a typicality interpretation [Krishnapuram and Keller 1993].

Following the fuzzy set theory [Zadeh 1965], the membership uit = ui(xt) is the

degree of compatibility of the feature vector xt with cluster i, or the possibility of

xt belonging to cluster i. If the clusters are thought of as a set of fuzzy subsets defined over the domain of discourse X = {x1, . . . , xT}, then

there should be no constraint on the sum of the memberships.

Let U = [uit] be a matrix whose elements are memberships of xt in cluster i,

i = 1, . . . , C, t = 1, . . . , T . Possibilistic C-partition space for X is the set of matrices


U such that

0 \le u_{it} \le 1 \;\; \forall i, t, \qquad \max_{1 \le i \le C} u_{it} > 0 \;\; \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \;\; \forall i \qquad (8.9)

The objective function may be formulated as follows

J_m(U, \lambda; X) = \sum_{i=1}^{C} \sum_{t=1}^{T} u_{it}^m d_{it}^2 + \sum_{i=1}^{C} \eta_i \sum_{t=1}^{T} (1 - u_{it})^m \qquad (8.10)

where U = {uit} is a possibilistic C-partition of X, m > 1, and ηi, i = 1, . . . , C

are suitable positive numbers. Minimising the PCM objective function Jm(U, λ; X)

in (8.10) requires the distances in the first term to be as low as possible and the u_{it} in the second term to be as large as possible, thus avoiding the trivial solution. Parameters are estimated similarly to those in the NC approach. Minimising (8.10)

with respect to uit gives

u_{it} = \frac{1}{1 + \left( d_{it}^2 / \eta_i \right)^{1/(m-1)}} \qquad (8.11)

Equation (8.11) defines a possibility distribution function Πi for cluster i over the do-

main of discourse consisting of all feature vectors xt ∈ X. The value of ηi determines

the relative degree to which the second term in the objective function is important

compared with the first. In general, ηi relates to the overall size and shape of cluster

i. In practice, the following definition works well

\eta_i = K_0 \sum_{t=1}^{T} u_{it}^m d_{it}^2 \Big/ \sum_{t=1}^{T} u_{it}^m \qquad (8.12)

where typically K0 is chosen to be one.
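
A minimal Python sketch of one possible PCM iteration, alternating the membership update (8.11), the estimate of η_i in (8.12) and a fuzzy-mean update of the cluster centres. The squared Euclidean distance, the random initialisation and the re-estimation of η_i at every iteration are assumptions of this sketch; in practice η_i is often estimated once from an initial FCM run and then kept fixed.

import numpy as np

def pcm(X, C, m=2.0, K0=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=C, replace=False)].copy()
    eta = np.full(C, K0)
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - centres[:, None, :]) ** 2).sum(-1)    # d_it^2, shape (C, T)
        U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # memberships (8.11)
        eta = K0 * (U**m * d2).sum(axis=1) / (U**m).sum(axis=1)      # eta_i as in (8.12)
        centres = (U**m @ X) / (U**m).sum(axis=1, keepdims=True)     # centre update
    return U, centres, eta

# Example: two well-separated two-dimensional clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
U, centres, eta = pcm(X, C=2)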

8.2.2 PCM Approach to FE-HMMs

For the possibilistic C-Means (PCM) approach, based on (8.9) in Section 8.2.1, the

matrix U = [uijt] is defined as

0 \le u_{ijt} \le 1 \;\; \forall i, j, t, \qquad \max_{1 \le i, j \le N} u_{ijt} > 0 \;\; \forall t, \qquad 0 < \sum_{t=1}^{T} u_{ijt} < T \;\; \forall i, j \qquad (8.13)


The fuzzy likelihood function in the PCM approach is as follows

L_n(U, \lambda; O) = -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt} d_{ijt}^2 - \sum_{i=1}^{N} \sum_{j=1}^{N} n_{ij} \sum_{t=0}^{T-1} (u_{ijt} \log u_{ijt} - u_{ijt}) \qquad (8.14)

where the degree of fuzzy entropy nij > 0 is dependent on the states. Since f(uijt) =

uijt log uijt−uijt is a monotonically decreasing function in [0, 1], maximising the fuzzy

likelihood Ln(U, λ; O) in (8.14) over U forces uijt to be as large as possible. By setting

the derivative of Ln(U, λ; O) with respect to uijt to zero, we obtain the algorithm for

the FE-DHMM in the PCM approach (PCM-FE-DHMM)

Fuzzy E-Step:

u_{ijt} = \exp\left\{ -\frac{d_{ijt}^2}{n_{ij}} \right\} = \left[ P(O, s_t = i, s_{t+1} = j \mid \lambda) \right]^{1/n_{ij}} \qquad (8.15)

M-Step: The M-step is identical to the M-step of the FE-DHMM in Section

3.4.2.

Similarly, the FE-CHMM in the PCM approach (PCM-FE-CHMM) is as follows

Fuzzy E-Step:

u_{jkt} = \exp\left\{ -\frac{d_{jkt}^2}{n_{jk}} \right\} = \left[ P(O, s_t = j, m_t = k \mid \lambda) \right]^{1/n_{jk}} \qquad (8.16)

M-Step: The M-step is identical to the M-step of the FE-CHMM in Section 3.4.3.
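
A minimal Python sketch of the fuzzy E-step (8.15), assuming the forward variables α_t(i), backward variables β_t(j), transition probabilities a_ij and discrete emission probabilities b_j(o_t) have already been obtained from the usual forward-backward recursions (not shown here). A practical implementation would work in the log domain to avoid numerical underflow.

import numpy as np

def pcm_fe_dhmm_memberships(alpha, beta, A, B, obs, n):
    # u[i, j, t] = P(O, s_t = i, s_{t+1} = j | lambda) ** (1 / n[i, j]), equation (8.15),
    # using the standard identity P(O, s_t = i, s_{t+1} = j | lambda)
    #   = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j).
    T, N = alpha.shape
    u = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        joint = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        u[:, :, t] = joint ** (1.0 / n)
    return u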

8.2.3 PCM Approach to FCM-HMMs

For the possibilistic C-Means (PCM) approach, based on (8.9) in Section 8.2.1 page

127, the matrix U = [uijt] is defined as

0 \le u_{ijt} \le 1 \;\; \forall i, j, t, \qquad \max_{1 \le i, j \le N} u_{ijt} > 0 \;\; \forall t, \qquad 0 < \sum_{t=1}^{T} u_{ijt} < T \;\; \forall i, j \qquad (8.17)

The fuzzy objective function in the PCM approach is as follows

J_m(U, \lambda; O) = \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ijt}^m d_{ijt}^2 + \sum_{i=1}^{N} \sum_{j=1}^{N} \eta_{ij} \sum_{t=0}^{T-1} (1 - u_{ijt})^m \qquad (8.18)

where ηij are suitable positive numbers. The second term forces uijt to be as large as

possible. By setting the derivative of Jm(U, λ; O) with respect to uijt to zero, we obtain

the algorithm for the FCM-DHMM in the PCM approach (PCM-FCM-DHMM)


Fuzzy E-Step:

u_{ijt} = \frac{1}{1 + \left( d_{ijt}^2 / \eta_{ij} \right)^{1/(m-1)}} \qquad (8.19)

where dijt is defined in (3.21).

M-Step: The M-step is identical to the M-step of the FCM-DHMM in Section

4.2.1.

Similarly, the FCM-CHMM in the PCM approach (PCM-FCM-CHMM) is as follows

Fuzzy E-Step:

u_{jkt} = \frac{1}{1 + \left( d_{jkt}^2 / \eta_{jk} \right)^{1/(m-1)}} \qquad (8.20)

where djkt is defined in (3.30).

M-Step: The M-step is identical to the M-step of the FCM-CHMM in Section

4.2.2.

8.2.4 PCM Approach to FE-GMMs

The matrix U = [uit] is defined as

0 \le u_{it} \le 1 \;\; \forall i, t, \qquad \max_{1 \le i \le K} u_{it} > 0 \;\; \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \;\; \forall i \qquad (8.21)

The fuzzy likelihood function in the PCM approach is as follows

L_n(U, \lambda; X) = -\sum_{t=1}^{T} \sum_{i=1}^{K} u_{it} d_{it}^2 - \sum_{i=1}^{K} n_i \sum_{t=1}^{T} (u_{it} \log u_{it} - u_{it}) \qquad (8.22)

where the degree of fuzzy entropy ni > 0 is dependent on the clusters. The algorithm

for the FE-GMM in the PCM approach (PCM-FE-GMM) is as follows

Fuzzy E-Step:

u_{it} = \exp\left\{ -\frac{d_{it}^2}{n_i} \right\} = \left[ P(\mathbf{x}_t, i \mid \lambda) \right]^{1/n_i} \qquad (8.23)

M-Step: The M-step is identical to the M-step of the FE-GMM in Section 3.5.1.
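
A minimal Python sketch of the fuzzy E-step (8.23), where the joint density P(x_t, i | λ) is assumed, as is usual for a GMM, to be the mixture weight times the Gaussian component density.

import numpy as np
from scipy.stats import multivariate_normal

def pcm_fe_gmm_memberships(X, weights, means, covs, n):
    # u[i, t] = [ w_i N(x_t; mu_i, Sigma_i) ] ** (1 / n_i), equation (8.23)
    K, T = len(weights), len(X)
    U = np.zeros((K, T))
    for i in range(K):
        joint = weights[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
        U[i] = joint ** (1.0 / n[i])
    return U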


8.2.5 PCM Approach to FCM-GMMs

The matrix U = [uit] is defined as

0 \le u_{it} \le 1 \;\; \forall i, t, \qquad \max_{1 \le i \le K} u_{it} > 0 \;\; \forall t, \qquad 0 < \sum_{t=1}^{T} u_{it} < T \;\; \forall i \qquad (8.24)

The fuzzy objective function in the PCM approach is as follows

J_m(U, \lambda; X) = \sum_{t=1}^{T} \sum_{i=1}^{K} u_{it}^m d_{it}^2 + \sum_{i=1}^{K} \eta_i \sum_{t=1}^{T} (1 - u_{it})^m \qquad (8.25)

where η_i, i = 1, . . . , K are suitable positive numbers. The algorithm for the FCM-GMM in the PCM approach (PCM-FCM-GMM) is as follows

Fuzzy E-Step:

u_{it} = \frac{1}{1 + \left( d_{it}^2 / \eta_i \right)^{1/(m-1)}} \qquad (8.26)

M-Step: The M-step is identical to the M-step of the FCM-GMM in Section

4.3.1.
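
A minimal Python sketch of one PCM-FCM-GMM iteration: the fuzzy E-step is (8.26), and the M-step shown here uses the usual u^m-weighted re-estimation of the mixture weights, means and covariances. The M-step formulas are an assumption of this sketch; the exact M-step is the one given for the FCM-GMM in Section 4.3.1.

import numpy as np

def pcm_fcm_gmm_step(X, d2, eta, m):
    U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # fuzzy E-step (8.26)
    Um = U ** m
    weights = Um.sum(axis=1) / Um.sum()                          # mixture weights
    means = (Um @ X) / Um.sum(axis=1, keepdims=True)             # component means
    covs = []
    for i in range(len(eta)):
        diff = X - means[i]
        covs.append((Um[i][:, None] * diff).T @ diff / Um[i].sum())
    return U, weights, means, np.array(covs)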

8.2.6 Summary and Conclusion

The theory of possibility, in contrast to heuristic approaches, offers algorithms for

composing hypothesis evaluations which are consistent with axioms in a well-developed

theory. Possibility theory, rather than probability theory, relates to the perception of

degrees of evidence instead of degrees of likelihood. In any case, using possibilities

does not prevent us from using statistics in the estimation of membership functions

[De Mori and Laface 1980].

In the PCM approach, the membership values can be interpreted as possibility

values or degrees of typicality of vectors in clusters. The possibilistic C-partition

defines C distinct possibility distributions and the PCM algorithm can be used to

estimate possibility distributions directly from training data. The roles of PCM clustering and the PCM models can be seen by comparing Figures 8.1, 8.2 and 8.3 below with the corresponding figures on pages 41, 65 and 75.


[Figure 8.1 is a diagram placing PCM among the clustering techniques. All techniques share the membership constraints 0 ≤ u_it ≤ 1 and 0 < Σ_t u_it < T; they are then distinguished by u_it ∈ {0, 1} (HCM), Σ_i u_it = 1 (FCM), Σ_i u_it < 1 (NC) and max_i u_it > 0 (PCM), with extensions from the parameter set λ = {µ} to λ = {µ, Σ} (Gustafson-Kessel, extended NC, extended PCM) and λ = {µ, Σ, w} (Gath-Geva).]

Figure 8.1: PCM Clustering in Clustering Techniques

[Figure 8.2 is a tree diagram of the FE models. The FE discrete models comprise the FE-DHMM, NC-FE-DHMM and PCM-FE-DHMM; the FE continuous models comprise the FE-CHMM, NC-FE-CHMM and PCM-FE-CHMM, the FE-GMM, NC-FE-GMM and PCM-FE-GMM, and the FE-VQ, NC-FE-VQ and PCM-FE-VQ.]

Figure 8.2: PCM Approach to FE models for speech and speaker recognition


[Figure 8.3 is the corresponding tree diagram of the FCM models. The FCM discrete models comprise the FCM-DHMM, NC-FCM-DHMM and PCM-FCM-DHMM; the FCM continuous models comprise the FCM-CHMM, NC-FCM-CHMM and PCM-FCM-CHMM, the FCM-GMM, NC-FCM-GMM and PCM-FCM-GMM, and the FCM-VQ, NC-FCM-VQ and PCM-FCM-VQ.]

Figure 8.3: PCM Approach to FCM models for speech and speaker recognition


Chapter 9

Conclusions and Future Research

9.1 Conclusions

Fuzzy approaches to speech and speaker recognition have been proposed and exper-

imentally evaluated in this thesis. To obtain these approaches, the following basic

problems have been solved. First, the time-dependent fuzzy membership function

has been introduced into hidden Markov modelling to denote the degree to which an observation sequence belongs to a state sequence. Second, a relationship between

modelling techniques and clustering techniques has been established by using a gen-

eral distance defined as a decreasing function of the component probability density.

Third, a relationship between fuzzy models and conventional models is also estab-

lished by introducing a new technique—fuzzy entropy clustering. Finally, since the

roles of the fuzzy membership function and the a posteriori probability in the Bayes

decision theory are quite similar, the maximum a posteriori rule can be generalised

to the maximum membership rule. With the above general distance, the use of the

maximum membership rule also achieves the minimum recognition error rate for a

speech and speaker recogniser.

Fuzzy entropy models are the first set of proposed models in the fuzzy modelling

approach. A parameter is introduced as the degree of fuzzy entropy n > 0. With

n → 0, we obtain hard models. With n = 1, fuzzy entropy models reduce to con-

ventional models in the maximum likelihood scheme. Thus, statistical models can

be viewed as special cases of fuzzy models. Fuzzy entropy hidden Markov models,


fuzzy entropy Gaussian mixture models and fuzzy entropy vector quantisation have

all been proposed.

Fuzzy C-means models are the second set of proposed models in the fuzzy mod-

elling approach. A parameter is introduced as the degree of fuzziness m > 1. With

m → 1, we also obtain hard models. Fuzzy C-means hidden Markov models and fuzzy C-means Gaussian mixture models have also been proposed.
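
A toy numeric sketch of this limiting behaviour (purely illustrative; it assumes the normalised fuzzy entropy membership u_i ∝ P_i^{1/n} implied by (8.15) and the standard FCM membership u_i ∝ (1/d_i^2)^{1/(m-1)}):

import numpy as np

P = np.array([0.6, 0.3, 0.1])        # component likelihoods for one observation
d2 = -np.log(P)                      # general distances d^2 = -log P

def fe_membership(n):
    u = P ** (1.0 / n)
    return u / u.sum()

def fcm_membership(m):
    u = (1.0 / d2) ** (1.0 / (m - 1.0))
    return u / u.sum()

print(fe_membership(1.0))      # n = 1: the conventional posterior probabilities
print(fe_membership(0.05))     # n -> 0: approaches a hard (0/1) assignment
print(fcm_membership(1.05))    # m -> 1: also approaches a hard assignment
print(fcm_membership(2.0))     # larger m: softer memberships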

Noise clustering is an interesting fuzzy approach to fuzzy entropy and fuzzy C-

means models. This approach is simple but robust and has performed very well in the experimental evaluations.

In general, fuzzy entropy and fuzzy C-means models have a common advantage, namely their adjustable parameters n and m. When conventional models do

not work well because of the insufficient training data problem or the complexity of

speech data, such as the nine English E-set words, a suitable value of n or m can be

found to obtain better models.

Hard models are the third set of proposed models. These models arise as limiting cases of the fuzzy models as the fuzzy parameters n and m tend to their limit

values. Hard HMMs are the single-state sequence HMMs. Conventional HMMs using

the Viterbi algorithm can be regarded as “pretty” hard HMMs.

The fuzzy approach to speaker verification is an alternative fuzzy approach in this

thesis. Based on the use of the fuzzy membership function as the claimed speaker’s

score and consideration of the likelihood transformation, six fuzzy membership scores

and ten noise clustering-based scores have been proposed. Using the arctan function

in computing the score illustrates a theoretical extension for normalisation methods,

where not only the logarithm function but also other functions can be used.
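
Purely as an illustration of this point (the exact score definitions used in Chapter 7 are not reproduced here), any monotonic function of the likelihood ratio yields an equally valid ordering of claimed-speaker scores; a hypothetical arctan-based variant might look like the following sketch.

import numpy as np

def log_ratio_score(p_claimed, p_background):
    # conventional normalisation: log-likelihood ratio
    return np.log(p_claimed) - np.log(p_background)

def arctan_ratio_score(p_claimed, p_background):
    # hypothetical variant: arctan applied to the likelihood ratio
    return np.arctan(p_claimed / p_background)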

Isolated word recognition, speaker identification and speaker verification experi-

ments have been performed on the TI46, ANDOSL and YOHO corpora to evaluate

proposed models and proposed normalisation methods. In isolated word recognition,

experiments performed on the highly confusable vocabulary of the English E-set letters b, c, d, e, g, p, t, u, v and z have shown very good results for fuzzy entropy and fuzzy C-means hidden Markov models compared to conventional hidden Markov models. In speaker identification, experiments have shown good results for fuzzy en-

tropy vector quantisation codebooks and fuzzy C-means Gaussian mixture models. In


speaker verification, experiments have shown better results for the proposed normal-

isation methods, especially for the noise clustering-based methods. With 2,093,040

test utterances for each ANDOSL result and 728,640 test utterances for each YOHO

result, these evaluation experiments are sufficiently reliable.

9.2 Directions for Future Research

Several directions for future research have been suggested which may extend or aug-

ment the work in this thesis. These are:

• Possibilistic Pattern Recognition Approach: As shown in Chapter 8, this

approach would be worth investigating. A possibility theory framework to re-

place the Bayes decision theory framework for the minimum recognition error

rate task of a recogniser looks very promising.

• Fuzzy Entropy Clustering: This technique will be further investigated in both theoretical and experimental aspects, such as local minima, convergence, cluster validity, cluster analysis and classifier design in pattern recognition.

• Fuzzy Approach to Discriminative Methods: The fuzzy approaches pro-

posed in this thesis are based on maximum likelihood-based methods. Since

discriminative methods such as maximum mutual information and generalised

probabilistic descent are also effective, a fuzzy approach to these methods should also be investigated.

• Large Vocabulary Speech Recognition: The speech recognition experi-

ments in this thesis were isolated word recognition experiments on small vocab-

ularies. Therefore, to obtain a better evaluation for the proposed fuzzy models,

continuous speech recognition experiments on large vocabularies should be car-

ried out.

• Likelihood Transformations: Since speaker verification has many important

applications, other likelihood transformations should be studied to find more

effective normalisation methods for speaker verification.


Bibliography

[Abramson 1963] N. Abramson, Information Theory and Coding, McGraw Hill, 1963.

[Allerhand 1987] M. Allerhand, Knowledge-Based Speech Pattern Recognition, Kogan Page Ltd,

London, 1987.

[Ambroise et al. 1997] C. Ambroise, M. Dang and G. Govaert, “Clustering of Spatial Data by the

EM Algorithm”, in A. Soares, J. Gomez-Hernandez and R. Froidevaux (eds), geoENV I - Geo-

statistics for Environmental Applications, vol. 9 of Quantitative Geology and Geostatistics,

Kluwer Academic Publisher, pp. 493-504, 1997.

[Ambroise and Govaert 1998] C. Ambroise and G. Govaert, “Convergence Proof of an EM-Type

Algorithm for Spatial Clustering”, Pattern Recognition Letters, vol. 19, pp. 919-927, 1998.

[Atal 1974] B. S. Atal, “Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic

Speaker Identification and Verification”, J. Acoust. Soc. Am., vol. 55, pp. 1304-1312, 1974.

[AT&T’s web site] http://www.research.att.com

[Bakis 1976] R. Bakis, “Continuous Speech Word Recognition via Centisecond Acoustic States”, in

Proc. ASA Meeting (Washington, DC), April, 1976.

[Banon 1981] G. Banon, “Distinction Between Several Subsets of Fuzzy Measures”, Fuzzy Sets and

Systems, vol. 5, pp. 291-305, 1981.

[Baum 1972] L. E. Baum, “An inequality and associated maximisation technique in statistical esti-

mation for probabilistic functions of a Markov process”, Inequalities, vol. 3, pp. 1-8, 1972.

[Baum and Sell 1968] L. E. Baum and G. Sell, “Growth transformations for functions on manifolds”,

Pacific J. Maths., vol. 27, pp. 211-227, 1968.

[Baum and Eagon 1967] L. E. Baum and J. A. Eagon, “An inequality with applications to statistical

estimation for probabilistic functions of a Markov process and to a model for Ecology”, Bull.

Amer. Math. Soc., vol. 73, pp. 360-363, 1967.


[BBN’s web site] http://www.gte.com/AboutGTE/gto/bbnt/speech/research/technologies/index.html.

[Bellegarda 1996] J. R. Bellegarda, “Context dependent vector quantization for speech recognition”,

chapter 6 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui

Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 133-158,

1996.

[Bellegarda and Nahamoo 1990] J. R. Bellegarda and D. Nahamoo, “Tied mixture continuous pa-

rameter modelling for speech recognition”, in IEEE Trans. Acoustics, Speech, Signal Proc.,

vol. 38, pp. 2033-2045, 1990.

[Bellman et al. 1966] R. Bellman, R. Kalaba and L. A. Zadeh, “Abstraction and Pattern Recogni-

tion”, J. Math. Anal. Appl., vol. 13, pp. 1-7, 1966.

[Bezdek et al. 1998] J. C. Bezdek, T. R. Reichherzer, G. S. Lim and Y. Attikiouzel, “Multiple-

prototype classifier design”, IEEE Trans. Syst. Man Cybern., vol. 28, no. 1, pp. 67-79, 1998.

[Bezdek 1993] J. C. Bezdek, “A review of probabilistic, fuzzy and neural models for pattern recog-

nition”, J. Intell. and Fuzzy Syst., vol. 1, no. 1, pp. 1-25, 1993.

[Bezdek and Pal 1992] J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition, IEEE

Press, 1992.

[Bezdek 1990] J. C. Bezdek, “A Convergence Theorem for the Fuzzy ISODATA Clustering Algo-

rithms”, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, no.1, pp. 1-8, January

1990.

[Bezdek 1981] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum

Press, New York and London, 1981.

[Bezdek and Castelaz 1977] J. C. Bezdek and P. F. Castelaz, “Prototype classification and feature

selection with fuzzy sets”, IEEE Trans. Syst. Man Cybern., vol. SMC-7, no. 2, pp. 87-92,

1977.

[Bezdek 1974] J. C. Bezdek, “Cluster validity with fuzzy sets”, J. Cybern., vol. 3, no. 3, pp. 58-72,

1974.

[Bezdek 1973] J. C. Bezdek, Fuzzy mathematics in Pattern Classification, Ph.D. thesis, Applied

Math. Center, Cornell University, Ithaca, 1973.

[Booth and Hobert 1999] J. G. Booth and J. P. Hobert, “Maximizing Generalized Linear Mixed

Model Likelihoods with an Automated Monte Carlo EM algorithm”, J. Roy. Stat. Soc., Ser.

B, 1999 (to appear).


[Cambridge’s web site] http://svr-www.eng.cam.ac.uk/

[Campbell 1997] J. P. Campbell, “Speaker Recognition: A Tutorial”, in Special issue on Automated

biometric Syst., Proc. IEEE, vol. 85, no. 9, pp. 1436-1462, 1997.

[Chou et al. 1989] P. Chou, T. Lookabaugh and R. Gray, “Entropy-constrained vector quantisa-

tion”, IEEE Trans. Acoustic, Speech, and Signal Processing, vol. ASSP-37, pp. 31-42, 1989.

[Choi and Oh 1996] H. J. Choi and Y. H. Oh, “Speech recognition using an enhanced FVQ based on

codeword dependent distribution normalization and codeword weighting by fuzzy objective

function”, in Proceedings of the International Conference on Spoken Language Processing

(ICSLP), vol. 1, pp. 354-357, 1996.

[Cover and Hart 1967] T. M. Cover and P. E. Hart, “Nearest neighbour pattern classification”,

IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, 1967.

[CSELT’s web site] http://www.cselt.it/

[Dang and Govaert 1998] M. Dang and G. Govaert, “Spatial Fuzzy Clustering using EM and Markov

Random Fields”, J. Syst. Research & Inform. Sci., vol. 8, pp. 183-202, 1998.

[Das and Picheny 1996] S. K. Das and M. A. Picheny, “Issues in practical large vocabulary isolated

word recognition: the IBM Tangora system”, chapter 19 in Automatic Speech and Speaker

Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K.

Paliwal, Kluwer Academic Publishers, USA, pp. 457-480, 1996.

[Dave and Krishnapuram 1997] R. N. Dave and R. Krishnapuram, “Robust clustering methods: a

unified view”, IEEE Trans. Fuzzy Syst., vol. 5, no.2, pp. 270-293, 1997.

[Dave and Bhaswan 1992] R. N. Dave and K. Bhaswan, “Adaptive fuzzy c-shells clustering and

detection of ellipses”, IEEE Trans. Neural Networks, vol. 3, pp. 643-662, May 1992.

[Dave 1991] R. N. Dave, “Characterization and detection of noise in clustering”, Pattern Recognition

Lett., vol. 12, no. 11, pp. 657-664, 1991.

[Dave 1990] R. N. Dave, “Fuzzy-shell clustering and applications to circle detection in digital im-

ages”, Int. J. General Systems, vol. 16, pp. 343-355, 1990.

[De Luca and Termini 1972] A. de Luca, S. Termini, “A definition of a nonprobabilistic entropy in

the setting of fuzzy set theory”, Inform. Control, vol. 20, pp. 301-312, 1972.

[De Mori and Laface 1980] R. De Mori and P. Laface, “Use of fuzzy algorithms for phonetic and

phonemic labeling of continuous speech”, IEEE trans. Pattern Anal. Machine Intell., vol.

PAMI-2, no. 2, pp. 136-148, March 1980.


[Dempster et al. 1977] A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from

Incomplete Data via the EM algorithm”, J. Roy. Stat. Soc., Series B, vol. 39, pp. 1-38, 1977.

[DRAGON’s web site] http://www.dragonsys.com/products/index.html

[Dubois and Prade 1988] D. Dubois and H. Prade, Possibility Theory; An Approach to Computer-

ized Processing of Uncertainty, Plenum Press, New York, 1988.

[Duda and Hart 1973] R. O. Duda and P. E. Hart, Pattern classification and scene analysis, John

Wiley & Sons, New York, 1973.

[Duisburg’s web site] http://www.uni-duisburg.de/e/Forschung/

[Dunn 1974] J. Dunn, “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact

Well-Separated Cluster”, J. Cybern., vol. 3, pp. 32-57, 1974.

[Doddington 1998] G. R. Doddington, “Speaker Recognition Evaluation Methodology – An

Overview and Perspective”, in Proc. Workshop on Speaker Recognition and its Commercial

and Forensic Applications (RLA2C), pp. 60-66, 1998.

[Fessler and Hero 1994] J. A. Fessler and A. O. Hero, “Space-alternating generalised EM algorithm”,

IEEE Trans. Signal Processing, vol. 42, pp. 2664-2677, 1994.

[Flanagan 1972] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., Springer-

Verlag, New York, 1972.

[Fogel 1995] D. B. Fogel, Evolutionary Computation, Toward A New Philosophy of Machine Intel-

ligence, IEEE Press, New York, 1995.

[Freitas 1998] J. F. G. Freitas, M. Niranjan and A. H. Gee, “The EM algorithm and neural net-

works for nonlinear state space estimation”, Technical Report CUED/F-INFENG/TR 313,

Cambridge University, 1998.

[Furui 1997] Sadaoki Furui, “Recent advances in speaker recognition”, Pattern Recognition Lett., vol.

18, pp. 859-872, 1997.

[Furui 1996] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, chapter 2 in Auto-

matic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K.

Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 31-56, 1996.

[Furui 1994] Sadaoki Furui, “An Overview of Speaker Recognition Technology”, in Proc. ESCA

Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.

[Furui and Sondhi 1991a] Sadaoki Furui and M. Mohan Sondhi, Advances in Speech Signal Process-

ing, Marcel Dekker, Inc., New York, 1991.


[Furui 1991b] Sadaoki Furui, “Speaker-independent and speaker-adaptive recognition techniques”,

in Advances in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi,

Marcel Dekker, Inc., New York, pp. 597-622, 1991.

[Furui 1989] Sadaoki Furui, Digital Speech, Processing, Synthesis, and Recognition, Marcel Dekker,

Inc., New York, 1989.

[Furui 1981] Sadaoki Furui, “Cepstral Analysis Techniques for Automatic Speaker Verification”,

IEEE Trans. Acoustic, Speech, and Signal Processing, vol. 29, pp. 254-272, 1981.

[Gath and Geva 1989] I. Gath and A. B. Geva, “Unsupervised optimal fuzzy clustering”, IEEE

Trans. Patt. Anal. Mach. Intell., PAMI vol. 11, no. 7, pp. 773-781, 1989.

[Ghahramani 1997] Z. Ghahramani, “Factorial Hidden Markov Models”, in Machine Learning, vol.

29, pp. 245-275, Kluwer Academic Publisher, 1997.

[Ghahramani 1995] Z. Ghahramani, “Factorial Learning and the EM Algorithm”, in Adv. Neural

Inform. Processing Syst. G. Tesauro, D.S. Touretzky and J. Alspector (eds.), vol. 7, pp.

617-624, MIT Press, Cambridge, 1995.

[Gish 1990] H. Gish, “Robust discrimination in automatic speaker identification”, in Proc. IEEE

Inter. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’90), pp. 289-292, 1990.

[Gish et al. 1991] H. Gish, M.-H. Siu and R. Rohlicek, “Segregation of speakers for speech recogni-

tion and speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal

Processing (ICASSP’91), pp. 873-876, 1991.

[Goldberg 1989] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning,

Addison-Wesley, 1989.

[Gravier and Chollet 1998] G. Gravier and G. Chollet, “Comparison of Normalization Techniques

for Speaker Verification”, in Proc. on Speaker Recognition and its Commercial and Forensic

Applications (RLA2C), pp. 97-100, 1998.

[Gustafson and Kessel 1979] D. E. Gustafson and W. Kessel, “Fuzzy clustering with a Fuzzy Co-

variance Matrix”, in Proc. IEEE-CDC, (K. S. Fu, ed.), vol. 2, pp. 761-766, IEEE Press,

Piscataway, New Jersey, 1979.

[Harrington and Cassidy 1996] J. Harrington and S. Cassidy, Techniques in Speech Acoustics,

Kluwer Academic Publications, 1996.

[Hartigan 1975] J. Hartigan, Clustering Algorithms, Wiley, NewYork, 1975.

[Hathaway 1986] R. Hathaway, “Another interpretation of the EM algorithm for mixture distribu-

tion”, J. Stat. Prob. Lett., vol. 4, pp. 53-56, 1986.


[Higgins et al. 1991] A. L. Higgins, L. Bahler and J. Porter, “Speaker Verification using Random-

nized Phrase Prompting”, Digital Signal Processing, vol. 1, pp. 89-106, 1991.

[Hoppner et al. 1999] F. Hoppner, F. Klawonn, R. Kruse and T. Runkler, Fuzzy Cluster Analysis

– Methods for classification, Data analysis and Image Recognition, John Wiley & Sons Ltd,

1999.

[Huang et al. 1996] X. Huang, A. Acero, F. Alleva, M. Huang, L. Jiang and M. Mahajan, “From

SPHINX-II to WHISPER: Making speech recognition usable”, chapter 20 in Automatic Speech

and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee, Frank K. Soong, and

Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 481-508, 1996.

[Huang et al. 1990] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models For Speech

Recognition, Edinburgh University Press, 1990.

[Huang and Jack 1989] X. D. Huang and M. A. Jack, “Semi-Continuous Hidden Markov Models

For Speech Signal”, Computer, Speech and Language, vol. 3, pp. 239-251, 1989.

[IBM’s web site] http://www-4.ibm.com/software/speech/

[Jaynes 1957] E. T. Jaynes, “Information theory and statistical mechanics”, Phys. Rev., vol. 106,

pp. 620-630, 1957.

[Juang 1998] B.-H. Juang, “The Past, Present, and Future of Speech Processing”, IEEE Signal

Processing Magazine, vol. 15, no. 3, pp. 24-48, 1998.

[Juang et al. 1996] B.-H. Juang, W. Chou and C.-H. Lee, “Statistical and discriminative methods

for speech recognition”, chapter 5 in Automatic Speech and Speaker Recognition, Advanced

Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic

Publishers, USA, pp. 109-132, 1996.

[Juang and Katagiri 1992] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error

classification”, IEEE Trans. Signal processing, SP-40, no. 12, pp. 3043-3054, 1992.

[Juang and Rabiner 1991] B.-H. Juang and L. R. Rabiner, “Issues in using hidden Markov models

for speech and speaker recognition”, in Advances in Speech Signal Processing, edited by

Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 509-554, 1991.

[Juang 1985] B.-H. Juang, “Maximum likelihood estimation for multivariate observations of Markov

sources”, AT&T Technical Journal, vol. 64, pp. 1235-1239, 1985.

[Kasabov 1998] N. Kasabov, “A framework for intelligent conscious machines and its application to

multilingual speech recognition systems”, in Brain-like computing and intelligent information

systems, S. Amari and N. Kasabov eds. Singapore, Springer Verlag, pp. 106-126, 1998.


[Kasabov et al. 1999] N. Kasabov, R. Kozma, R. Kilgour, M. Laws, J. Taylor, M. Watts and

A. Gray, “Hybrid connectionist-based methods and systems for speech data analysis and

phoneme-based speech recognition” in Neuro-Fuzzy Techniques for Intelligent Information

Processing, N. Kasabov and R.Kozma, eds. Heidelberg, Physica Verlag, 1999.

[Katagiri and Juang 1998] S. Katagiri and B.-H. Juang, “Pattern Recognition using a family of

design algorithms based upon the generalised probabilistic descent method”, invited paper in

Proc. of the IEEE, vol. 86, no. 11, pp. 2345-2373, 1998.

[Katagiri et al. 1991] S. Katagiri, C.-H. Lee and B.-H. Juang, “New discriminative training algo-

rithms based on the generalised descent method”, in Proc. of IEEE Workshop on neural

networks for signal processing, pp. 299-308, 1991.

[Keller et al. 1985] J. M. Keller, M. R. Gray and J. A. Givens, “A fuzzy k-nearest neighbor algo-

rithm”, IEEE Trans. Syst. Man Cybern., vol. SMC-15, no. 4, pp. 580-585, 1985.

[Kewley-Port 1995] Diane Kewley-Port, “Speech recognition”, chapter 9 in Applied Speech Technol-

ogy, edited by A. Syrdal, R. Bennett and S. Greenspan, CRC Press, Inc, USA, 1995.

[Koo and Un 1990] M. Koo and C. K. Un, “Fuzzy smoothing of HMM parameters in speech recog-

nition”, Electronic Letters, vol. 26, pp. 7443-7447, 1990.

[Kosko 1992] B. Kosko, Neural Networks and Fuzzy Systems, Englewood Cliffs, NJ:Prentice-Hall,

1992.

[Krishnapuram and Keller 1993] R. Krishnapuram and J. M. Keller, “A possibilistic approach to

clustering”, IEEE Trans. Fuzzy Syst., vol. 1, pp. 98-110, 1993.

[Krishnapuram et al. 1992] R. Krishnapuram, O. Nasraoui and H. Frigui, “Fuzzy c-spherical shells

algorithm: A new approach”, IEEE Trans. Neural Networks, vol. 3, no. 5, pp. 663-671, 1992.

[Kulkarni 1995] V. G. Kulkarni, Modeling and analysis of stochastic systems, Chapman & Hall, UK,

1995.

[Kuncheva and Bezdek 1997] L. I. Kuncheva and J. C. Bezdek, “A fuzzy generalised nearest proto-

type classifier”, in Proc. the 7th IFSA World Congress, Prague, Czech, vol. III, pp. 217-222,

1997.

[Kunzel 1994] H. J. Kunzel, “Current approaches to forensic speaker recognition”, in Proc. ESCA

Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 135-141,

1994.

[Le et al. 1999] T. V. Le, D. Tran and M. Wagner, “Fuzzy evolutionary programming for hidden

Markov modelling in speaker identification”, in Proc. the Congress on Evolutionary Compu-

tation 99, Washington DC, pp. 812-815, 1999.


[LIMSI’s web site] http://www.limsi.fr/Recherche/TLP/reco/2pg95-sv/2pg95-sv.html

[C.-H. Lee et al. 1996] C.-H. Lee, F. K. Soong and K. K. Paliwal, Automatic speech and speaker

recognition, Advanced topics, Kluwer Academic Publishers, USA, 1996.

[C.-H. Lee and Gauvain 1996] C.-H. Lee and J.-L. Gauvain, “Bayesian adaptive learning and MAP

estimation of HMM”, chapter 4 in Automatic Speech and Speaker Recognition, Advanced

Topics, edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic

Publishers, USA, pp. 83-108, 1996.

[Lee and Leekwang 1995] K. M. Lee and H. Leekwang, “Identification of λ-Fuzzy Measure by Ge-

netic Algorithms”, Fuzzy Sets Syst., vol. 75, pp. 301-309, 1995.

[K.-F. Lee and Alleva 1991] K.-F. Lee and Fil Alleva, “Continuous speech recognition”, in Advances

in Speech Signal Processing, edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker,

Inc., New York, pp. 623-650, 1991.

[Leszczynski et al. 1985] K. Leszczynski, P. Penczek and W. Grochulski, “Sugeno’s Fuzzy Measure

and Fuzzy Clustering”, Fuzzy Sets Syst., vol. 15, pp. 147-158, 1985.

[Levinson et al. 1983] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, “An introduction to the

application of the theory of Probabilistic functions of a Markov process to automatic speech

recognition”, The Bell Syst. Tech. Journal, vol. 62, no. 4, pp. 1035-1074, 1983.

[Li and Mukaidono 1999] R.-P. Li and M. Mukaidono, “Gaussian clustering method based on

maximum-fuzzy-entropy interpretation”, Fuzzy Sets and Systems, vol. 102, pp. 253-258, 1999.

[Linde et al. 1980] Y. Linde, A. Buzo and R. M. Gray, “An Algorithm for Vector Quantization”,

IEEE Trans. Communications, vol. 28, pp. 84-95, 1980.

[Liu et al. 1996] C. S. Liu, H. C. Wang and C.-H. Lee, “Speaker Verification using Normalized Log-Likelihood Score”, IEEE Trans. Speech and Audio Processing, vol. 4, pp. 56-60, 1996.

[Liu et al. 1998] C. Liu, D. B. Rubin and Y. N. Wu, “Parameter Expansion to Accelerate EM: the

PX-EM algorithm”, Biometrika, 1998 (to appear).

[Liu and Rubin 1994] C. Liu and D. B. Rubin, “The ECME algorithm: a simple extension of EM

and ECM with faster monotone convergence”, Biometrika, vol. 81, pp. 633-648, 1994.

[Markov and Nakagawa 1998a] K. P. Markov and S. Nakagawa, “Discriminative training of GMM

using a modified EM algorithm for speaker recognition”, in Proc. Inter. Conf. on Spoken

Language Processing (ICSLP’98), vol. 2, pp. 177-180, Sydney, Australia, 1998.


[Markov and Nakagawa 1998b] K. P. Markov and S. Nakagawa, “Text-independent speaker recog-

nition using non-linear frame likelihood transformation”, Speech Communication, vol. 24, pp.

193-209, 1998.

[Matsui and Furui 1994] T. Matsui and S. Furui, “A new similarity normalisation method for

speaker verification based on a posteriori probability”, in Proc. ESCA Workshop on Au-

tomatic Speaker Recognition, Identification and Verification, pp. 59-62, 1994.

[Matsui and Furui 1993] T. Matsui and S. Furui, “Concatenated Phoneme Models for Text Variable

Speaker Recognition”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing

(ICASSP’93), pp. 391-394, 1993.

[Matsui and Furui 1992] T. Matsui and S. Furui, “Comparison of text-independent speaker recog-

nition methods using VQ-distortion and discrete/continuous HMMs”, in Proc. IEEE Inter.

Conf. on Acoustic, Speech, and Signal Processing (ICASSP’92), San Francisco, pp. II-157-160,

1992.

[Matsui and Furui 1991] T. Matsui and S. Furui, “A text-independent speaker recognition method

robust against utterance variations”, in Proc. IEEE Inter. Conf. on Acoustic, Speech, and

Signal Processing (ICASSP’91), pp. 377-380, 1991.

[McDermott and Katagiri 1994] E. McDermott and S. Katagiri, “Prototype-based MCE/GPD

training for various speech units”, Comp. Speech Language, vol. 8, pp. 351-368, 1994.

[Medasani and Krishnapuram 1998] S. Medasani and R. Krishnapuram, “Categorization of Image

Databases for Efficient Retrieval Using Robust Mixture Decomposition”, in Proc. IEEE Work-

shop on Content Based Access of Images and Video Libraries, IEEE Conference on Computer

Vision and Pattern Recognition, Santa Barbara, pp. 50-54, 1998.

[Meng and Dyk 1997] X. L. Meng and D. van Dyk, “The EM algorithm: an old folk song sung to a fast

new tune (with discussion)”, J. Roy. Stat. Soc., Ser. B, vol. 59, pp. 511-567, 1997.

[Meng and Rubin 1993] X. L. Meng and D. B. Rubin, “Maximum likelihood estimation via the

ECM algorithm: a general framework”, Biometrika, vol. 80, pp. 267-278, 1993.

[MIT’s web site] http://www.sls.lcs.mit.edu/sls/

[Millar et al. 1994] J. B. Millar, J. P. Vonwiller, J. M. Harrington and P. J. Dermody, “The Aus-

tralian National Database of Spoken Language”, in Proc. Inter. Conf. on Acoustic, Speech,

and Signal Processing (ICASSP’94), vol. 1, pp. 97-100, 1994.

[Nadas 1983] A. Nadas, “A decision theoretic formulation of a training problem in speech recognition

and a comparison of training by unconditional versus conditional maximum likelihood”, IEEE

Trans. Signal Processing, vol. 31, no. 4, pp. 814-817, 1983.


[Murofushi and Sugeno 1989] T. Murofushi and M. Sugeno, “An interpretation of Fuzzy Measure

and the Choquet Integral as an Integral with respect to a Fuzzy Measure”, Fuzzy Sets Syst.,

vol. 29, pp. 201-227, 1989.

[Normandin 1996] Y. Normandin, “Maximum mutual information estimation of hidden Markov

models”, chapter 3 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by

Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA,

pp. 57-82, 1996.

[Western Ontario’s web site] http://www.uwo.ca/nca/

[O’Shaughnessy 1987] Douglas O’Shaughnessy, Speech Communication, Addison-Wesley, USA,

1987.

[Ostendorf et al. 1997] M. Ostendorf, V. V. Digalakis and O. A. Kimball, “From HMM’s to segment

models: A unified view of stochastic modeling for speech recognition”, IEEE Trans. Speech

& Audio Processing, vol. 4, no. 5, pp. 360-378, 1997.

[Ostendorf 1996] M. Ostendorf, “From HMM’s to segment models: stochastic modeling for CSR”,

chapter 8 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui

Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 185-210,

1996.

[Otten and van Ginneken 1989] R. H. J. M. Otten and L. P. P. P. van Ginneken, The Annealing Algo-

rithm, Kluwer, Boston, 1989.

[Owens 1993] F. J. Owens, Signal Processing of Speech, McGraw-Hill, Inc., New York, 1993.

[Pal and Majumder 1977] S. K. Pal and D. D. Majumder, “Fuzzy sets and decision making ap-

proaches in vowel and speaker recognition”, IEEE Trans. Syst. Man Cybern., pp. 625-629,

1977.

[Paul 1989] D. B. Paul, “The Lincoln Robust Continuous Speech Recogniser,” Proc. ICASSP 89,

Glasgow, Scotland, pp. 449-452, 1989.

[Peleg 1980] S. Peleg, “A new probability relaxation scheme”, IEEE Trans. Patt. Anal. Mach. In-

tell., vol. 7, no. 5, pp. 617-623, 1980.

[Peleg and Rosenfeld 1978] S. Peleg and A. Rosenfeld, “Determining compatibility coefficients for

curve enhancement relaxation processes”, IEEE Trans. Syst. Man Cybern., vol. 8, no. 7, pp.

548-555, 1978


[Rabiner et al. 1996] L. R. Rabiner, B. H. Juang and C. H. Lee, “An Overview of Automatic Speech

Recognition”, chapter 1 in Automatic Speech and Speaker Recognition, Advanced Topics,

edited by Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers,

USA, pp. 1-30, 1996.

[Rabiner and Juang 1993] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition, Pren-

tice Hall PTR, USA, 1993.

[Rabiner 1989] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech

recognition”, in Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[Rabiner and Juang 1986] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov

models”, IEEE Acoustic, Speech, and Signal Processing Society Magazine, vol. 3, no. 1, pp.

4-16, 1986.

[Rabiner et al. 1983] L. R. Rabiner, S. E. Levinson and M. M. Sondhi, “On the application of vector

quantisation and hidden Markov models to speaker-independent, isolated word recognition”,

The Bell System Technical Journal, vol. 62, no. 4, pp. 1075-1105, 1983.

[Rabiner and Schafer 1978] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals,

Prentice Hall PTR, USA, 1978.

[Rezaee et al. 1998] M. R. Rezaee, B. P. F. Lelieveldt and J. H. C. Reiber, “A new cluster validity

index for the fuzzy c-means”, Patt. Rec. Lett. vol. 19, pp. 237-246, 1998.

[Reynolds 1995a] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mix-

ture speaker models”, Speech Communication, vol. 17, pp. 91-108, 1995.

[Reynolds 1995b] Douglas A. Reynolds and Richard C. Rose, “Robust text-independent speaker

identification using Gaussian mixture models”, IEEE Trans. Speech and Audio Processing,

vol. 3, no. 1, pp. 72-83, 1995.

[Reynolds 1994] Douglas A. Reynolds, “Speaker identification and verification using Gaussian mix-

ture speaker models”, in Proc. ESCA Workshop on Automatic Speaker Recognition, Identifi-

cation and Verification, vol. 17, pp. 91-108, 1994.

[Reynolds 1992] Douglas A. Reynolds, A Gaussian Mixture Modeling Approach to Text-Independent

Speaker Identification, PhD thesis, Georgia Institute of Technology, USA, 1992.

[Rosenberg and Soong 1991] A. E. Rosenberg and Frank K. Soong, “Recent research in automatic

speaker recognition”, in Advances in Speech Signal Processing, edited by Sadaoki Furui and

M. Mohan Sondhi, Marcel Dekker, Inc., New York, pp. 701-740, 1991.


[Rosenberg et al. 1992] A. E. Rosenberg, J. Delong, C.-H. Lee, B.-H. Juang and F. K. Soong, “The

use of cohort normalised scores for speaker verification”, in Proc. Inter. Conf. on Spoken

Language Processing (ICSLP’92), pp. 599-602, 1992.

[Ruspini 1969] E. H. Ruspini, “A new approach to clustering”, in Inform. Control, vol. 15, no. 1,

pp. 22-32, 1969.

[Sagayama 1996] S. Sagayama, “Hidden Markov network for precise acoustic modeling”, chapter

7 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-Hui Lee,

Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp. 159-184,

1996.

[Schwartz et al. 1996] R. Schwartz, L. Nguyen and J. Makhoul, “Multiple-pass search strategies”,

chapter 18 in Automatic Speech and Speaker Recognition, Advanced Topics, edited by Chin-

Hui Lee, Frank K. Soong, and Kuldip K. Paliwal, Kluwer Academic Publishers, USA, pp.

429-456, 1996.

[Siu et al. 1992] M.-H. Siu, G. Yu and H. Gish, “An unsupervised, sequential learning algorithm for

the segmentation of speech waveforms with multiple speakers”, in Proc. IEEE Inter. Conf.

on Acoustics, Speech, and Signal Processing (ICASSP’92), pp. I-189-192, 1992.

[Soong et al. 1987] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, “A vector

quantisation approach to speaker recognition”, AT&T Tech. J., vol. 66, pp. 14-26, 1987.

[Syrdal et al. 1995] A. Syrdal, R. Bennett and S. Greenspan, Applied Speech Technology, CRC Press,

Inc, USA, 1995.

[Tanaka and Guo 1999] H. Tanaka and P. Guo, Possibilistic Data Analysis for Operations Research,

Physica-Verlag Heidelberg, Germany, 1999.

[Tran and Wagner 2000a] Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech

and Speaker Recognition”, the Special Issue on Recognition Technology of the IEEE Trans-

actions on Fuzzy Systems (accepted subject to revision).

[Tran and Wagner 2000b] Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models for Speech Recognition”, submitted to the International Conference on Spoken Language Processing (ICSLP’2000), Beijing, China.

[Tran and Wagner 2000c] Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for Speaker Verification”, submitted to the International Conference on Spoken Language Processing (ICSLP’2000), Beijing, China.


[Tran and Wagner 2000d] Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation

for Speaker Verification”, the International Conference on Acoustics, Speech & Signal Pro-

cessing (ICASSP’2000), Turkey (to appear).

[Tran and Wagner 2000e] Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”,

the International Conference on Advances in Intelligent Systems: Theory and Applications

(ISTA’2000), Australia (to appear).

[Tran and Wagner 2000f] Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZ-

IEEE’2000 Conference, USA (to appear).

[Tran and Wagner 2000g] Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clus-

tering In Speaker Identification”, in Proceedings of the Joint Conference on Information Sci-

ences 2000 (Fuzzy Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City,

NJ, USA.

[Tran and Wagner 2000h] Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and

Probabilistic Models for Pattern Recognition”, the International Conference on Advances in

Intelligent Systems: Theory and Applications (ISTA’2000), Australia (to appear).

[Tran et al. 2000a] Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for Speech Recognition”, the 4th World Multiconference on Systemics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear).

[Tran et al. 2000b] Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models for Speaker Recognition”, the 4th World Multiconference on Systemics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear).

[Tran 1999] Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999

IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia.

[Tran and Wagner 1999a] Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy es-

timation”, in Proceedings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999,

Hungary.

[Tran and Wagner 1999b] Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algo-

rithm for speech and speaker recognition”, in Proceedings of the 18th International Conference

of the North American Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA.

[Tran and Wagner 1999c] Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech

and speaker recognition”, in Proceedings of the 18th International Conference of the North

American Fuzzy Information Society (NAFIPS’99), pp. 426-430, 1999, USA.


[Tran and Wagner 1999d] Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture models and generalised Gaussian mixture models”, in Proceedings of the Computational Intelligence Methods and Applications (CIMA’99) Conference, pp. 154-158, 1999, USA.

[Tran and Wagner 1999e] Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy Gaussian mixture models for speaker identification”, in Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp. 337-340, 1999, Adelaide, Australia.

[Tran et al. 1999a] Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models in Speech and Speaker Recognition”, in 1999 IEEE International Fuzzy Systems Conference Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea.

[Tran et al. 1999b] Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype clas-

sifier applied to speaker identification”, in Proceedings of the European Symposium on Intel-

ligent Techniques (ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece.

[Tran et al. 1999c] Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture models and relaxation labeling”, in Proceedings of the 3rd World Multiconference on Systemics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’99), vol. 6, pp. 383-389, 1999, USA.

[Tran et al. 1999d] Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling ap-

plied to speech and speaker recognition”, in a special issue of the Journal of Pattern Recog-

nition Letters (Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999.

[Tran 1998] Dat Tran, “Hidden Markov models using state distribution”, the first prize of the 1998 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia.

[Tran and Wagner 1998] Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for

Speaker Recognition”, in Special issue of the Australian Journal of Intelligent Information

Processing Systems (AJIIPS), vol. 5, no. 4, pp. 293-300, 1998.

[Tran et al. 1998a] Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP’98), vol. 2, pp. 759-762, 1998, Australia.

[Tran et al. 1998b] Dat Tran, Michael Wagner and T. VanLe, “A proposed decision rules based on fuzzy c-means clustering for speaker recognition”, in Proceedings of the International Conference on Spoken Language Processing (ICSLP’98), vol. 2, pp. 755-758, 1998, Australia.

[Tran et al. 1998c] Dat Tran, Michael Wagner and Tuan Pham, “Minimum Classifier Error and Relaxation Labelling for Speaker Recognition”, in Proceedings of the Speech and Computer Workshop (SPECOM’98), St Petersburg, pp. 229-232, 1998, Russia.


[Tran et al. 1998d] Dat Tran, Minh Do, Michael Wagner and T. VanLe, “A proposed decision rule for speaker identification based on a posteriori probability”, in Proceedings of the ESCA Workshop (RLA2C’98), pp. 85-88, 1998, France.

[Tseng et al. 1987] H.-P. Tseng, M. J. Sabin and E. A. Lee, “Fuzzy vector quantisation applied

to hidden Markov modelling”, in Proc. of the Inter. Conf. on Acoustics, Speech & Signal

Processing (ICASSP’87), pp. 641-644, 1987.

[Tsuboka and Nakahashi 1994] E. Tsuboka and J. Nakahashi, “On the fuzzy vector quantisation based hidden Markov model”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’94), vol. 1, pp. 637-640, 1994.

[Upper 1997] D. R. Upper, Theory and algorithms for hidden Markov models and generalised hidden

Markov models, PhD thesis in Mathematics, University of California at Berkeley, 1997.

[Varga and Moore 1990] A. P. Varga and R. K. Moore, “Hidden Markov model decomposition

of speech and noise”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing

(ICASSP’90), pp. 845-848, 1990.

[Wagner 1996] Michael Wagner, “Combined speech-recognition speaker-verification system with

modest training requirements”, in Proc. Sixth Australian International Conf. on Speech Sci-

ence and Technology, Adelaide, Australia, pp. 139-143, 1996.

[Wang 1992] Z. Wang and G. J. Klir, Fuzzy Measure Theory, Plenum Press, 1992.

[Wilcox et al. 1994] L. Wilcox, F. Chen, D. Kimber, and V. Balasubramanian, “Segmentation of

speech using speaker identification”, in Proc. IEEE Inter. Conf. on Acoustics, Speech, and

Signal Processing (ICASSP’94), pp. I-161-164, 1994.

[Windham 1983] M. P. Windham, “Geometrical fuzzy clustering algorithms”, Fuzzy Sets Syst., vol. 10, pp. 271-279, 1983.

[Woodland 1997] P. C. Woodland, “Broadcast news transcription using HTK”, in Proc. Inter. Conf. on Acoustics, Speech & Signal Processing (ICASSP’97), USA, 1997.

[Wu 1983] C. F. J. Wu, “On the convergence properties of the EM algorithm”, Ann. Stat., vol. 11,

pp. 95-103, 1983.

[Yang and Cheng 1993] M. S. Yang and C. T. Chen, “On strong consistency of the fuzzy generalized nearest neighbour rule”, Fuzzy Sets Syst., vol. 60, no. 3, pp. 273-281, 1993.

[Zadeh 1995] L. A. Zadeh, “Discussion: probability theory and fuzzy logic are complementary rather

than competitive”, Technometrics, vol. 37, no. 3, pp. 271-276, 1995.


[Zadeh 1994] L. A. Zadeh, “Fuzzy logic, neural networks, and soft computing”, Communications of

the ACM, vol. 37, no. 3, pp. 77-84, 1994.

[Zadeh 1978] L. A. Zadeh, “Fuzzy sets as a basis for a theory of possibility”, Fuzzy Sets and Systems,

vol. 1, no. 1, pp. 3-28, 1978.

[Zadeh 1977] L. A. Zadeh, “Fuzzy sets and their application to pattern classification and clustering

analysis”, Classification and Clustering, edited by J. Van Ryzin, Academic Press Inc, pp.

251-282 & 292-299, 1977.

[Zadeh 1976] L. A. Zadeh, “The linguistic approach and its application to decision analysis”, Di-

rections in large scale systems, edited by Y. C. Ho and S. K. Mitter, Plenum Publishing

Corporation, pp. 339-370, 1976.

[Zadeh 1968] L. A. Zadeh, “Probability measures of fuzzy events”, J. Math. Anal. Appl., vol. 23,

no. 2, pp. 421-427, 1968.

[Zadeh 1965] L. A. Zadeh, “Fuzzy Sets”, Inform. Control, vol. 8, no. 3, pp. 338-353, 1965.

[Zhuang et al. 1989] X. Zhuang, R. M. Haralick and H. Joo, “A simplex-like algorithm for the

relaxation labeling process”, IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, pp. 1316-1321,

1989.


Appendix A

List of Publications

1. Dat Tran, “Fuzzy Entropy Models for Speech Recognition”, the first prize of the 1999 IEEE

ACT Section Student Paper Contest, in the Postgraduate Division, Australia.

2. Dat Tran, “Hidden Markov models using state distribution”, the first prize of the 1998 IEEE ACT Section Student Paper Contest, in the Postgraduate Division, Australia.

3. Dat Tran and Michael Wagner, “Fuzzy Modelling Techniques for Speech and Speaker Recogni-

tion”, the Special Issue on Recognition Technology of the IEEE Transactions on Fuzzy Systems

(accepted subject to revision).

4. Dat Tran and Michael Wagner, “Fuzzy Entropy Hidden Markov Models for Speech Recog-

nition”, submitted to the International Conference on Spoken Language Processing (IC-

SLP’2000), Beijing, China.

5. Dat Tran and Michael Wagner, “Fuzzy Normalisation Methods for Speaker Verification”,

submitted to the International Conference on Spoken Language Processing (ICSLP’2000),

Beijing, China.

6. Dat Tran and Michael Wagner, “A Proposed Likelihood Transformation for Speaker Verifica-

tion”, the International Conference on Acoustics, Speech & Signal Processing (ICASSP’2000),

Turkey (to appear).

7. Dat Tran and Michael Wagner, “Frame-Level Hidden Markov Models”, the International

Conference on Advances in Intelligent Systems: Theory and Applications (ISTA’2000), Aus-

tralia (to appear).

8. Dat Tran and Michael Wagner, “A General Approach to Hard, Fuzzy, and Probabilistic

Models for Pattern Recognition”, the International Conference on Advances in Intelligent

Systems: Theory and Applications (ISTA’2000), Australia (to appear).

9. Dat Tran and Michael Wagner, “Fuzzy Entropy Clustering”, the FUZZ-IEEE’2000 Confer-

ence, USA (to appear).


10. Dat Tran and Michael Wagner, “An Application of Fuzzy Entropy Clustering In Speaker

Identification”, in Proceedings of the Joint Conference on Information Sciences 2000 (Fuzzy

Theory and Technology Track), vol. 1, pp. 228-231, 2000, Atlantic City, NJ, USA.

11. Dat Tran and Michael Wagner, “Hidden Markov models using fuzzy estimation”, in Proceed-

ings of the EUROSPEECH’99 Conference, vol. 6, pp. 2749-2752, 1999, Hungary.

12. Dat Tran and Michael Wagner, “Fuzzy expectation-maximisation algorithm for speech and

speaker recognition”, in Proceedings of the 18th International Conference of the North Amer-

ican Fuzzy Information Society (NAFIPS’99), pp. 421-425, 1999, USA (Outstanding Student

Paper Award, Top Honor).

13. Dat Tran and Michael Wagner, “Fuzzy hidden Markov models for speech and speaker recog-

nition”, in Proceedings of the 18th International Conference of the North American Fuzzy

Information Society (NAFIPS’99), pp. 426-430, 1999, USA (Outstanding Student Paper

Award, Top Honor).

14. Dat Tran and Michael Wagner, “Fuzzy approach to Gaussian mixture models and generalised Gaussian mixture models”, in Proceedings of the Computational Intelligence Methods and Applications (CIMA’99) Conference, pp. 154-158, 1999, USA.

15. Dat Tran and Michael Wagner, “A robust clustering approach to fuzzy Gaussian mixture models for speaker identification”, in Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES’99), pp. 337-340, 1999, Adelaide, Australia.

16. Dat Tran and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker Recognition”,

in Special issue of the Australian Journal of Intelligent Information Processing Systems (AJI-

IPS), vol. 5, no. 4, pp. 293-300, 1998.

17. Dat Tran, Michael Wagner and T. VanLe, “A proposed decision rules based on fuzzy C-means

clustering for speaker recognition”, in Proceedings of the International Conference on Spoken

Language Processing (ICSLP’98), vol. 2, pp. 755-758, 1998, Australia.

18. Dat Tran, Michael Wagner and Tuan Pham, “Hard Hidden Markov Models for Speech Recognition”, the 4th World Multiconference on Systemics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear).

19. Dat Tran, Michael Wagner and Tuan Pham, “Hard Gaussian Mixture Models for Speaker Recognition”, the 4th World Multiconference on Systemics, Cybernetics and Informatics/ The 6th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’2000), Florida, USA (to appear).


20. Dat Tran, Michael Wagner and Tuan Pham, “Minimum Classifier Error and Relaxation Labelling for Speaker Recognition”, in Proceedings of the Speech and Computer Workshop (SPECOM’98), St Petersburg, pp. 229-232, 1998, Russia.

21. Dat Tran, Michael Wagner and Tongtao Zheng, “State mixture modelling applied to speech

and speaker recognition”, in a special issue of the Journal of Pattern Recognition Letters

(Pattern Recognition in Practice VI), vol. 20, no. 11-13, pp. 1449-1456, 1999.

22. Dat Tran, Michael Wagner, and Tongtao Zheng, “A Fuzzy Approach to Statistical Models in Speech and Speaker Recognition”, in 1999 IEEE International Fuzzy Systems Conference Proceedings (FUZZ-IEEE’99), vol. 3, pp. 1275-1280, 1999, Korea.

23. Dat Tran, Michael Wagner and Tongtao Zheng, “Fuzzy nearest prototype classifier applied to

speaker identification”, in Proceedings of the European Symposium on Intelligent Techniques

(ESIT’99) on CD-ROM, abstract on page 34, 1999, Greece.

24. Dat Tran, Tuan Pham, and Michael Wagner, “Speaker recognition using Gaussian mixture models and relaxation labeling”, in Proceedings of the 3rd World Multiconference on Systemics, Cybernetics and Informatics/ The 5th Int. Conf. Information Systems Analysis and Synthesis (SCI/ISAS’99), vol. 6, pp. 383-389, 1999, USA.

25. Dat Tran, T. VanLe and Michael Wagner, “Fuzzy Gaussian Mixture Models for Speaker

Recognition”, in Proceedings of the International Conference on Spoken Language Processing

(ICSLP’98), vol. 2, pp. 759-762, 1998, Australia (paper selected to publish in a special issue

of the Australian Journal of Intelligent Information Processing Systems).

26. Dat Tran, Minh Do, Michael Wagner and T. VanLe, “A proposed decision rule for speaker

identification based on a posteriori probability”, in Proceedings of the ESCA Workshop

(RLA2C’98), pp. 85-88, 1998, France.

27. Tuan Pham, Dat Tran and Michael Wagner, “Optimal fuzzy information fusion for speaker verification”, in Proceedings of the Computational Intelligence Methods and Applications Conference (CIMA’99), pp. 141-146, 1999, USA.

28. Tuan Pham, Dat Tran and Michael Wagner, “Speaker verification using relaxation labeling”,

in Proceedings of the ESCA Workshop (RLA2C’98), pp. 29-32, 1998, France.

29. T. VanLe, Dat Tran and Michael Wagner, “Fuzzy evolutionary programming for hidden Markov modelling in speaker identification”, in Proceedings of the Congress on Evolutionary Computation (CEC’99), Washington DC, pp. 812-815, July 1999.