
Speaker Profiling for Forensic Applications

Amir Hossein Poorjam

Thesis submitted for the degree of Master of Science in
Electrical Engineering, option Embedded Systems and Multimedia

Thesis supervisor: Prof. dr. ir. Hugo Van hamme

Assessors: Prof. dr. ir. Dirk Van Compernolle
Prof. dr. ir. Marc Moonen

Mentor: Dr. ir. Mohamad Hasan Bahari

Academic year 2013 – 2014


© Copyright KU Leuven

Without written permission of the thesis supervisor and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to Departement Elektrotechniek, Kasteelpark Arenberg 10 postbus 2440, B-3001 Heverlee, +32-16-321130 or by email [email protected].

A written permission of the thesis supervisor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.


Preface

I would like to express my special appreciation and thanks to my promotor, professor Dr. Hugo Van hamme, for offering me the opportunity to do research in his group. I am grateful to him for his consideration all the time, and for always guiding me down the right path.

I would also like to express my sincere gratitude to my daily supervisor, Dr. Mohamad Hasan Bahari, for his continuous support of my research. His thoughtful insights in this field have inspired me greatly throughout this thesis work. The good advice, support and friendship of Hasan have been invaluable on both an academic and a personal level, for which I am extremely grateful. He has also provided me with the necessary materials to perform research on speaker profiling.

My thanks go to the members of the PSI Speech group at ESAT, who provided me with facilities in the speech lab to fulfill my thesis.

I would also like to thank professor Dr. Dirk Van Compernolle for his advice on my first thesis presentation and the speech group presentation sessions. My thanks go to him and the other members of the jury for reading this thesis.

Words cannot express how grateful I am to my parents for supporting me spiritually throughout my life and providing me the opportunity to study abroad, and also to my mother-in-law and father-in-law for all their kindness and support. I would like to express my gratefulness to them.

My special and sincere appreciation goes to my beloved wife, who was always my support in all moments. I lovingly dedicate this thesis to her, who supported me each step of the way.

I would like to thank my previous supervisor in my Bachelor study, professor Dr. Jahangir Bagheri, who has always encouraged me during my study and motivated me to continue my higher education in this field. I learned a lot from his knowledge and personality.

Finally, my thanks go to all my lovely friends: Ali Charkhi, Mostafa Yaghobi, Dr. Hadi Aliakbarian, Taha Mirhoseini, Dr. Majid Hosseinzadeh, Reza Sahraeian, Milad Yavari, Saeed Reza Toghi, Hasan Farrokhzad, Rahim Khanizad and all other new and old friends for accompanying, encouraging and helping me during my study in Leuven.

Amir Hossein Poorjam


Contents

Preface
Abstract
List of Abbreviations and Symbols
1 Introduction
2 Automatic Speaker’s Age Estimation from Spontaneous Telephone Speech
   2.1 Introduction
   2.2 System Description
   2.3 Experimental Setup
   2.4 Results and Discussion
   2.5 Conclusion
3 Automatic Speaker’s Height Estimation from Spontaneous Telephone Speech
   3.1 Introduction
   3.2 System Description
   3.3 Experimental Setup
   3.4 Results and Discussion
   3.5 Conclusion
4 Automatic Speaker’s Weight Estimation from Spontaneous Telephone Speech
   4.1 Introduction
   4.2 System Description
   4.3 Experimental Setup
   4.4 Results and Discussion
   4.5 Conclusion
5 Automatic Smoker Detection from Spontaneous Telephone Speech
   5.1 Introduction
   5.2 System Description
   5.3 Experimental Setup
   5.4 Results and Discussion
   5.5 Conclusions
6 Multitask Speaker Profiling
   6.1 Introduction
   6.2 System Description
   6.3 Experimental Setup
   6.4 Results and Discussion
   6.5 Conclusion
7 Conclusion
   7.1 Summary and Contributions
   7.2 Future Direction
A Artificial Neural Networks (ANNs)
   A.1 Multilayer Perceptron Neural Networks
   A.2 Regression and Classification using MLPs
   A.3 Limitations of ANNs
B Least Squares Support Vector Machines (LSSVM)
   B.1 Least Squares Support Vector Machines for Classification
   B.2 Least Squares Support Vector Machines for Regression
C Logistic Regression
   C.1 Logistic Regression for Binary Classification
   C.2 MLE in the Logistic Regression Model
   C.3 Advantages and Limitations of the Logistic Regression
Bibliography


Abstract

Speech signals convey important paralinguistic information such as age, gender, body size, language, accent and emotional state of speakers. Automatic identification of speaker traits and states has a wide range of forensic, commercial and medical applications in real-world scenarios. This thesis proposes a novel approach for automatic estimation of four forensically important components of speaker profiling systems, namely speaker age, height, weight and smoking habit estimation, from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector framework, which is based on factor analysis on Gaussian Mixture Model (GMM) mean supervectors, and the Non-negative Factor Analysis (NFA) framework, which is based on a constrained factor analysis on GMM weight supervectors. Then, Artificial Neural Networks (ANNs) and Least Squares Support Vector Regression (LSSVR) are employed to estimate the age, height and weight of speakers from given utterances. Various classification techniques such as ANNs, Logistic Regression (LR), the Naive Bayesian Classifier (NBC), Gaussian Scoring (GS) and Von Mises-Fisher Scoring (VMF) are also utilized to perform smoking habit detection. Since GMM weights provide complementary information to GMM means, a score-level fusion of the i-vector-based and the NFA-based recognizers is considered for the speaker age estimation and smoking habit detection tasks to improve the performance.

In addition, inspired by the human learning system, in which related tasks are learned in interaction with each other, a multitask speaker profiling approach is proposed to evaluate the correlated tasks simultaneously and, consequently, to boost the accuracy of speaker age, height, weight and smoking habit estimation. To this end, a hybrid architecture involving the score-level fusion of the i-vector-based and the NFA-based recognizers is proposed. ANNs are then employed to provide an appropriate architecture to share the learned information across all tasks while they are learned in parallel.

The proposed method has two major distinctions from previous speaker profiling approaches. First, the information in both GMM means and weights is employed through a score-level fusion of the i-vector-based and the NFA-based recognizers. Second, by applying multitask learning, correlated tasks, which are usually investigated in isolation, are evaluated simultaneously and in interaction with each other.

The suggested approach is evaluated on telephone speech signals of the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) corpora. Experimental results over 1194 utterances show the effectiveness of the proposed method in automatic speaker profiling.


List of Abbreviations and Symbols

General Abbreviations

ANNs : Artificial Neural Networks
AUC : Area Under the ROC Curve
BFG : Broyden Fletcher Goldfarb
CC : (Pearson) Correlation Coefficient
CGF : Fletcher-Reeves Conjugate Gradient
GMM : Gaussian Mixture Model
GS : Gaussian Scoring
LM : Levenberg-Marquardt
LR : Logistic Regression
LSSVM : Least Squares Support Vector Machines
LSSVR : Least Squares Support Vector Regression
MAE : Mean Absolute Error
MAP : Maximum-A-Posteriori
MFCC : Mel-Frequency Cepstrum Coefficient
MLP : Multilayer Perceptron
MTL : Multitask Learning
NBC : Naive Bayesian Classifier
NFA : Non-negative Factor Analysis
NIST : National Institute of Standards and Technology
RBF : Radial Basis Function
ROC : Receiver Operating Characteristic
SGR : Subglottal Resonance
SRE : Speaker Recognition Evaluation
STL : Single-Task Learning
UBM : Universal Background Model
VMF : Von Mises-Fisher Scoring
VTL : Vocal Tract Length


General Symbols and Definitions

x_i : ith utterance
o : Acoustic vector
y_i : Label of the ith utterance
y_i^A : Age label of the ith utterance
y_i^H : Height label of the ith utterance
y_i^W : Weight label of the ith utterance
y_i^S : Smoking habit label of the ith utterance
ŷ : Estimated label
λ : Parameters of the UBM
M : GMM mean supervector
T : Subspace matrix
u : The UBM mean supervector
v : i-vector
w : GMM weight supervector
L : Subspace matrix
b : The UBM weight supervector
r : NFA vector
N : Total number of test samples
F_0^min : Lowest fundamental frequency of voice
θ : Parameters of the logistic regression model
Ψ : Covariance matrix
C_llr,min : Minimum Log-Likelihood Ratio Cost


Chapter 1

Introduction

Speech signals convey important information about speakers such as age, gender, body size, language, accent and emotional state. Speaker profiling refers to extracting information about a speaker from his/her speech pattern. Automatic identification of speaker characteristics has a wide range of forensic, commercial and medical applications in real-world scenarios.

Forensics is one of the most important areas of application for speaker profiling, where it can give cues to the identities of unknown speakers. Police investigators continuously look for technologies to enhance investigative techniques; hence, speaker profiling is considered an important investigation tool for police applications. In some forensic scenarios, a voice recording of the criminal act can be made, e.g. a threat call or a blackmail call. Police inspectors may have a list of suspects, but no recording of a suspect is available to be compared with the voice of the unknown speaker, and they might lose time by checking all suspects. Since age and physical characteristics are important factors when forming a picture of an unknown speaker, it can be beneficial in the early stages of a police investigation to rank suspects according to objective criteria such as gender, age and body size. This action falls in the automatic speaker profiling category. Identification of speakers’ characteristics can also be performed by listeners, which is out of the scope of this study; in that case, the recorded voice sample is presented to a wide public to find suspects [74].

Speaker profiling is also used in other applications such as improving service quality in dialog systems, categorizing large music databases with potentially unknown artists, protection of children in web environments, interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and adaptation of waiting-queue music for offering the most suitable advertisements to callers in the waiting queue [73, 77, 116, 101, 38, 91, 90, 62].

A variety of speaker traits can be inferred from speech. In this study, however, only four characteristics, namely speaker age, height, weight and smoking habit, which can be considered the most important traits from the forensic point of view, are investigated.

Automatic speaker profiling requires features in the voice pattern that provide cues to speakers’ characteristics. Most of the voices used in speaker profiling are speech samples; however, other voices such as laughs or screams may also be forensically important. Various acoustic features such as fundamental frequency, sound pressure level, voice quality, distribution of spectral energy, amplitude, pitch, formants, vocal tract length warping factor, jitter, shimmer and speech rate have been demonstrated to have a (weak or strong) correlation with aspects of speakers’ characteristics. For instance, segment duration, sound pressure level range and cepstral features were reported as the most important acoustic features that correlate with the age of speakers [89, 1, 99, 69, 112], and speech rate was reported as an acoustic feature with a significant correlation with the weight of speakers [104]. However, the relations of these acoustic features are usually influenced by other factors such as language, gender, speech context, smoking habits, level of intoxication, body size, channel conditions and emotional state, which are not typically tractable in real-world situations [21, 87, 12, 13].

Modeling speech utterances with Gaussian Mixture Model (GMM) mean supervectors has been considered an effective approach to speaker recognition systems [25]. However, due to the high-dimensional nature of these vectors, a robust model with GMM mean supervectors is not easily obtained when limited data are available. Efforts towards effective dimensionality reduction of GMM mean supervectors, such as weighted-pairwise principal component analysis (WPPCA) based on the nuisance attribute projection technique, have improved the performance of speaker recognition systems [38].

Recent advances using the i-vector framework, based on factor analysis for GMM mean adaptation and decomposition, have effectively increased the accuracy of speaker profiling systems [33]. The i-vector framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, can be effectively substituted for GMM mean supervectors in speaker profiling tasks [9, 83]. In addition, various studies show that GMM weights carry complementary information to GMM means [6, 62, 115, 11, 7, 8]. A more recent framework, named Non-negative Factor Analysis (NFA), is based on adaptation and decomposition of GMM weights, and yields a new low-dimensional utterance modeling approach.

In this study, novel approaches for four forensically important speaker profiling tasks, based on the i-vector and the NFA frameworks, are proposed in two steps. In the first step, new techniques for speaker age, height and weight estimation as well as smoking habit detection are proposed independently. This step provides baselines for the next phase of the experiments, in which a new method is proposed to investigate the correlated tasks simultaneously, in the form of a multitask learning approach. The goal of this study is to improve the performance of the above-mentioned speaker profiling tasks. To demonstrate the effectiveness of the proposed methods, a large corpus of speech samples consisting of the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) databases is utilized.

In Chapter 2, a new approach to automatic speaker age estimation is proposed. This approach is based on a hybrid architecture of the i-vector and the NFA frameworks. The distinction of the proposed method from previous speech-based methods is that it employs the information in the GMM weights in conjunction with the information in the GMM means, through a score-level fusion of the i-vector-based and the NFA-based estimators, in order to enhance the accuracy of age estimation. Two different function approximation methods, namely least squares support vector regression (LSSVR) and artificial neural networks (ANNs), are utilized and compared in this chapter.

Novel methods for speaker height and weight estimation based on the i-vector framework are described in detail in Chapter 3 and Chapter 4, respectively. The goal there is to investigate the effectiveness of the i-vector framework in estimating speakers’ body size. In these chapters, height and weight estimation is performed by training LSSVR and ANNs with the i-vectors.

Different speech analysis systems, such as speaker gender detection, age estimation, intoxication-level recognition and emotional state identification, are influenced by smoking. Due to the importance of an automatic smoking habit detection system and of analyzing the effects of smoking habits on speech signals in forensic applications, automatic smoker detection from spontaneous telephone speech signals is proposed, to my knowledge for the first time, in Chapter 5. In this method, each utterance is modeled using the i-vector and the NFA frameworks. Then, various classification algorithms are employed to detect smokers. Finally, score-level fusion of the i-vector-based and the NFA-based recognizers is considered to boost the classification accuracy.

Inspired by the human learning system, in which related tasks are learned in interaction with each other, a multitask learning (MTL) approach is proposed in Chapter 6 to improve the performance of the speaker profiling system. Using an MTL approach, this study aims at evaluating the correlated tasks simultaneously and in interaction with each other. In addition, this chapter explores MTL as an approach to improve the accuracy of recognizers by sharing the learned information between the related tasks.

Finally, Chapter 7 concludes this thesis and suggests possible directions for future work.

Various techniques are employed in this study for classification and regression. Among them, the concepts and relations of the ANN, LSSVR and logistic regression (LR) techniques are selected as the most important and commonly used methods to be described in more detail. In order to maintain the integrity of the content in the chapters, they are elaborated in Appendix A, Appendix B and Appendix C, respectively.


Chapter 2

Automatic Speaker’s Age Estimation from Spontaneous Telephone Speech

2.1 Introduction

Speaker age estimation, as an important component of speaker profiling, can be utilized in forensic applications to direct police investigations. However, the range of its applications is not limited to forensic cases, since it can be effectively employed in commercial, medical and educational applications. Service customization, protection of children in web environments, adaptation of waiting-queue music for offering the most suitable advertisements to callers in the waiting queue, and human-computer interaction systems are examples of applications of automatic speaker age estimation in other fields. This wide range of applications has attracted many researchers’ attention and encouraged them to investigate automatic age estimation precisely.

Like other speaker profiling tasks, automatic age estimation from speech signals involves two problems: first, finding an appropriate utterance modeling procedure by extracting the most relevant features of the acoustic signals that provide cues to the age of speakers, and second, providing an appropriate function approximator or classifier to estimate the age of speakers as accurately as possible. Thus, reliable automatic speaker age estimation requires a large corpus of utterances with corresponding age labels, containing voices uttered by a wide range of ages. In addition, there is a difference between the perceived age and the calendar age of a speaker. The perceived age of a speaker is modified by factors such as drinking and smoking habits and physiological condition [21, 84]. The authors of [93] showed that the correlation between calendar age and perceived age is 0.88; the correlation reported in [79], however, was 0.77. This issue makes the problem of automatic age estimation more challenging.

As the speech production system is modified with aging, speech is affected in numerous ways. Many acoustic features such as fundamental frequency, sound pressure level, voice quality, distribution of spectral energy, amplitude and speech rate are modified with aging [85, 2, 113, 87]. These age-dependent features can be used in automatic age estimation. Schötz and Müller investigated the correlation of numerous acoustic features with the age of speakers and reported two features, namely segment duration and sound pressure level range, as the most important acoustic features that correlate with the age of speakers [89].

Ajmera et al. applied the discrete cosine transform to the cepstral coefficients and showed that the coefficients corresponding to the lower modulation frequencies provide the best discrimination of age [1]. In further studies, other features such as pitch, energy, formants, vocal tract length warping factor and speaking rate were added to the cepstral features at the frame or utterance level to improve the performance [99, 69, 112]. However, the relations of these acoustic features are usually influenced by other factors such as language, gender, speech context, smoking habits, level of intoxication, body size, channel conditions and emotional state, which are not typically tractable in real-world situations [21, 87, 12, 13].

Furthermore, many studies have focused on classifying speakers into age groups by utilizing techniques such as Gaussian Mixture Model (GMM) mean supervector and Support Vector Machine systems [19, 66, 28, 105], nuisance attribute projection [37], parallel phoneme recognizers [70], maximum mutual information training [58] and anchor models [37, 58]. These techniques were mostly taken from speaker verification and language identification applications. By combining various classification methods, significant improvements in the accuracy of speaker age classification have been reported in [75, 105, 69, 20, 58].

In a study by Bocklet et al., the ages of children in pre-school and primary school were effectively estimated by modeling speech signals with GMM mean supervectors and a support vector regression (SVR) approach [19]. Although GMM mean supervectors are effective in speaker age estimation, since they are high-dimensional vectors, obtaining a robust model is not straightforward, especially when limited data are available. Dobry reduced the dimension of GMM mean supervectors by means of weighted-pairwise principal component analysis (WPPCA) based on the nuisance attribute projection technique, and enhanced the performance of age estimation [38].

In the field of speaker recognition, recent advances using the i-vector framework [33] have considerably increased classification accuracy [9]. The i-vector framework provides a compact representation of an utterance in the form of a low-dimensional feature vector. GMM mean supervectors can be effectively substituted by i-vectors in speaker age estimation [9, 10].

Various studies demonstrate that although GMM weights, which entail a lower dimension compared with Gaussian mean supervectors, convey less information than GMM means, they contain complementary information to GMM means [6, 62, 115, 11]. Bahari et al. have recently introduced a new framework based on factor analysis for GMM weight adaptation and decomposition [7, 34]. In this method, named non-negative factor analysis (NFA), the applied factor analysis is constrained such that the adapted GMM weights are non-negative and sum to unity. This method, which yields a new low-dimensional utterance representation approach, was successfully applied to speaker and language/dialect recognition [7, 8].

In this chapter, a novel approach for speaker age estimation based on a compound architecture of the i-vector and the NFA frameworks is proposed. This architecture consists of two subsystems based on the i-vectors and the NFA vectors. To improve the performance of the proposed speaker age estimation, score-level fusion of the i-vector-based and the NFA-based function approximators is also considered. The superiority of this method over previous age estimation methods is that the information in both GMM means and GMM weights is employed to enhance the accuracy of age estimation.
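The score-level fusion just described combines the two subsystems' age estimates into a single prediction. The following sketch illustrates the idea with a simple convex combination; the fusion weight and the toy scores are illustrative assumptions, not values tuned in this thesis.

```python
import numpy as np

def fuse_scores(ivector_scores, nfa_scores, alpha=0.5):
    """Score-level fusion: convex combination of the age estimates
    produced by the i-vector-based and the NFA-based subsystems.
    alpha is a placeholder fusion weight in [0, 1], not a tuned value."""
    ivector_scores = np.asarray(ivector_scores, dtype=float)
    nfa_scores = np.asarray(nfa_scores, dtype=float)
    return alpha * ivector_scores + (1.0 - alpha) * nfa_scores

# Toy example: age estimates (in years) for two test utterances
fused = fuse_scores([34.0, 52.0], [30.0, 48.0], alpha=0.6)  # → [32.4, 50.4]
```

In practice the fusion weight would be selected on a development set rather than fixed a priori.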

To select an accurate regression approach for this problem, two different function approximation approaches, namely least-squares support vector regression (LSSVR) and artificial neural networks (ANNs), are compared.

In this research, the effectiveness of the proposed method is investigated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach compared with the results of a baseline provided from the same database [9].

The rest of the chapter is organized as follows. In Section 2.2, the problem of automatic age estimation is formulated and the proposed approach is described. Section 2.3 explains the experimental setup. The evaluation results are presented and discussed in Section 2.4. Section 2.5 concludes the chapter.

2.2 System Description

In this section, after the problem formulation, the main constituents of the proposed method are described.

2.2.1 Problem Formulation

In the speaker age estimation problem, we are given a set of training data D = \{(x_i, y_i)\}_{i=1}^{N}, where x_i \in \mathbb{R}^p denotes the ith utterance and y_i \in \mathbb{R} denotes the corresponding chronological age. The goal is to design an estimation function g such that, for an utterance of an unseen speaker x_{tst}, the estimated age \hat{y} = g(x_{tst}) approximates the actual age as well as possible in some predefined sense.
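Estimation quality in such a setup is typically quantified with the mean absolute error (MAE, listed among the abbreviations). A minimal sketch of the evaluation step, with toy labels chosen purely for illustration:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE = (1/N) * sum_i |y_i - g(x_i)| over the N test utterances."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: chronological ages vs. estimates g(x_tst)
mae = mean_absolute_error([25, 40, 63], [28, 37, 60])  # → 3.0
```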

2.2.2 Utterance Modeling

The first step in speaker age estimation is converting variable-duration speech signals into fixed-dimensional vectors suitable for regression algorithms. This is performed by fitting a GMM to acoustic features extracted from each speech signal; the utterance is then characterized by the parameters of the obtained GMM.

Since the available data is limited, we are not able to accurately fit a separate GMM for a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, parametric utterance adaptation techniques should be applied to adapt a universal background model (UBM) to the characteristics of utterances in the training and testing databases. In this chapter, the i-vector framework for adapting UBM means and the NFA framework for adapting UBM weights are applied.

Universal Background Model and Adaptation

Consider a UBM with the following likelihood function for data O = \{o_1, \dots, o_t, \dots, o_T\}:

p(o_t \mid \lambda) = \sum_{c=1}^{C} \pi_c \, p(o_t \mid \mu_c, \Sigma_c), \qquad \lambda = \{\pi_c, \mu_c, \Sigma_c\}, \quad c = 1, \dots, C, \qquad (2.1)

where o_t is the acoustic vector at time t, \pi_c is the mixture weight of the cth mixture component, p(o_t \mid \mu_c, \Sigma_c) is a Gaussian probability density function with mean \mu_c and covariance matrix \Sigma_c, and C is the total number of Gaussian components in the mixture. The parameters of the UBM, \lambda, are estimated on a large amount of training data from speakers of different ages.
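Equation (2.1) is simply the likelihood of a GMM, so a UBM can be trained with any standard GMM implementation on acoustic features pooled over many speakers. A minimal sketch using scikit-learn (an assumed library choice; the thesis does not name its toolkit, and the toy sizes below are far smaller than a real UBM):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for pooled acoustic vectors o_t (rows = frames,
# columns = feature dimensions, e.g. 13 cepstral coefficients).
features = rng.normal(size=(2000, 13))

# UBM: a GMM with C components; C = 8 is a toy value
# (deployed systems use hundreds or thousands of components).
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=50, random_state=0).fit(features)

# lambda = {pi_c, mu_c, Sigma_c}: mixture weights, means, covariances
pi, mu, Sigma = ubm.weights_, ubm.means_, ubm.covariances_
```

The fitted parameters correspond term by term to λ in equation (2.1): `pi` sums to one, and each row of `mu` and `Sigma` is one component's mean and (diagonal) covariance.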

The i-vector Framework

One effective method for speaker age estimation involves adapting the UBM means to the speech characteristics of the utterance. The adapted GMM means are then extracted and concatenated to form Gaussian mean supervectors; this has been shown to provide a good level of performance [38, 19]. Recent progress in this field, however, has found an alternative way of modeling GMM mean supervectors that provides superior recognition performance [13]. Since the Gaussian components of the UBM are adapted independently of each other, some components are not updated in the case of limited training samples [56]. This problem can be alleviated by linking the Gaussian components together using the Joint Factor Analysis (JFA) framework [94].

In the JFA framework, each utterance is represented by a supervector M, which is a speaker- and channel-dependent vector of dimension (C·F), where C is the total number of mixture components in a feature space of dimension F. In JFA, it is assumed that M can be decomposed into two supervectors

M = s + c (2.2)

where s = u + Vy + Dz is a speaker-dependent supervector and c = Ux is a channel-dependent supervector. s and c are independent and normally distributed. u is the speaker- and channel-independent supervector, V defines a lower-dimensional speaker subspace, U defines a lower-dimensional channel subspace, and D defines a speaker subspace. y and z are factors in the speaker subspaces, and x is a channel-dependent factor in the channel subspace. The vectors x, y and z are random variables with standard normal distributions N(0, I) which are jointly estimated.

In the JFA framework, some information about speakers can be found in the channel factor, and this information can be utilized in speaker identification [32]. This fact resulted in a new utterance modeling approach, referred to as the i-vector framework or total variability modeling [33, 32], which comprises both speaker variability and channel variability. Channel compensation procedures such as within-class covariance normalization (WCCN) can be further applied to compensate for the residual channel effects in the speaker factor space [51].

The i-vector framework assumes that each utterance possesses a speaker- and channel-dependent GMM supervector whose mean, M, can be decomposed as

M = u + Tv, (2.3)

where u is the speaker- and channel-independent mean supervector of the UBM, T spans a low-dimensional subspace (400 dimensions in this work) and v contains the factors that best describe the utterance-dependent mean offset Tv. The vector v is treated as a hidden variable with a standard normal prior, and the i-vector is its maximum-a-posteriori (MAP) point estimate. For a sequence of L frames O = o1, o2, ..., oL, a UBM (θUBM) of C mixture components, and the centralized first-order Baum-Welch statistics given in equation 2.4, the i-vector for a given utterance is calculated using equation 2.5.

Fc = ∑_{t=1}^{L} P(c|ot, θUBM)(ot − mc)   (2.4)

v = (I + T′Σ⁻¹N(O)T)⁻¹ T′Σ⁻¹F(O)   (2.5)

where mc is the mean of the cth UBM mixture component, P(c|ot, θUBM) is the posterior probability of the cth mixture component, F(O) is a concatenation of all Fc, N(O) is a diagonal matrix of dimension (CF × CF) built from the zeroth-order Baum-Welch statistics, and Σ is a covariance matrix. The subspace matrix T is estimated via maximum likelihood on a large training dataset. An efficient procedure for training T and for MAP adaptation of the i-vectors can be found in [57].
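Equations 2.4 and 2.5 can be sketched as follows. This is an illustrative NumPy implementation assuming diagonal UBM covariances; the function name and argument layout are ours, not from any toolkit used in the thesis.

```python
import numpy as np

def extract_ivector(O, post, m, Sigma_diag, T):
    """MAP point estimate of the i-vector (equations 2.4 and 2.5).

    O         : (L, F) acoustic vectors
    post      : (L, C) frame posteriors P(c | o_t, theta_UBM)
    m         : (C, F) UBM component means
    Sigma_diag: (C, F) diagonal UBM covariances
    T         : (C*F, R) total variability matrix
    """
    C, F = m.shape
    R = T.shape[1]
    # zeroth-order stats N_c and centralized first-order stats F_c (eq. 2.4)
    N = post.sum(axis=0)                         # (C,)
    Fc = post.T @ O - N[:, None] * m             # (C, F)
    # v = (I + T' Sigma^-1 N(O) T)^-1 T' Sigma^-1 F(O)  (eq. 2.5)
    inv_sig = (1.0 / Sigma_diag).reshape(C * F)  # diagonal of Sigma^-1
    N_big = np.repeat(N, F)                      # diagonal of N(O)
    TtSi = T.T * inv_sig                         # T' Sigma^-1, shape (R, C*F)
    A = np.eye(R) + (TtSi * N_big) @ T           # I + T' Sigma^-1 N(O) T
    b = TtSi @ Fc.reshape(C * F)                 # T' Sigma^-1 F(O)
    return np.linalg.solve(A, b)
```

The supervector layout is component-major (all F dimensions of component 1, then component 2, and so on), matching the concatenation of the Fc vectors.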

The i-vector takes its name from "intermediate vector", referring to the intermediate representation between an acoustic feature vector and a supervector [32]. In the total variability modeling approach, the i-vectors are low-dimensional representations of an audio recording that can be used for classification and estimation purposes.

The NFA Framework

The NFA is a new framework for adaptation and decomposition of GMM weights based on a constrained factor analysis [7].

The basic assumption of this method is that, for a given utterance, the adapted GMM weight supervector can be decomposed as follows

w = b + Lr, (2.6)

where b is the UBM weight supervector (a 2048-dimensional vector in this study), L is a matrix of dimension (C × ρ) spanning a low-dimensional subspace, and r is a low-dimensional vector that best describes the utterance-dependent weight offset Lr.


In this framework, neither the subspace matrix L nor the subspace vector r is constrained to be non-negative. However, unlike in the i-vector framework, the factor analysis applied to estimate the subspace matrix L and the subspace vector r is constrained such that the adapted GMM weights are non-negative and sum to one. The procedure for calculating L and r involves a two-stage algorithm similar to Expectation-Maximization. In the first step, L is assumed to be known, and r is updated. Similarly, in the second step, r is assumed to be known and L is updated.

In the Expectation step, given an utterance O, a maximum likelihood estimate of the vector r is obtained by solving the following constrained optimization problem:

max_r ( γ′(O) log(b + Lr) )   (2.7)

subject to
h(b + Lr) = 1 : equality constraint
b + Lr > 0 : inequality constraint

where h is a row vector of dimension C with all elements equal to 1, γ(O) = ∑_t [γ1,t, ..., γC,t]′, and γc,t is the occupation count for class c and frame t. In the case of a square, full-rank L, this constrained optimization problem can be solved analytically as follows:

r(O) = L⁻¹ [ (1/τ) γ(O) − b ]   (2.8)

where τ is the total number of frames in the utterance.

However, since L is not square and full-rank, this constrained optimization problem has no analytical solution and should be solved using iterative optimization approaches. These methods are time-consuming for a large number of utterances. Relaxing the constraints and converting the constrained optimization into an unconstrained optimization problem decreases the computation time.

Since the UBM weights sum to 1, hb + hLr = 1, which results in hLr = 0. This constraint holds for any r if h is orthogonal to all columns of L. In the Maximization step, L is therefore calculated in a way that hL = 0 holds.

If any of the C inequality constraints in equation 2.7 is violated, the cost function cannot be evaluated. This violation can be prevented by controlling the step size of the maximization procedure. An exception occurs when an element of γ′(O) equals zero; substituting the zero components of γ′(O) with very small positive values eliminates this problem as well.

Once the constrained optimization problem has been converted into an unconstrained one, various optimization techniques, such as the gradient ascent algorithm, can be utilized to calculate the maximum likelihood estimate of r. The gradient ascent update has the following form:

ri = ri−1 + αE ∇f(ri−1)   (2.9)

∇f(r) = L′ [γ(O) / (b + Lr(O))]   (2.10)

where i is the index of the gradient ascent iterations, αE is the learning rate, ∇ denotes the gradient operator, and the division in equation 2.10 is element-wise.


If a Gaussian distribution is considered as the prior of r, the objective function in equation 2.7 and its gradient (given in equation 2.10) are modified to the following forms:

f(r) = γ′(O) log(b + Lr) − (1/(2δ²)) r′r   (2.11)

∇f(r) = L′ [γ(O) / (b + Lr(O))] − r/δ²   (2.12)

where δ is the standard deviation of the prior distribution, which forces r to have small elements. In order to keep w non-negative, a small value is selected for the variance of the Gaussian prior.
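The Expectation step described above — gradient ascent on the MAP objective of equations 2.11 and 2.12, with zero occupation counts replaced by small positive values and the step size reduced whenever an inequality constraint would be violated — can be sketched as follows. This is an illustrative NumPy sketch; the function name, default hyper-parameters and the simple step-halving rule are our assumptions.

```python
import numpy as np

def estimate_r(gamma, b, L_mat, delta=0.1, lr=1e-3, n_iter=200):
    """MAP estimate of the subspace vector r by gradient ascent
    (equations 2.9 and 2.12).

    gamma : (C,) occupation counts for one utterance
    b     : (C,) UBM weight supervector (positive, sums to 1)
    L_mat : (C, rho) weight subspace matrix satisfying h L = 0
    """
    gamma = np.maximum(gamma, 1e-10)       # replace zero counts by tiny values
    r = np.zeros(L_mat.shape[1])           # start at the UBM weights
    for _ in range(n_iter):
        grad = L_mat.T @ (gamma / (b + L_mat @ r)) - r / delta**2  # eq. 2.12
        step = lr * grad
        # control the step size: halve it until b + L(r + step) > 0 holds
        while np.any(b + L_mat @ (r + step) <= 0):
            step *= 0.5
        r = r + step                       # eq. 2.9
    return r
```

Because the columns of L sum to zero (hL = 0), the adapted weights b + Lr sum to one automatically, so only the positivity constraint needs to be guarded during the ascent.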

In the Maximization step, r is assumed to be known for all utterances in the training database. Thus, L can be calculated by solving the following constrained optimization problem:

max_L ( ∑_s γ′(O(s)) log[b + Lr(O(s))] )   (2.13)

subject to
h(b + Lr(O(s))) = 1 : equality constraint
b + Lr(O(s)) > 0 : inequality constraint

To solve this constrained optimization problem, iterative optimization techniques should be employed. As in the Expectation step, all equality constraints in equation 2.13 can be simplified to the single constraint hL = 0. By controlling the step size, violations of the inequality constraints can also be avoided.

Solving this optimization problem using the projected gradient algorithm [97] results in the following equations:

Li = Li−1 + αM P ∇f(Li−1)   (2.14)

∇f(L) = ∑_s [γ(O(s)) / (b + Lr(O(s)))] r′(O(s))   (2.15)

P = I − (1/C) h′h   (2.16)

where i is the index of the gradient ascent iterations, αM is the learning rate and I is an identity matrix of dimension C.
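A minimal sketch of one Maximization-step update (equations 2.14-2.16), assuming the per-utterance statistics from the Expectation step are given; the function name and the learning rate are illustrative choices.

```python
import numpy as np

def update_L(L_mat, gammas, rs, b, lr=1e-6):
    """One projected-gradient update of the weight subspace matrix L
    (equations 2.14-2.16).

    gammas: list of (C,) occupation-count vectors, one per training utterance
    rs    : list of (rho,) subspace vectors estimated in the E-step
    """
    C = L_mat.shape[0]
    grad = np.zeros_like(L_mat)
    for gamma, r in zip(gammas, rs):
        w = b + L_mat @ r
        grad += np.outer(gamma / w, r)        # eq. 2.15 (element-wise division)
    P = np.eye(C) - np.ones((C, C)) / C       # eq. 2.16: h is the all-ones row
    return L_mat + lr * (P @ grad)            # eq. 2.14: projection keeps hL = 0
```

The projection P removes the component of the gradient along h, so the column sums of L stay at zero and the equality constraints remain satisfied after every update.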

The subspace matrix L is estimated over a large training dataset. The obtained subspace vectors representing the utterances in the training and test datasets are used to estimate the age of the speakers in this chapter. This new low-dimensional utterance representation approach has been successfully applied to speaker and language/dialect recognition tasks [8].


2.2.3 Function Approximation

In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR), are employed for the i-vector-based and the NFA-based function approximations. In addition, an ANN is used to perform the score-level fusion.

Artificial Neural Networks

A multilayer perceptron (MLP) is a supervised, feed-forward neural network, widely applied to regression problems due to its ability to approximate complex nonlinear functions from input data [50, 54]. An MLP usually utilizes a derivative-based optimization algorithm, such as back-propagation, to train the network. Different training methods have been suggested during the last decades [50, 86, 46, 65] to enhance training speed, reduce memory requirements and improve convergence properties.

A feedforward neural network has a layered structure: an input layer, one or more hidden layers and an output layer. The input layer consists of sensory nodes, and the number of input-layer neurons equals the dimension of the data. The hidden layers consist of computational nodes. Since there is no general rule for calculating an appropriate number of hidden neurons, it should be selected by a trial-and-error procedure. The output layer calculates the outputs of the network.

The activation functions commonly used in feedforward neural networks are the logistic, hyperbolic tangent and linear functions. Selecting appropriate activation functions for the hidden layers as well as the output layer depends on the application. In function approximation problems, a linear function should be used as the activation function for the output layer, while any type of activation function can be chosen for the hidden layers. The concept and relations of MLPs are explained in more detail in Appendix A.
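The layered structure described above, with a logistic hidden layer and a linear output layer for regression, amounts to the following forward pass. This is an illustrative NumPy sketch; the thesis itself uses the Matlab Neural Network Toolbox.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a three-layer MLP for regression:
    input layer -> one hidden layer (logistic activation) -> linear output."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden activations (logistic)
    return W2 @ h + b2                          # linear output layer
```

A deeper (four-layer) network of the kind compared later in this chapter simply chains a second hidden layer before the linear output.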

In this study, numerous network architectures consisting of different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms are trained. The trained networks are then tested on the validation data, and based on the obtained results, the best network architectures are selected to be evaluated on the test data. The networks are implemented, trained and tested using the Matlab Neural Network Toolbox, version 6.0.2.

Least Squares Support Vector Regression

Support vector regression (SVR) is a function approximation approach developed as a regression version of the widely known Support Vector Machine (SVM) classifier [96]. Using nonlinear transformations, SVMs map the input data into a higher dimensional space in which a linear solution can be calculated. They also keep only a subset of the samples, the data most relevant to the solution, and discard the rest, which makes the solution as sparse as possible. While SVMs perform the classification task by determining the maximum-margin separating hyperplane between two classes, SVR carries out the regression task by finding the optimal regression hyperplane such that most training samples lie within an ε-margin around this hyperplane [96, 100].

In this study, we use the least squares version of support vector regression (LSSVR). While an SVR solves a quadratic program with linear inequality constraints, which results in high algorithmic complexity and memory requirements, an LSSVR involves solving a set of linear equations by considering equality constraints instead of the inequalities of classical SVR [100], which speeds up the calculations. This simplicity is achieved at the expense of a loss of sparseness: all samples contribute to the model, and consequently the model often becomes unnecessarily large. The concept and relations of the LSSVR are explained in more detail in Appendix B.

In this chapter, linear and radial basis function (RBF) kernels are used to approximate g(x). For the LSSVR with RBF kernels, K-fold cross-validation is used to tune the smoothing parameter of the kernels. The LSSVR models for training and testing are implemented using the LS-SVMlab 1.8 Toolbox [31, 100] in the Matlab environment.
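The LSSVR training step described above reduces to one linear system. The following sketch is an illustrative NumPy version with an RBF kernel, not the LS-SVMlab code used in the thesis; the function name and default hyper-parameters are ours.

```python
import numpy as np

def lssvr_fit(X, y, gamma=10.0, sigma=1.0):
    """Train an LSSVR with an RBF kernel by solving a single linear system
    (equality constraints replace the inequalities of classical SVR).

    Returns a predictor f(x) = sum_i alpha_i k(x_i, x) + b.
    """
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq / (2 * sigma**2))            # RBF kernel matrix
    # [[0, 1'], [1, K + I/gamma]] [b; alpha] = [0; y]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]

    def predict(Xq):
        sq_q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_q / (2 * sigma**2)) @ alpha + b
    return predict
```

Note that every training sample receives a nonzero coefficient alpha_i, reflecting the loss of sparseness mentioned above.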

2.2.4 Training and Testing

The proposed age estimation approach is depicted in Figure 2.1. As illustrated in the figure, each utterance of the training, development and test sets is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. During the training phase, the obtained i-vectors and NFA vectors of the training set are used as features, with their corresponding age labels, to train model-1 and model-2, respectively.

During the development phase, the trained models estimate the age of the utterances of the development set. To this end, the obtained i-vectors and NFA vectors of the development set are applied to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are then concatenated to form a two-dimensional vector of estimated ages. This vector, along with the corresponding age labels of the development set, is used to train model-3, which fuses the results.

Finally, during the testing phase, the trained models estimate the age of utterances of unseen speakers (the utterances of the test set). This is performed by applying the obtained i-vectors and NFA vectors of the test set to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are concatenated to form a two-dimensional vector of estimated ages, which is applied to the trained model-3 to estimate the age of the test utterances.
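The three-phase procedure above can be sketched as follows. A small ridge regressor stands in for the MLP/LSSVR models of the thesis, and all names are illustrative; the point is the data flow: model-1 and model-2 are trained on the training set, and model-3 is trained on their concatenated development-set predictions.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Tiny ridge regressor used as a stand-in for model-1/2/3."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])       # append bias column
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)
    return lambda Xq: np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ w

def train_fusion(iv_tr, nfa_tr, y_tr, iv_dev, nfa_dev, y_dev):
    """Score-level fusion: model-1 on i-vectors, model-2 on NFA vectors,
    model-3 trained on their concatenated development-set predictions."""
    model1 = ridge_fit(iv_tr, y_tr)                     # i-vector-based estimator
    model2 = ridge_fit(nfa_tr, y_tr)                    # NFA-based estimator
    scores_dev = np.column_stack([model1(iv_dev), model2(nfa_dev)])
    model3 = ridge_fit(scores_dev, y_dev)               # fusion model

    def estimate(iv_tst, nfa_tst):
        scores = np.column_stack([model1(iv_tst), model2(nfa_tst)])
        return model3(scores)
    return estimate
```

Training the fusion model on the development set, rather than on the training set, avoids fusing over-optimistic in-sample scores.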

2.3 Experimental Setup

2.3.1 Corpus

The National Institute of Standards and Technology (NIST) has held annual or biannual speaker recognition evaluations (SRE) for the past two decades. With each SRE, a large corpus of telephone (and, more recently, microphone) conversations is released along with an evaluation protocol. These conversations typically last 5 minutes and originate from a large number of participants for whom additional metadata is recorded, including age, height, weight, language and smoking habits. The NIST databases were chosen for this work due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2).

For the purpose of automatic speaker age estimation, telephone recordings from the common protocols of the recent NIST 2008 and 2010 SRE databases are used. The speakers of the NIST 2008 and 2010 SRE databases are pooled together to create a dataset of 1445 speakers. They are then divided into two disjoint parts such that 80% and 20% of all speakers are assigned to the training and testing sets, respectively.


Figure 2.2: The age histograms of the telephone speech utterances in the training and testing datasets for male and female speakers. (Four panels: Training Set / MALE, Testing Set / MALE, Training Set / FEMALE, Testing Set / FEMALE; horizontal axis: Age, vertical axis: Number of Utterances.)

The age histograms of the training and testing datasets for the male and female target speakers are depicted in Fig. 2.2.

The training set is further divided into two disjoint parts such that 20% of the training data is considered as the development set. Since there are several utterances from each speaker in the data set, the development set was selected such that no speaker had utterances in both the training and development sets. Thus, of all 6080 utterances, 3194 are assigned to the training set, 1692 to the development set, and 1194 to the testing set.
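A speaker-disjoint split of this kind can be sketched as follows; the code is illustrative, and the speaker-ID grouping and names are our assumptions.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utterances, dev_fraction=0.2, seed=0):
    """Split a list of (speaker_id, utterance) pairs into training and
    development sets with no speaker shared between the two."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)          # reproducible shuffle
    n_dev = int(len(speakers) * dev_fraction)
    dev_spk = set(speakers[:n_dev])                # whole speakers go to dev
    train = [(s, u) for s, u in utterances if s not in dev_spk]
    dev = [(s, u) for s, u in utterances if s in dev_spk]
    return train, dev
```

Splitting by speaker rather than by utterance prevents the development scores from being inflated by speaker-specific cues leaking across the split.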

2.3.2 Performance Metric

In order to evaluate the effectiveness of the proposed system, the mean-absolute-error (MAE) of the estimated age and the Pearson correlation coefficient (CC) between the actual and estimated speakers' ages are used. MAE is defined as:

MAE = (1/N) ∑_{i=1}^{N} |fi − yi|   (2.17)

where fi is the ith estimated age, yi is the ith actual age, and N is the total number of test samples.

Although MAE is a helpful performance metric in regression problems, it is limited in some respects, especially in the case of a test set with a skewed distribution. Therefore, we also use the correlation coefficient, which is computed as:

CC = (1/(N − 1)) ∑_{i=1}^{N} ((fi − f̄)/sf)((yi − ȳ)/sy),   (2.18)

where f̄ and sf denote the sample mean and standard deviation of the estimated ages, and ȳ and sy those of the actual ages.
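The two metrics follow directly from equations 2.17 and 2.18; the following is an illustrative NumPy sketch.

```python
import numpy as np

def mae(f, y):
    """Mean absolute error between estimated (f) and actual (y) ages (eq. 2.17)."""
    return np.mean(np.abs(f - y))

def pearson_cc(f, y):
    """Pearson correlation coefficient with the 1/(N-1) normalization (eq. 2.18)."""
    n = len(f)
    zf = (f - f.mean()) / f.std(ddof=1)   # standardized estimated ages
    zy = (y - y.mean()) / y.std(ddof=1)   # standardized actual ages
    return np.sum(zf * zy) / (n - 1)
```

Because CC is invariant to shifts and rescalings of the estimates, it complements MAE on skewed test sets: a systematically biased but well-ordered estimator keeps a high CC while its MAE degrades.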

2.4 Results and discussion

In this section, the proposed speaker age estimation approach is evaluated. The acoustic feature vector is a 60-dimensional vector consisting of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained using the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Wiener filtering, feature warping [80] and voice activity detection [68] have also been applied in the front-end processing to obtain more reliable features.

In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least squares support vector regression (LSSVR), were used for model-1 and model-2. In each experiment, the same function approximation approach was used for both models. In other words, when an MLP (as model-1) was trained with the i-vectors, an MLP with a similar architecture was also trained with the NFA vectors as model-2. Likewise, when an LSSVR was employed as model-1, a similar LSSVR was also used as model-2.

The artificial neural networks used in model-1 and model-2 were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and a variety of activation functions. Based on the results obtained on the development set, the best network architecture was then selected for further experiments.

Since the experiments for the male and female speakers were performed separately, two different network architectures, namely a three-layer NN and a four-layer NN, were employed for each gender.

For male speakers, the three-layer NN consisted of 200 hidden neurons, and the four-layer NN was composed of 400 neurons in the first hidden layer and 200 neurons in the second hidden layer. For female speakers, the three-layer NN had 150 hidden neurons, and the four-layer NN was composed of 100 neurons in the first hidden layer and 400 neurons in the second hidden layer. The preferred activation function for the hidden layers was the logistic sigmoid function, and, in order to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the "scaled conjugate gradient" and "one step secant back-propagation" algorithms were applied for the networks related to males and females, respectively. The networks were trained to minimize the mean-absolute-error between the desired and estimated outputs. To attenuate the effect of random initialization, each experiment was repeated 10 times, and the most frequently observed result was reported.

Table 2.1: The results of speaker age estimation using different utterance modeling methods (the i-vector and the NFA frameworks) and different function approximation techniques (LSSVR and MLPs). CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age, in years.

Function                    Male                          Female
Approximation     i-vector       NFA           i-vector       NFA
Method            CC    MAE      CC    MAE     CC    MAE      CC    MAE
LSSVR (RBF)       0.68  8.73     0.41  10.71   0.77  7.68     0.62  9.55
LSSVR (Linear)    0.71  8.52     0.44  10.79   0.79  7.57     0.63  9.90
Three-Layer NN    0.71  8.52     0.38  11.05   0.79  6.10     0.68  8.16
Four-Layer NN     0.73  8.21     0.38  10.70   0.82  7.27     0.54  10.14

∗ The bold numbers in the table indicate the best results.

The other function approximation approach used for model-1 and model-2 was LSSVR. Two different kernels, namely linear and radial basis function (RBF) kernels, were used. The hyper-parameters of the RBF kernel were tuned using 10-fold cross-validation. After optimization of the hyper-parameters, the models were trained separately using the i-vectors and the NFA vectors.

The mean-absolute-error (MAE) between the estimated and actual ages, along with the Pearson correlation coefficient (CC) between them, are presented in Table 2.1. This table lists the results of using LSSVR and ANNs as function approximation methods for model-1 and model-2 before the score-level fusion. As the table shows, for this problem, the LSSVR with a linear kernel outperforms the LSSVR with an RBF kernel. It also shows that using a four-layer NN for both model-1 and model-2 yields more accurate results than using an LSSVR or a three-layer NN.

It can also be inferred from the table that the i-vector framework, which is based on the Gaussian means, is more accurate in estimating age than the NFA framework, which is based on the Gaussian weights.

Experimental studies show that although GMM weights convey less information than GMM means, the two are complementary. For instance, Li et al. improved speaker age group recognition performance by performing score-level fusion of classifiers based on GMM weights and GMM means [62]. Zang et al. applied GMM weight adaptation in conjunction with GMM mean adaptation in a large vocabulary speech recognition system to improve the word error rate [115]. In [11], feature-level fusion of i-vectors, GMM mean supervectors and GMM weight supervectors was applied to improve the accuracy of accent recognition. Therefore, in this study, score-level fusion of the i-vector-based and the NFA-based estimators was applied to enhance the accuracy of age estimation.

Table 2.2: The results of the proposed speaker age estimation after score-level fusion, along with the relative improvements (R.I.) in CC and MAE compared with the results of the i-vector-based models. CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age, in years.

Function                 Male                            Female
Approximation     CC    R.I.   MAE   R.I.      CC    R.I.   MAE   R.I.
LSSVR (RBF)       0.71  4.2%   7.86  9.9%      0.82  6.1%   6.30  17.9%
LSSVR (Linear)    0.76  6.6%   6.97  18.2%     0.85  7.1%   5.92  21.8%
Three-Layer NN    0.74  4.1%   7.21  15.4%     0.82  3.6%   6.30  -3.2%
Four-Layer NN     0.75  2.7%   7.19  12.4%     0.85  3.5%   6.12  15.8%
∗ The bold numbers in the table indicate the best results.

The fusion procedure, as described in Section 2.2.4, was performed by training a three-layer NN on the outputs of model-1 (the i-vector-based model) and model-2 (the NFA-based model) on the development dataset. The architecture of the fusion network (model-3) consisted of 5 hidden neurons with a logistic activation function in the hidden layer, and one linear neuron in the output layer. The training algorithm used to train the network was "gradient descent with momentum and adaptive learning rate back-propagation".

Table 2.2 presents the results of speaker age estimation after score-level fusion, along with the relative improvements in CC and MAE compared with the results of the i-vector-based models. The results show that the score-level fusion of the i-vector-based and the NFA-based estimators improves the accuracy of automatic speaker age estimation. This improvement was most considerable when the outputs of the linear LSSVR models were fused. When the male and female data were pooled together, the correlation coefficient was 0.82.

The results of speaker age estimation reported in [9] are considered as the baseline, since they were obtained using the same databases as this study. The minimum MAE of the baseline system for male and female speakers was 7.63 and 7.61 years, respectively. The proposed method improved the baseline MAE for males and females by 8.6% and 22.2%, respectively.

2.5 Conclusion

In this chapter, a novel approach for speaker age estimation based on a hybrid architecture of the i-vector and the NFA frameworks was proposed. This architecture consisted of two subsystems based on the i-vectors and the NFA vectors. To perform the age estimation, two different function approximation approaches, namely LSSVR and ANNs, were used and compared. The score-level fusion of the i-vector-based and the NFA-based estimators was also considered to improve the performance.

The effectiveness of the proposed method was investigated on spontaneous telephone speech signals from the NIST 2008 and 2010 SRE corpora. The obtained results demonstrated that employing the information in the GMM weights in conjunction with the information in the GMM means, in the form of score-level fusion of the i-vector-based and the NFA-based estimators, not only decreased the mean-absolute-error between the actual and estimated ages but also improved the Pearson correlation coefficient between them, compared with the i-vector framework alone. The relative improvements in CC after score-level fusion for males and females were 6.6% and 7.1%, respectively.

The proposed method also improved the MAE of the baseline (obtained on the same databases) for males and females by 8.6% and 22.2%, respectively, which reflects the effectiveness of the proposed method for automatic speaker age estimation.


Chapter 3

Automatic Speaker's Height Estimation from Spontaneous Telephone Speech

3.1 Introduction

Speaker body size (height/weight) estimation is an interesting, important and challenging task in forensic and medical as well as commercial applications. In forensic scenarios, estimating a suspect's body size can direct investigations to find cues in judicial cases. In service customization, body size estimation may help users to receive services proportional to their physical condition. This wide range of applications has motivated researchers to look for acoustic cues that are beneficial for speaker body size estimation. In this chapter, we focus on speaker height estimation; the next chapter is devoted to speaker weight estimation.

Experimental studies have found different acoustic cues for speaker height estimation [104, 48]. However, the relation of these acoustic cues to speaker height is usually complex and affected by many other factors, such as speech content, language, gender, weight, emotional condition, and smoking and drinking habits. Furthermore, in many practical cases we have no control over the available speech duration, content, language, environment, recording device and channel conditions. Therefore, height estimation from speech signals is a very challenging task.

Previous studies have investigated the correlation between a person's speech signal and his/her height. In experiments conducted by Van Dommelen and Moxness, the ability of listeners to estimate the height of speakers from their voice was examined, and significant correlations between the estimated and actual heights of male speakers were reported [104]. In studies on speech-driven automatic height estimation, considerable effort has been devoted to identifying acoustic features of speech that convey information about speaker height. For example, [104] and [48] analyzed the correlation between speaker height and formant frequencies, based on the assumption from speech production theory that there is a correlation between a person's vocal tract length (VTL) and his/her height. Recently, Arsikere et al. proposed a new algorithm, based on the assumption of a uniform tube model of the subglottal system, to estimate speakers' heights from the subglottal resonances (SGRs) [4, 5]. In other studies, Pellom and Hansen performed height group recognition by applying Mel-frequency cepstral coefficients (MFCCs) to train a height-dependent Gaussian mixture model; a maximum a posteriori classification rule was then used to assign each audio file to one of several height groups [81]. However, this text-independent approach does not estimate the actual height of a speaker, which can be achieved by using regression techniques. Ganchev et al. applied a large set of openSMILE audio descriptors and performed support vector regression to estimate the height of a test speaker [44].

In this chapter, a new speech-based method for automatic height estimation based on i-vectors, instead of the raw acoustic features utilized in previous studies, is proposed [83]. One effective approach to speaker recognition involves modeling speech utterances with GMM mean supervectors [25]. Although GMM mean supervectors are effective, it is difficult to obtain a robust model when limited data are available, due to the high-dimensional nature of these vectors. In the field of speaker recognition, recent advances using the i-vector framework [33] have increased classification accuracy considerably. The i-vector is a compact representation of an utterance in the form of a low-dimensional feature vector.

To select an accurate regression approach for this problem, two different function approximation approaches, namely LSSVR and ANNs, are compared. The effect of the kernel in LSSVR is also investigated. Evaluation on the NIST 2008 and 2010 SRE corpora shows the effectiveness of the proposed approach.

The rest of the chapter is organized as follows. In Section 3.2, the problem of automatic height estimation is formulated and the proposed approach is described. Section 3.3 explains the experimental setup. The evaluation results are presented and discussed in Section 3.4. The chapter ends with conclusions in Section 3.5.

3.2 System Description

3.2.1 Problem Formulation

In the speaker height estimation problem, we are given a set of training data D = {(xi, yi)}, i = 1, ..., N, where xi ∈ Rp denotes the ith utterance and yi ∈ R denotes the corresponding height. The goal is to design an estimation function g such that, for an utterance of an unseen speaker xtst, the estimated height ŷ = g(xtst) approximates the actual height as well as possible in some predefined sense.

3.2.2 Utterance Modeling

The first step in speaker height estimation is converting variable-duration speech signals into fixed-dimensional vectors suitable for regression algorithms. This is performed by fitting a GMM to the acoustic features extracted from each speech signal; the parameters of the obtained GMM characterize the corresponding utterance.


Due to limited data, we are not able to accurately fit a separate GMM for a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, parametric utterance adaptation techniques are applied to adapt a UBM to the characteristics of the utterances in the training and testing databases. In this chapter, the i-vector framework is applied to adapt the UBM means. The UBM and the method of UBM mean adaptation using the i-vector framework are explained in Chapter 2, Section 2.2.2.

3.2.3 Function Approximation

In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR), are employed; both are described in Chapter 2, Section 2.2.3.

3.2.4 Training and Testing

The proposed height estimation approach is depicted in Figure 3.1. During the training phase, each utterance is mapped onto a fixed-dimensional vector using the i-vector framework described in Section 2.2.2. The obtained vectors of the training set, together with their corresponding height labels, are then used as features to train an estimator approximating the function g.

During the testing phase, the same utterance modeling approach is applied to extract a vector from an unseen test utterance, and the estimated height is obtained using the trained regression function.

3.3 Experimental Setup

3.3.1 Database

For this work, the NIST SRE databases were chosen due to their large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2).

For the purpose of height estimation, telephone recordings from the common protocols of the recent NIST 2008 and 2010 SRE databases are used for training and testing, respectively. The core protocol, short2-short3, of the 2008 database contains 3999 telephone recordings of 1236 speakers whose height is known. Similarly, the extended core-core protocol of the 2010 database contains 5792 telephone segments from 445 speakers. The height histograms of the male and female target speakers of the NIST 2008 and 2010 SRE databases are depicted in Figure 3.2. The training set is also divided into two disjoint parts, such that 25% of the training data are considered as the development set. Since there are several utterances from each speaker in the data set, the development set was selected such that no speaker had utterances in both the training and development sets.

3. Automatic Speaker's Height Estimation from Spontaneous Telephone Speech

Figure 3.1: Block diagram of the proposed speaker height estimation approach in training and testing phases.

3.3.2 Performance Metric

In order to evaluate the effectiveness of the proposed system, the mean absolute error (MAE) of the speakers' estimated height and the Pearson correlation coefficient (CC) between the actual and estimated speakers' height are used; both are described in Chapter 2, Section 2.3.2.

In some of the literature on the estimation of speakers' body size, researchers evaluate the performance of the systems by means of the mean absolute error between the actual and the estimated values. Although the MAE is a helpful performance metric in regression problems, it is limited in some respects, especially in the case of a test set with a skewed distribution, which is the case in this problem. For instance, consider the most basic estimator, whose output is the average height of the training data. When a test set with a skewed distribution is applied to this basic estimator, the mean absolute error might still be in an acceptable range, depending on the variance of the data; however, the CC would be zero. For this reason, the Pearson correlation coefficient is the preferred performance metric in this problem.
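This argument can be made concrete numerically. The sketch below uses synthetic heights (the numbers are illustrative, not from the thesis data): the constant mean predictor attains a modest MAE even though it conveys nothing about individual speakers, while only a predictor that actually tracks the target earns a meaningful Pearson CC.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.normal(175.0, 7.0, size=1000)       # synthetic heights (cm)

# Basic estimator: always predict the mean height of the data.
mean_pred = np.full_like(actual, actual.mean())
mae_mean = np.mean(np.abs(actual - mean_pred))   # modest-looking MAE

# An informative estimator that actually tracks the target.
informative = actual + rng.normal(0.0, 5.0, size=1000)
mae_inf = np.mean(np.abs(actual - informative))
cc_inf = np.corrcoef(actual, informative)[0, 1]  # clearly positive

# The constant predictor has zero variance, so its Pearson CC is
# degenerate (0/0) and conveys no correlation at all.
```

The constant predictor's MAE is roughly 0.8 times the standard deviation of the heights, which can look acceptable, yet its CC carries no information; the informative predictor is separated from it only by the correlation metric.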


Figure 3.2: The height histogram of telephone speech utterances for the NIST 2008 and NIST 2010 databases.

3.4 Results and discussion

In this section, the proposed speaker height estimation approach is evaluated. The acoustic feature vector used in this study is a 60-dimensional vector consisting of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained using the cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Voice activity detection [68], feature warping [80] and Wiener filtering have also been applied in the front-end processing to obtain more reliable features.
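The first- and second-order derivatives appended to the static MFCCs are conventionally computed with a regression formula over a small frame window, d_t = Σ_{n=1..N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1..N} n²). The sketch below implements that standard formula in NumPy; the window size N = 2 and the random stand-in MFCC matrix are illustrative assumptions, not values stated in the thesis.

```python
import numpy as np

def deltas(c, N=2):
    """Regression-based derivatives of cepstral features.
    c: (num_frames, num_coeffs) array; edges padded by repetition."""
    T = c.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(c, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(c, dtype=float)
    for n in range(1, N + 1):
        # c_{t+n} - c_{t-n}, weighted by n, summed over the window
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom

# Hypothetical 20-dim static MFCCs for 100 frames -> 60-dim features.
mfcc = np.random.default_rng(2).normal(size=(100, 20))
feat = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```

A quick sanity check of the formula: for coefficients that grow linearly in time, the computed delta equals the slope away from the padded edges.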

In this study, ANNs and LSSVR were utilized as function approximation techniques. ANNs were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and different activation functions. Then, based on the results obtained on the development set, the best network architecture was selected for evaluation on the test data.


Table 3.1: Results of speaker height estimation using ANNs and LSSVR. CC is the Pearson correlation coefficient between actual and estimated height.

Function Approximation      CC (Male)   CC (Female)
LSSVR (RBF kernel)            0.30         0.23
LSSVR (Linear kernel)         0.41         0.40
Three-Layer NN                0.35         0.36
Four-Layer NN                 0.36         0.35

∗ The bold numbers in the table indicate the best results.

Accordingly, two network architectures, namely a three-layer NN and a four-layer NN, were employed. For the three-layer NN, 10 hidden neurons were used; for the four-layer NN, 20 neurons in the first hidden layer and 5 neurons in the second hidden layer were selected. The activation function for the hidden layers was the logistic sigmoid function, and, in order to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the "BFGS quasi-Newton backpropagation" algorithm was employed. To attenuate the effect of random initialization, the training and testing phases of each experiment were repeated 20 times. Networks were trained to minimize the mean absolute error between the desired and estimated outputs. In this study, the networks were implemented, trained and tested using the Matlab Neural Network Toolbox, version 6.0.2.

In this chapter, in order to investigate the effect of kernels in LSSVR, two different kernels, namely linear and radial basis function (RBF) kernels, have been used. The hyper-parameters of the RBF kernel were tuned using 5-fold cross-validation; after optimization of the hyper-parameters, the model was trained. The LSSVR models for training and testing were implemented using the LS-SVMlab 1.8 Toolbox [31, 100] in the Matlab environment.
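The cross-validated tuning step can be illustrated with kernel ridge regression, which is closely related to LSSVR (both solve a regularized least-squares problem in the kernel feature space). The grid values and the synthetic data below are illustrative assumptions, not the thesis's actual search space.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)

# 5-fold CV over the regularization strength and RBF bandwidth,
# mirroring the hyper-parameter tuning described above.
grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0],
        "gamma": [1e-2, 1e-1, 1.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
search.fit(X, y)
best = search.best_estimator_       # retrained on the full training set
```

After the search, GridSearchCV refits the best configuration on all the training data, matching the "tune, then train" procedure in the text.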

The results of automatic height estimation using LSSVR and ANNs as function approximation methods for male and female speakers are reported in Table 3.1. The obtained results indicate that LSSVR with a linear kernel outperforms ANNs in speaker height estimation. It can also be inferred from the results that the linear kernel is more effective than the RBF kernel in this problem. In this case, the correlation coefficients for male speakers, for female speakers, and for the pooled male and female data are 0.41, 0.40 and 0.59, respectively. The scatter plots of the estimates for male speakers, female speakers, and the pooled male and female data are shown in Figure 3.3(a), Figure 3.3(b) and Figure 3.3(c), respectively.

The mean absolute error (MAE) of estimation is 6.2 cm and 5.8 cm for male and female speakers, respectively. As stated before, the MAE is limited in some respects, particularly for a test set with a skewed distribution, which is the case in this height estimation task. This limitation is highlighted by considering a basic estimator whose output is the average height of the training data. When the test set was applied to this estimator, the MAE for male and female speakers was 6.54 cm and 5.9 cm, respectively, while the measured CC for both males and females was equal to zero. For this reason, the correlation coefficient is the preferred performance metric in this problem, as it reflects the performance of the estimators in a more tangible way.

Although the obtained MAE is satisfactory and the correlation coefficient is fairly strong when the male and female data are pooled together, the CC within the male and female speaker groups requires improvement. Unfortunately, there are no published results on the same database for comparison purposes. However, the results of published papers on other datasets indicate the typical range of performance in the automatic speaker height estimation problem. In [5], the reported CCs of speaker height estimation on the TIMIT database, using a method based on sub-glottal resonances [4], are 0.12, 0.21 and 0.71 for male speakers, female speakers, and the pooled male and female data, respectively. In [81], the CCs obtained for the male and female speakers of the TIMIT database using a GMM-based approach are 0.39 and 0.31, respectively. The obtained results seem reasonable, considering that the test data in this study consist of spontaneous telephone speech signals and that the number of speakers in this study (3999 telephone recordings of 1236 speakers) is considerably larger than in [5] and [81].

3.5 Conclusion

In this chapter, a novel approach for automatic speaker height estimation based on the i-vector framework was proposed. In this method, each utterance was modeled by its corresponding i-vector. Then, ANNs and LSSVR were employed to estimate the height of a speaker from a given utterance. The proposed method was trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE corpora, respectively. Evaluation results showed the effectiveness of the proposed method in speaker height estimation.


(a) Male speakers, R = 0.40846. (b) Female speakers, R = 0.39379. (c) Male and female data pooled, R = 0.59371.

Figure 3.3: The scatter plot of height estimation for (a) male speakers, (b) female speakers, and (c) the male and female data pooled together.


Chapter 4

Automatic Speaker's Weight Estimation from Spontaneous Telephone Speech

4.1 Introduction

The acoustic features of speakers have been postulated to convey information about speakers, such as age, gender, body size and emotional state. Estimation of speaker height was investigated in the previous chapter; in this chapter, we focus on speaker weight estimation. Automatic speaker weight estimation is one aspect of speaker profiling systems, with a wide range of forensic, medical and commercial applications in real-world scenarios.

Since the size of various components of the sound production system, such as the vocal folds and the vocal tract, may be related to the overall weight or height of a speaker, researchers in the field of speaker recognition have been motivated to investigate whether some features of the acoustic signal may provide cues to the body size of the speaker. For instance, the authors in [41, 76] found a relationship between formants and the length of the vocal tract, based on the source-filter theory. Thus, since the vocal tract is a part of the speaker's body, this feature can be used to estimate the body size of a speaker [61]. The mean fundamental frequency (f0) of the voice has also been reported as a feature that correlates (negatively) with body size; that is, females and children have a higher f0, while males (who are taller and heavier) have a lower one [30, 72].

However, estimating speaker weight from the voice pattern is not a straightforward problem. The complexity of the issue becomes more evident when observing the results of studies on the relationship between various features of the acoustic signal and body size. For instance, when the relation between fundamental frequency (f0) and body size was investigated within male and within female speakers, no correlation was found between f0 and the body size of adult humans [60, 59, 29, 103]. The lowest fundamental frequency of the voice (F0min) is another feature, determined by the mass and length of the vocal folds [30]. Investigating this feature, researchers have likewise found no correlation between F0min and body size in adult human speakers [60, 59, 29, 103].

Fitch found formant dispersion (the averaged difference between adjacent pairs of formant frequencies) to be a reliable feature that correlates with both vocal tract length and body size in macaques [42]. However, a weak relation between formant parameters and the body size of human adults is reported in a study conducted by Gonzalez [48]. The reason for this weak correlation is that, in humans, the vocal folds grow independently of the rest of the head and body at puberty. This issue is more evident in males than in females [47, 78].

Gonzalez studied the correlation between formant frequencies and body size in human adults [48]. He calculated the formant parameters by means of a long-term average analysis of running speech signals uttered by 91 speakers. In this experiment, the Pearson correlation coefficients between formants and weight for male and female speakers were reported to be 0.33 and 0.34, respectively [48].

In research conducted by Van Dommelen and Moxness [104], the ability to judge speakers' weight from their speech samples was investigated. They reported a significant correlation between estimated weight (judged by listeners) and actual weight for male speakers only. In addition, they performed a regression analysis involving several acoustic features, such as f0, formant frequencies, energy below 1 kHz, and speech rate. The results showed that the only parameter with a significant correlation with male speakers' weight was the speech rate. They concluded that the speech rate of male speakers is a reliable predictor for weight estimation.

Modeling speech utterances with GMM mean supervectors has been demonstrated to be an effective approach to speaker recognition and has attracted the attention of researchers [25]. However, GMM mean supervectors are high-dimensional vectors, and obtaining a reliable model is difficult when limited data are available.

Recently, utterance modeling using the i-vector framework has considerably increased the accuracy of classification and regression in the field of speaker profiling [33, 83, 9]. The i-vector represents an utterance as a compact, low-dimensional feature vector. In this chapter, a novel approach for automatic speaker weight estimation based on i-vectors, instead of raw acoustic features, is proposed. In the proposed method, each utterance is modeled by its corresponding i-vector. To select an accurate regression approach for this problem, two different function approximation approaches, namely LSSVR and ANNs, are compared. The effect of the kernel in LSSVR on the speaker weight estimation problem is also investigated in this study. The proposed method is evaluated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach.

The rest of the chapter is organized as follows. The problem of automatic weight estimation is formulated and the proposed approach is described in Section 4.2. Section 4.3 explains the experimental setup. The evaluation results are presented and discussed in Section 4.4. The chapter ends with conclusions in Section 4.5.


4.2 System Description

In this section, the problem formulation and the main constituents of the proposed method are described.

4.2.1 Problem Formulation

In the speaker weight estimation problem, we are given a set of training data D = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^p denotes the i-th utterance and y_i ∈ R denotes the corresponding weight. The goal is to design an estimation function g such that, for an utterance of an unseen speaker x_tst, the estimated weight ŷ = g(x_tst) approximates the actual weight as well as possible in some predefined sense.

4.2.2 Utterance Modeling

By fitting a GMM to the acoustic features extracted from each speech signal, variable-duration speech signals are converted into fixed-dimensional vectors suitable for regression algorithms. The parameters of the obtained GMMs characterize the corresponding utterance.

Due to limited data, we are not able to accurately fit a separate GMM for a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, for adapting a UBM to the characteristics of the utterances in the training and testing databases, parametric utterance adaptation techniques are applied. In this chapter, the i-vector framework is applied to adapt the UBM means. The UBM and the methods of UBM mean adaptation using the i-vector framework are explained in Chapter 2, Section 2.2.2.

4.2.3 Function Approximation

In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR), are employed; both are described in Chapter 2, Section 2.2.3.

4.2.4 Training and Testing

The proposed weight estimation approach is depicted in Fig. 4.1. During the training phase, each utterance is mapped onto a fixed-dimensional vector using the i-vector framework described in Section 2.2.2. The obtained vectors of the training set, together with their corresponding weight labels, are then used as features to train an estimator approximating the function g.

During the testing phase, i-vector extraction is applied to extract a vector from an unseen test utterance, and the estimated weight is obtained using the trained regression function.



Figure 4.1: Block diagram of the proposed speaker weight estimation approach in training and testing phases.

4.3 Experimental Setup

4.3.1 Database

Since the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) databases contain a large number of speakers, and because the total variability subspace requires a large amount of development data for training, the NIST SRE databases were selected for this study. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings, selected from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2).

For the purpose of automatic speaker weight estimation, telephone recordings from the common protocols of the recent NIST 2008 and 2010 SRE databases are used. The speakers of the NIST 2008 and 2010 SRE databases are pooled together to create a dataset of 1445 speakers. They are then divided into two disjoint parts, such that 80% and 20% of all speakers are used for the training and testing sets, respectively. The weight histograms of the training and testing datasets for the male and female target speakers are depicted in Fig. 4.2.

The training set is also divided into two disjoint sets, such that 20% of the training data are considered as the development set, none of which is used for training. In addition, since there are several utterances from each speaker in the data set, the development set was selected such that no speaker had utterances in both the training and development sets. Thus, of all 6080 utterances, 3194 are used for the training set, 1692 for the development set, and 1194 for the testing set.
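A speaker-disjoint split like the one described above can be enforced by grouping utterances by speaker label before splitting. The sketch below uses scikit-learn's GroupShuffleSplit on hypothetical speaker IDs and random stand-in features; the sizes are illustrative, not the thesis's counts.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(5)
n_utt = 1000
speaker_ids = rng.integers(0, 200, size=n_utt)   # hypothetical speaker labels
X = rng.normal(size=(n_utt, 40))                 # stand-in utterance vectors

# Hold out ~20% of the *speakers* (not utterances) as development data.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, dev_idx = next(splitter.split(X, groups=speaker_ids))

# No speaker appears on both sides of the split.
overlap = set(speaker_ids[train_idx]) & set(speaker_ids[dev_idx])
```

Splitting by speaker rather than by utterance is what prevents the estimator from exploiting speaker identity when it is evaluated on the development set.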



Figure 4.2: The weight histograms of the telephone speech utterances of the training and testing datasets for male and female speakers.

4.3.2 Performance Metric

In order to evaluate the effectiveness of the proposed system, we used the Pearson correlation coefficient (CC) between the actual and estimated weights, and the mean absolute error (MAE); both are described in Chapter 2, Section 2.3.2.

The MAE is a helpful performance metric in regression problems. However, it is limited in some respects, especially in the case of a test set with a skewed distribution, which is the case in this problem. For instance, consider the most basic estimator, whose output is the average weight of the training data. When a test set with a skewed distribution is applied to this basic estimator, the mean absolute error might still be in an acceptable range, depending on the variance of the data; however, the CC would be zero. For this reason, the Pearson correlation coefficient is the preferred performance metric in this problem.


4.4 Results and discussion

In this section, the proposed speaker weight estimation approach is evaluated. The acoustic features consist of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives, forming a 60-dimensional acoustic feature vector. This type of feature is very common in i-vector-based speaker recognition systems. To obtain more reliable features, Wiener filtering, speech activity detection [68] and feature warping [80] have been applied in the front-end processing.

In this study, ANNs and LSSVR were utilized as function approximation techniques. ANNs were trained using different numbers of hidden layers and hidden neurons, as well as various learning algorithms and activation functions. Then, based on the results obtained on the development set, the best network architecture was selected for evaluation on the test data.

Since the experiments for the male and female speakers were performed separately, two different network architectures, namely three-layer and four-layer neural networks, were employed for each gender. For male speakers, the three-layer NN consisted of 450 hidden neurons, and the four-layer NN was composed of 250 neurons in the first hidden layer and 400 neurons in the second hidden layer. For female speakers, the three-layer NN had 400 hidden neurons, and the four-layer NN was composed of 250 neurons in the first hidden layer and 400 neurons in the second hidden layer. The activation function of the hidden layers was the logistic sigmoid function, and, to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the "scaled conjugate gradient" and "gradient descent with adaptive learning rate backpropagation" algorithms were employed for male and female weight estimation, respectively. Networks were trained to minimize the mean absolute error between the desired and estimated outputs. To attenuate the effect of random initialization, each experiment was repeated 10 times and the most frequently observed result was reported. The networks were implemented, trained and tested using the Matlab Neural Network Toolbox, version 6.0.2.

In this study, in order to approximate the function g with LSSVR, two different kernels, namely linear and radial basis function (RBF) kernels, were used. The hyper-parameters of the RBF kernel were tuned using 10-fold cross-validation; after optimization of the hyper-parameters, the model was trained. The LSSVR models for training and testing were implemented using the LS-SVMlab 1.8 Toolbox [31, 100] in the Matlab environment.

The results of using LSSVR and ANNs as function approximation methods for male and female speakers are reported in Table 4.1. The obtained results indicate that, for the speaker weight estimation problem, LSSVR with a linear kernel outperforms LSSVR with an RBF kernel. It can also be seen from the table that an MLP is more effective in automatic speaker weight estimation than LSSVR. The obtained results show that the proposed method is more effective in weight estimation for male speakers than for female speakers.

Table 4.1: Results of speaker weight estimation using MLPs and LSSVR. CC is the Pearson correlation coefficient between actual and estimated weight, and MAE is the mean absolute error between actual and estimated weight, in kg.

Function                  Male             Female
Approximation           CC     MAE       CC     MAE
LSSVR (RBF kernel)      0.39   12.10     0.25   10.06
LSSVR (Linear kernel)   0.42   12.11     0.30    9.77
Three-Layer NN          0.52   11.44     0.36    9.06
Four-Layer NN           0.50   11.45     0.39    9.00

∗ The bold numbers in the table indicate the best results.

The mean absolute error of estimation is 11.4 kg, 9.0 kg and 9.9 kg for male speakers, female speakers, and the pooled male and female data, respectively. As mentioned earlier, the MAE is limited in some respects, especially for a test set with a skewed distribution, which is the case in this task. This limitation is highlighted by considering a basic estimator whose output is the average weight of the speakers in the training data. When the test set was applied to this estimator, the MAE for male speakers, female speakers, and the pooled data was 13.0 kg, 9.7 kg and 12.4 kg, respectively, while the measured CC was equal to zero in all three cases. For this reason, the correlation coefficient is the preferred performance metric in this task, as it reflects the performance of the estimators in a more sensible way.

The reported CCs for automatic speaker weight estimation based on the formant parameters of running speech signals uttered by 91 speakers are 0.33 and 0.34 for male and female speakers, respectively [48]. However, in the proposed automatic speaker weight estimation, which is based on the i-vector framework, the correlation coefficients between the actual and estimated weights of the male speakers, the female speakers, and the pooled male and female data are 0.52, 0.39 and 0.59, respectively. The obtained results seem reasonable, considering that the test data in this study consist of spontaneous speech signals and that the number of speakers in the test set is considerably larger than in [48]. It can be concluded that automatic speaker weight estimation using i-vectors is more effective than estimation based on raw acoustic features.

4.5 Conclusion

In this chapter, a novel approach for automatic speaker weight estimation based on the i-vector framework was proposed. In this method, each utterance was modeled by its corresponding i-vector. Then, ANNs with different architectures and LSSVR with different kernels were employed to estimate the weight of a speaker from a given utterance. The proposed method was trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE corpora. Evaluation results showed the effectiveness of the proposed method in speaker weight estimation.


Chapter 5

Automatic Smoker Detection from Spontaneous Telephone Speech

5.1 Introduction

Cigarette smoking habit is a feature that can be inferred from a speaker's voice. Smoker detection, as a component of speaker profiling systems and behavioral informatics, is scrutinized in this chapter. The effect of smoking habits on different speech analysis systems, such as speaker gender detection, age estimation, intoxication-level recognition and emotional state identification, shows the importance of an automatic smoking habit detection system and motivates the analysis of the effects of smoking habits on speech signals.

Experimental studies show that many acoustic features of the speech signal, such as fundamental frequency, jitter and shimmer, are influenced by cigarette smoking [98, 109, 21, 45, 49]. For example, Gonzalez and Carpi studied the early effects of smoking on voice parameters. They reported differences between the perturbation parameters (smoothed pitch perturbation quotient and jitter) of early smokers and non-smokers of both genders. They found an effect of smoking on the amplitude and frequency tremor intensity indices for male subjects, and on the fundamental frequency parameters (highest, lowest and average fundamental frequencies) of early-stage female smokers. They also reported a correlation between the number of cigarettes smoked per day and the values of the fundamental frequency in female smokers and the frequency tremor intensity index in male smokers [49]. The effects of long-term smoking on the fundamental frequency of male and female smokers and non-smokers have been studied in [98], where different effects on the two genders were reported: unlike for females, the difference between the fundamental frequency of male smokers and non-smokers was significant.

Although experimental studies reveal the effect of smoking on different acoustic characteristics of speech, the relation of these acoustic cues to a speaker's smoking habit is usually complex and affected by many other factors, such as the speaker's age, gender, emotional condition and drinking habits [88, 12, 13]. Furthermore, in many practical cases there is no control over the available speech duration, content and language. These issues make smoking habit detection very challenging for both humans and machines [19, 62, 88]. Technical factors such as the available speech duration, environment, recording device and channel conditions also influence the estimation accuracy. In other words, in a typical practical scenario, the quality of the available speech signal and the recording conditions are not controlled, and the duration of the speech signal may vary from a few seconds to several hours.

In this study, an automatic smoker detection system operating on spontaneous telephone speech signals is proposed. To my knowledge, this is the first work on this topic, and thus the system performance cannot be compared with any baseline. However, state-of-the-art techniques developed in the speaker and language recognition fields are adopted and applied to reach a reasonable result.

One effective approach to speaker recognition involves modeling speech recordings with GMM mean supervectors and using them as features in Support Vector Machines (SVMs) [25]. Similar SVM techniques have been successfully applied to different speech processing tasks, such as speaker age estimation [38, 19]. While effective, GMM mean supervectors are of high dimensionality, resulting in a high computational cost and difficulty in obtaining a robust model in the context of limited data. In the field of speaker and language recognition, recent advances using the i-vector framework [33, 35], which provides a compact representation of an utterance in the form of a low-dimensional feature vector, have increased the classification accuracy considerably. The i-vectors have successfully replaced GMM mean supervectors in speaker age estimation too [9].

Bahari et al. have recently introduced a new framework for adaptation and decomposition of GMM weights based on a factor analysis similar to that of the i-vector framework [7, 34]. In this method, named non-negative factor analysis (NFA), the applied factor analysis is constrained such that the adapted GMM weights are non-negative and sum to unity. This method, which yields a new low-dimensional utterance representation, has been applied successfully to speaker and language/dialect recognition [8].

In this chapter, a hybrid architecture involving the NFA and the i-vector frameworks for smoking habit detection is proposed. This architecture consists of two subsystems based on the i-vectors and the NFA vectors. The score-level fusion of the i-vector-based and the NFA-based recognizers is also considered to improve the classification accuracy. In this research, the effectiveness of the proposed method is evaluated on spontaneous telephone speech signals from the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach.

The rest of the chapter is organized as follows. In Section 5.2, the problem of automatic smoker detection is formulated and the proposed approach is described. Section 5.3 explains the experimental setup. The evaluation results are presented and discussed in Section 5.4. The chapter ends with conclusions in Section 5.5.


5.2 System Description

5.2.1 Problem Formulation

In the smoking habit estimation problem, we are given a set of training data $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ denotes the $i$th utterance and $y_i$ denotes the corresponding smoking habit. The goal is to approximate a classifier function $g$ such that, for an utterance of an unseen speaker $x_{tst}$, the probability of the estimated output being classified in the correct class is maximal. That is, the estimated label $\hat{y} = g(x_{tst})$ is as close as possible to the true label, as evaluated by some performance metrics (see Section 5.3.2).

5.2.2 Utterance Modeling

The first step in smoking habit detection is converting variable-duration speech signals into fixed-dimensional vectors suitable for classification algorithms, which is performed by fitting a GMM to acoustic features extracted from each speech signal. The parameters of the obtained GMMs characterize the corresponding utterance.

When only limited data are available, fitting a separate GMM to a short utterance cannot be performed accurately, especially in the case of GMMs with a high number of Gaussians. Hence, to adapt a UBM to the characteristics of the utterances in the training and testing databases, parametric utterance adaptation techniques are usually applied. In this chapter, the i-vector framework for adapting UBM means and the NFA framework for adapting UBM weights are applied. The UBM and the methods of UBM adaptation using the i-vector and the NFA frameworks are explained in Chapter 2, Section 2.2.2.

5.2.3 Classifiers

In order to find suitable matches between the utterance modeling schemes and the recognizers in the smoking habit detection problem, five different classifiers are employed, which are described in the following subsections.

Logistic Regression (LR)

Logistic regression (LR) is a widely used classification method [18], which assumes that

$y_i \sim \mathrm{Bernoulli}\bigl(f(\theta' x_i + \theta_0)\bigr) \qquad (5.1)$

where the $y_i$ are independent, $\theta$ is a vector with the same dimension as $x$, $\theta_0$ is a constant, and $f(\cdot)$ is the logistic function, defined as:

$f(u) = \dfrac{1}{1 + e^{-u}} \qquad (5.2)$

The output of the logistic function, as shown in Figure 5.1, is a value between zero and one.

In the problem of smoker detection, we intend to model the probability that a speaker is a smoker given his/her speech. That is, $P(\mathrm{Smoker} \mid x_i) = f(\theta' x_i + \theta_0)$, where


5. Automatic Smoker Detection from Spontaneous Telephone Speech

Figure 5.1: Logistic function

$x_i$ is the feature vector corresponding to the $i$th utterance. The vector $\theta$ and the constant $\theta_0$ are the model parameters, which are found through maximum likelihood estimation (MLE), and the prime denotes transpose. The concept and relations of logistic regression, as well as the MLE procedure to estimate the parameters of the model, are explained in more detail in Appendix C.
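As a minimal illustration of the MLE procedure, the sketch below fits the LR model of Equations 5.1 and 5.2 by plain gradient ascent on the Bernoulli log-likelihood. The data are synthetic stand-ins for utterance vectors (not the thesis data), and the learning rate and iteration count are arbitrary choices for the toy problem.

```python
import numpy as np

def sigmoid(z):
    """The logistic function of Eq. 5.2."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Estimate theta, theta0 by maximising the mean Bernoulli
    log-likelihood with gradient ascent (a simple stand-in for MLE)."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ theta + theta0)   # P(Smoker | x_i) under Eq. 5.1
        theta += lr * (X.T @ (y - p)) / n  # gradient of the log-likelihood
        theta0 += lr * np.mean(y - p)
    return theta, theta0

# Toy two-class data standing in for low-dimensional utterance vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 4)), rng.normal(1, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

theta, theta0 = fit_logistic_regression(X, y)
probs = sigmoid(X @ theta + theta0)
acc = np.mean((probs > 0.5) == y)
```

The soft outputs `probs` are exactly the kind of probabilistic scores the $C_{llr}$ metric of Section 5.3.2 evaluates.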

5.2.4 Multilayer Perceptrons (MLPs)

A multilayer perceptron, labeled MLP in this chapter, is described in Chapter 2, Section 2.2.3. The only difference is that when an MLP is employed to perform a classification task, a logistic function is typically used as the activation function for the output layer, while in regression problems it should be a linear function. Appendix A explains the concept and relations of ANNs in more detail.

In this study, numerous network architectures consisting of different numbers of hidden layers, different numbers of hidden neurons, various activation functions and a variety of training algorithms are trained. The trained networks are then tested on the validation data and, based on the obtained results, the best network architectures are selected to be evaluated on the test data. Networks are implemented, trained and tested using the Matlab Neural Network Toolbox, version 6.0.2.

Naive Bayesian Classifier (NBC)

Bayesian classifiers are probabilistic classifiers based on Bayes' theorem and the maximum a posteriori hypothesis. They predict class membership probabilities, i.e., the probability that a given test sample belongs to a particular class. That is,

$P(C_l \mid x) = \dfrac{P(x \mid C_l)\, P(C_l)}{P(x)} \qquad (5.3)$

where $C_l$ is the label of the $l$th class. Since $P(x)$ is the same for all classes, the denominator can be ignored in the calculations. However, calculating $P(x \mid C_l)$ requires a large number of training samples and is computationally expensive.

The naive Bayesian classifier (NBC) is a special case of Bayesian classifiers, which assumes class-conditional independence to decrease the computational cost and the training data requirement [114]. Due to this assumption, each $P(x_i \mid C_l)$ can be


determined independently. That is,

$P(x \mid C_l) = \displaystyle\prod_{i=1}^{N} P(x_i \mid C_l) \qquad (5.4)$

In this study, class distributions are assumed to be Gaussian.
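Under the Gaussian assumption, the class-conditional likelihood of Equation 5.4 factorizes into per-dimension normal densities. A minimal sketch on toy data (the data, smoothing constant and class labels are illustrative assumptions, not the thesis setup):

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class Gaussian parameters under the naive (per-dimension
    independence) assumption of Eq. 5.4, plus the class priors P(C_l)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # small variance floor to avoid division by zero
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def gnb_log_posteriors(params, x):
    """Unnormalised log P(C_l) + sum_i log N(x_i; mu_l, var_l); the
    evidence P(x) is dropped since it is shared across classes (Eq. 5.3)."""
    scores = {}
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(3, 1, (40, 3))])
y = np.array([0] * 60 + [1] * 40)

model = fit_gnb(X, y)
scores = gnb_log_posteriors(model, np.array([2.8, 3.1, 2.9]))
pred = max(scores, key=scores.get)
```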

Gaussian Scoring (GS)

This classification approach, labeled GS in this chapter, assumes that each class has a Gaussian distribution and that a full covariance matrix is shared across all classes [67].

In this method, the score of the test vector $x_{test}$ for the $l$th class is obtained as follows:

$\mathrm{score}_l = x'_{test}\, \Psi^{-1} \bar{x}_l - \dfrac{1}{2}\, \bar{x}'_l\, \Psi^{-1} \bar{x}_l\,, \qquad l = 1, 2 \qquad (5.5)$

where the prime denotes transpose, $\bar{x}_l$ is the mean of the vectors of the $l$th class in the training dataset, and $\Psi$ is the common covariance matrix shared across all classes. Since Equation 5.5 is linear in $x_{test}$, this is a linear classifier.
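A short sketch of Equation 5.5 on toy data. The text does not specify how $\Psi$ is estimated; here it is assumed to be the pooled within-class covariance, a common choice for shared-covariance Gaussian classifiers.

```python
import numpy as np

def gaussian_scores(X_train, y_train, x_test):
    """Linear Gaussian scoring with a shared covariance Psi (Eq. 5.5):
    score_l = x'_test Psi^{-1} xbar_l - 0.5 * xbar'_l Psi^{-1} xbar_l."""
    classes = np.unique(y_train)
    means = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    # assumed estimator: pool the within-class covariance over all classes
    centred = np.vstack([X_train[y_train == c] - means[c] for c in classes])
    psi_inv = np.linalg.inv(np.cov(centred, rowvar=False))
    return {c: x_test @ psi_inv @ means[c] - 0.5 * means[c] @ psi_inv @ means[c]
            for c in classes}

# Toy stand-ins for the low-dimensional utterance vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (80, 3)), rng.normal(2, 1, (80, 3))])
y = np.array([0] * 80 + [1] * 80)

scores = gaussian_scores(X, y, np.array([1.9, 2.1, 2.0]))
pred = max(scores, key=scores.get)
```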

Von-Mises-Fisher Scoring (VMF)

This classification approach, labeled VMF in this chapter, is based on a simplified von Mises-Fisher distribution [95], which is defined as

$f(x \mid \bar{x}, \kappa) = C_d(\kappa) \exp(\kappa\, \bar{x}' x) \qquad (5.6)$

where $\bar{x}$ is the mean, $\kappa$ controls the spread of the probability mass around the mean, $C_d(\kappa)$ is the normalization constant, $d$ is the dimension of $x$, and the prime denotes transpose. In this equation, $\bar{x}' x$ is the cosine similarity between the mean and $x$.

In this method, each class is modeled using a von Mises-Fisher distribution. The mean of the vectors of the $l$th class, $\bar{x}_l$, is defined as

$\bar{x}_l = \dfrac{\sum_{i=1}^{N_l} x_i}{\bigl\| \sum_{i=1}^{N_l} x_i \bigr\|} \qquad (5.7)$

where $N_l$ is the number of utterances in the $l$th class. The VMF score of the test vector $x_{test}$ for the $l$th class is then calculated as the dot product of the test vector with the class model mean:

$\mathrm{score}_l = x'_{test}\, \bar{x}_l \qquad (5.8)$
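Equations 5.7 and 5.8 can be sketched in a few lines. One added assumption: the test vector is length-normalised before the dot product, which makes the score an explicit cosine similarity; the data are synthetic directional clusters for illustration.

```python
import numpy as np

def vmf_class_means(X_train, y_train):
    """Normalised sum of the training vectors per class (Eq. 5.7)."""
    means = {}
    for c in np.unique(y_train):
        s = X_train[y_train == c].sum(axis=0)
        means[c] = s / np.linalg.norm(s)
    return means

def vmf_scores(means, x_test):
    """Dot product of the test vector with each class-mean direction
    (Eq. 5.8); normalising x_test (an assumption here) yields the
    cosine similarity mentioned below Eq. 5.6."""
    x = x_test / np.linalg.norm(x_test)
    return {c: float(x @ m) for c, m in means.items()}

# Two classes of noisy directions, roughly along (1, 0) and (0, 1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3, 0], 0.5, (40, 2)),
               rng.normal([0, 3], 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

means = vmf_class_means(X, y)
scores = vmf_scores(means, np.array([0.2, 4.0]))
pred = max(scores, key=scores.get)
```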

5.2.5 Training and Testing

The proposed smoking habit estimation approach is depicted in Figure 5.2. During the training phase, each utterance is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. The


Figure 5.2: Block diagram of the proposed smoker detection approach in the training and testing phases.

obtained vectors of the training set are then used as features, with their corresponding smoking habit labels, to train a classifier.

During the testing phase, the utterance modeling approaches are applied to extract a high-dimensional vector from an unseen test utterance, and the smoking habit is recognized using the trained classifier.

5.3 Experimental Setup

5.3.1 Database

For the purpose of smoker detection, telephone recordings from the common protocols of the recent National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) databases are used, due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). The speakers of the NIST 2008 and 2010 SRE databases are pooled together to create a dataset of 1445 speakers. They are then divided into three disjoint parts such that 60%, 20% and 20% of all speakers are used for training, development and testing, respectively. Thus, of all 6080 utterances, 3194 are assigned to the training set, 1692 to the development set, and 1194 to the testing set. The smoking habit histograms of male and female utterances (there might be multiple utterances from each speaker) of the training, development and testing databases are depicted in Figure 5.3(a) and Figure 5.3(b), respectively.

As depicted in these figures, the problem deals with unbalanced datasets,


Figure 5.3: The smoking habit histograms of telephone speech utterances for the training, development and testing datasets for (a): male speakers and (b): female speakers.

which can make the classification problem more difficult. The effect of the imbalance in the database is alleviated by taking the distribution of each class in the training set into consideration during the training phase.

In any classification problem, we would ideally want a well-balanced test set in which all affecting parameters are kept the same across the different categories. In the case of smoking habit detection, we would like a test set in which all parameters such as speech content, language, speaker age, alcohol consumption and ethnicity of the utterances are the same in the smoking and non-smoking categories. However, forming such an ideal test set with many utterances is usually very difficult and expensive. Therefore, as in many speech technology classification studies [64, 23, 55, 90], the applied test set is formed by randomizing over all other factors.


5.3.2 Performance Metric

Two performance metrics, namely the minimum log-likelihood-ratio cost ($C_{llr,min}$) and the area under the ROC curve (AUC), are considered to evaluate the effectiveness of the proposed method. In this section, the applied performance measures are briefly described.

Minimum Log-Likelihood Ratio Cost

The log-likelihood-ratio cost ($C_{llr}$) is a performance measure for classifiers with soft, probabilistic outputs. Since this performance measure is independent of the prior distribution of the classes, it is application-independent. This method has been selected for use in the NIST SRE, and was initially developed for binary classification problems such as speaker recognition. It was further utilized in language and dialect recognition problems [24, 8, 6], which are multi-class classification tasks. $C_{llr}$ is defined as:

$C_{llr} = -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} w_i \log_2 P_i \qquad (5.9)$

where $P_i$ is the posterior probability of the true class of the $i$th utterance, $w_i$ is a weighting factor that normalizes the class proportions, and $N$ is the total number of test samples. $C_{llr}$ has the sense of a cost: a classifier with a smaller $C_{llr}$ is a better classifier. For binary classifiers, the reference value of $C_{llr}$ is $\log_2 2 = 1$, which indicates a useless classifier that extracts no information from the voice samples. $C_{llr} < 1$ indicates a useful classifier, which can be expected to make decisions with a lower average cost than decisions based on the prior alone.

$C_{llr,min}$ represents the minimum possible $C_{llr}$ which can be achieved for an optimally calibrated system [106, 24]. In this study, the FoCal Multiclass Toolkit [22] is used to calculate $C_{llr,min}$.
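Equation 5.9 itself is easy to evaluate directly (calibration to obtain $C_{llr,min}$ is what the FoCal toolkit adds on top). A minimal sketch, with one stated assumption: $w_i$ is chosen so that each class contributes equally regardless of its proportion in the test set, one common convention for this weighting.

```python
import numpy as np

def cllr(post_true, y):
    """C_llr = -(1/N) sum_i w_i log2(P_i)  (Eq. 5.9), where P_i is the
    posterior assigned to the true class of sample i.  Assumed convention:
    w_i rebalances the class proportions so each class weighs equally."""
    y = np.asarray(y)
    post_true = np.asarray(post_true, dtype=float)
    classes, counts = np.unique(y, return_counts=True)
    w = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    weights = np.array([w[c] for c in y])
    return float(-np.mean(weights * np.log2(post_true)))

# A classifier that always outputs 0.5 extracts no information: C_llr = 1
uninformative = cllr([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])
# Confident, correct posteriors push C_llr toward 0
confident = cllr([0.99, 0.98, 0.97, 0.99], [0, 0, 1, 1])
```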

Area Under the ROC Curve (AUC)

Classifiers can also be evaluated by comparing the areas under their Receiver Operating Characteristic (ROC) curves (AUCs). The ROC curve is a widely used approach to measure the efficiency of classifiers. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different operating points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A classifier with perfect discrimination has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity); therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test [117]. The area under the ROC curve takes a value between 0 and 1: a perfect classifier scores 1, and a default system (posterior equal to the prior) scores 0.5.
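The AUC can be computed without tracing the curve, via the rank (Mann-Whitney) statistic: it equals the probability that a randomly chosen positive sample outscores a randomly chosen negative one. A small sketch of that equivalence:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs in which the positive scores higher,
    counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Perfectly separated scores give an AUC of 1.0
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```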


Table 5.1: The $C_{llr,min}$ of applying different classifiers (Multilayer Perceptrons (MLP), Logistic Regression (LR), Von-Mises-Fisher Scoring (VMF), Gaussian Scoring (GS) and Naive Bayesian Classifier (NBC)) over the i-vector and the NFA frameworks.

Utterance Modeling    MLP    LR     VMF    GS     NBC
i-vector              0.86   0.86   0.90   0.98   0.93
NFA                   0.92   0.90   0.91   0.97   0.98

Table 5.2: The AUC of applying different classifiers (Multilayer Perceptrons (MLP), Logistic Regression (LR), Von-Mises-Fisher Scoring (VMF), Gaussian Scoring (GS) and Naive Bayesian Classifier (NBC)) over the i-vector and the NFA frameworks.

Utterance Modeling    MLP    LR     VMF    GS     NBC
i-vector              0.70   0.74   0.51   0.56   0.66
NFA                   0.63   0.68   0.65   0.59   0.56

5.4 Results and Discussion

This section presents the results of the proposed smoking habit detection approach. The acoustic features consist of 20 Mel-frequency cepstral coefficients (MFCCs), including energy, appended with their first and second order derivatives, forming a 60-dimensional acoustic feature vector. MFCCs are obtained using the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in state-of-the-art i-vector-based speaker recognition systems. To obtain more reliable features, Wiener filtering, speech activity detection [68] and feature warping [80] have been applied in the front-end processing.

The obtained $C_{llr,min}$ and AUC of applying the different classifiers over the i-vector-based and the NFA-based recognizers are listed in Tables 5.1 and 5.2, respectively.

As these tables show, MLPs and LR yielded more accurate results than the other applied classifiers. The MLPs used in this study were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and a variety of activation functions. Then, based on the results obtained on the development set, the best network architecture was selected for further experiments. The resulting architecture was a three-layer neural network with 150 hidden neurons. Linear and logistic activation functions were selected for the hidden layer and the output layer, respectively. The network was trained using the "gradient descent with momentum and adaptive learning rate back-propagation" training algorithm. The networks were trained using the Matlab Neural Network Toolbox. This toolbox offers only two performance functions to be minimized during training, namely the mean-square-error and the mean-absolute-error. Based on the results obtained on the development set, the mean-absolute-error was selected as the better performance function to minimize for this problem. To attenuate the effect of random


initialization, each experiment was repeated 10 times, and the most frequently observed result was reported.

The results also show that the i-vector framework, which is based on the Gaussian means, is more accurate in smoking habit detection than the NFA framework, which is based on the Gaussian weights. Different studies show that GMM weights, which entail a lower dimension compared with Gaussian mean supervectors, carry less, yet complementary, information relative to GMM means [62, 115, 11]. For example, Zang et al. applied GMM weight adaptation in conjunction with mean adaptation in a large vocabulary speech recognition system to improve the word error rate [115]. Li et al. investigated the application of GMM weight supervectors in speaker age group recognition and showed that score-level fusion of classifiers based on GMM weights and GMM means improves recognition performance [62]. In [11], feature-level fusion of i-vectors, GMM mean supervectors and GMM weight supervectors was applied to improve the accuracy of accent recognition. Therefore, to enhance the smoking habit detection accuracy, a score-level fusion of the i-vector and the NFA recognizers was applied using MLPs and LR. The fusion was performed by training a logistic regression on the outputs of the logistic regressions, and a three-layer neural network on the outputs of the MLPs, on the development dataset.

To this end, as illustrated in Figure 5.4, each utterance of the training, development and testing sets was mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. During the training phase, the obtained i-vectors and NFA vectors of the training set were used as features, with their corresponding smoking habit labels, to train model-1 and model-2, respectively.

During the development phase, the obtained i-vectors and NFA vectors of the development set were applied to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 were then concatenated. This vector, along with the corresponding smoking habit labels of the development set, was used to train model-3 to fuse the scores.

Finally, during the testing phase, the obtained i-vectors and NFA vectors of the test set were applied to the trained model-1 and model-2, respectively. Then, the scores of model-1 and model-2 were concatenated and applied to the trained model-3 to detect the labels of the test speakers.
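The fusion step above (a logistic regression trained on the concatenated per-recognizer scores) can be sketched as follows. The score columns here are synthetic stand-ins for the i-vector-based and NFA-based recognizer outputs, not real system scores.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical development-set scores: column 0 mimics the i-vector
# recognizer, column 1 the NFA recognizer (class-1 scores sit higher)
rng = np.random.default_rng(2)
y_dev = rng.integers(0, 2, 200)
dev_scores = np.column_stack([
    y_dev + rng.normal(0, 0.7, 200),   # stand-in i-vector recognizer output
    y_dev + rng.normal(0, 0.9, 200),   # stand-in NFA recognizer output
])

# "Model-3": logistic regression on the concatenated scores, fitted by
# gradient ascent on the log-likelihood
w, b = np.zeros(2), 0.0
for _ in range(3000):
    p = sigmoid(dev_scores @ w + b)
    w += 0.1 * dev_scores.T @ (y_dev - p) / len(y_dev)
    b += 0.1 * np.mean(y_dev - p)

fused = sigmoid(dev_scores @ w + b)   # fused soft scores
acc = np.mean((fused > 0.5) == y_dev)
```

At test time, the same `w, b` would be applied to the concatenated test-set scores, mirroring the pipeline described above.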

The $C_{llr,min}$ and AUC of the MLP and LR classifiers after score-level fusion are compared in Table 5.3. It can be seen from the table that the score-level fusion of the i-vector-based and the NFA-based recognizers significantly improves the accuracy of smoking habit detection. This improvement is more evident when an MLP is used.

Using MLP, the relative improvements of $C_{llr,min}$ obtained by the proposed fusion method compared with the i-vector and the NFA frameworks are 3.5% and 9.8%, respectively. The relative improvements of AUC compared with the i-vector and the NFA frameworks are 6.6% and 16%, respectively.

The ROC curves of the proposed fusion for male and female speakers are illustrated in Figure 5.5(a) and Figure 5.5(b), respectively.


Figure 5.4: Block diagram of the proposed smoker detection approach for score-level fusion of the i-vector-based recognizer (model-1) and the NFA-based recognizer (model-2). (U.M. stands for utterance modeling.)

5.5 Conclusions

In this chapter, an automatic smoking habit detection approach for spontaneous telephone speech signals was proposed. In this approach, each utterance was modeled using the i-vector and the NFA frameworks by applying factor analysis to GMM means and weights, respectively. For each utterance modeling method, five different classifiers, namely MLP, LR, VMF, GS and NBC, were employed to find suitable matches between the utterance modeling schemes and the classifiers. Furthermore, score-level fusion of the i-vector-based and the NFA-based recognizers was performed to improve the classification accuracy. The proposed method was evaluated on telephone speech signals of speakers whose smoking habits were known, drawn from the NIST 2008 and 2010 SRE databases. The results show that the AUC for both male and female speakers after the score-level fusion is 0.75, and the $C_{llr,min}$ for


Table 5.3: Comparison of the $C_{llr,min}$ and AUC of the MLP and LR classifiers after score-level fusion of the i-vector-based and the NFA-based recognizers, along with the relative improvements (R.I.) in $C_{llr,min}$ and AUC compared with the i-vector framework.

Classifier    Cllr,min   R.I.    AUC    R.I.
LR            0.84       2.3%    0.75   1.3%
MLP           0.83       3.5%    0.75   6.6%

Figure 5.5: The ROC curves (true positive rate versus false positive rate) of the proposed method for (a): male speakers (AUC = 0.75), and (b): female speakers (AUC = 0.75).

male and female speakers after the score-level fusion are 0.84 and 0.82, respectively.


Chapter 6

Multitask Speaker Profiling

6.1 Introduction

A simple approach to estimating speaker features is to investigate each trait as a single task and in isolation, ignoring the task relatedness. Previous studies mostly focused on independent evaluation of speaker profiling tasks as single tasks. However, there might be a meaningful relation between some characteristics of a speaker. In other words, some traits of a speaker are influenced by other characteristics. For instance, the perceived age of smokers is different from that of non-smokers of the same calendar age [21]. In [110], the authors demonstrated that emotional behaviors highly depend on gender; that is, males and females express different emotional behaviors. There might also be a relation between the body size (weight/height) of a speaker and his/her age. Thus, providing other characteristic or behavioral information about a speaker in the form of a multitask learning (MTL) approach might improve the accuracy of single-task speaker profiling.

In contrast to single-task learning (STL) systems, in which each task is learned in isolation, in MTL related tasks are learned simultaneously by extracting and utilizing appropriate shared information across tasks. MTL has recently attracted a lot of attention in the machine learning community, since learning multiple regression and classification tasks simultaneously and in parallel makes it possible to model mutual information between the tasks and consequently improves recognition performance on the individual tasks. This concept is inspired by human learning, in which tasks are usually learned in interaction with several related tasks.

Extensive experimental and theoretical studies have demonstrated the effectiveness of multitask learning in improving performance relative to learning each task in isolation [3, 14, 27, 40, 102]. In experiments conducted by Lu et al. [63], the performance of automatic speech recognition in noise, as the main task, was improved by including speech enhancement and gender recognition as additional tasks. Weninger et al. have also significantly improved the recognition of speaker traits and states by applying MTL to this problem [111]. In [110, 108], the performance of an automatic emotion recognition system was improved by considering gender information. This improvement stems from the fact that males and females


Figure 6.1: The scatter plots of (a): age and height, (b): age and weight, and (c): height and weight of speakers in the NIST 2008 and 2010 SRE databases for both genders.

express their emotions in different manners, and an MTL system can better discriminate these differences than a single-task automatic emotion recognition system. The authors in [92] applied MTL to improve the accuracy of single-task speaker classification by providing speakers' height, age and race information.

To implement an MTL system, a neural network with hidden layers is an appropriate approach, as it provides a shared representation among the tasks. In multilayer perceptrons (MLPs), each task is associated with a target, which means that implementing MTL with a neural network requires adding extra targets corresponding to the new tasks. By providing additional tasks, the network learns internal representations that minimize the errors of the main and extra tasks. The learned information is thus shared when tasks are learned in parallel, meaning that a network learns several related tasks more efficiently than when each task is learned in isolation [26].

Considering automatic estimation of height, weight, age and smoking habit as single-task problems, training classifiers and regressors to predict multiple related variables simultaneously can improve the performance of speaker profiling. The MTL improvement relies on mechanisms that depend on the relatedness of the tasks [27]. That is, if the tasks are related, MTL finds the underlying patterns better than STL; otherwise, STL may learn better. Figure 6.1 illustrates the correlation between different traits of the speakers in the NIST 2008 and 2010 SRE databases. The correlation is a useful representation of relationships between variables. The correlations between speakers' height and weight, height and age, and weight and age are 0.538, -0.103 and 0.157, respectively. However, related tasks should be correlated at the representation level and not necessarily at the output level [26]. That is, the internal representations (which are useful for the different tasks) should be correlated. Caruana discussed various criteria for tasks to be related to each other in [27].

In this chapter, the impact of multitask learning on the performance of speaker profiling systems, by means of providing additional information about a speaker, is investigated. For the purpose of comparison, the results obtained in the previous chapters are considered as baselines. The classifiers and estimators are then trained to predict


multiple related variables simultaneously. To this end, in a series of experiments, one task is considered the main task. Then, by introducing additional tasks one by one, the impact of applying MTL on the performance of the speaker profiling system is evaluated.

In this study, a hybrid architecture involving the i-vector and the NFA frameworks is proposed. This method is based on the score-level fusion of the i-vector-based and the NFA-based recognizers, which utilizes the information in GMM means and weights.

For this problem, artificial neural networks (ANNs) are employed to capture the underlying patterns between the input and output data. Evaluation on the NIST 2008 and 2010 SRE corpora shows the effectiveness of the proposed approach.

After this introduction, the problem of multitask speaker profiling is formulated and the proposed approach is described in Section 6.2. Section 6.3 explains the experimental setup. The results are presented and discussed in Section 6.4. The chapter concludes in Section 6.5.

6.2 System Description

In this section, the problem formulation and the main constituents of the proposed method are described.

6.2.1 Problem Formulation

In multitask speaker profiling, we are given a set of training data $D = \{(x_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ denotes the $i$th utterance and $\mathbf{y}_i$ denotes a vector composed of a combination of two or more of the age, height, weight and smoking habit labels. The goal is to design an estimator or classifier $g$ such that, for an utterance of an unseen speaker $x_{tst}$, the estimated labels $\hat{\mathbf{y}} = g(x_{tst})$ approximate the actual labels as well as possible in some predefined sense.

6.2.2 Utterance Modeling

Converting variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms is the first step in multitask speaker profiling. This procedure is performed by fitting a GMM to acoustic features extracted from each speech signal. The parameters of the obtained GMMs characterize the corresponding utterance.

Since the available data is limited, accurate adaptation of a separate GMM for a short utterance is not possible, especially in the case of GMMs with a large number of Gaussian components. Therefore, parametric utterance adaptation techniques should be applied to adapt a UBM to the characteristics of the utterances in the training and testing databases. In this chapter, the i-vector framework for adapting UBM means and the NFA framework for adapting UBM weights are applied. The UBM and the methods of UBM adaptation using the i-vector and the NFA frameworks are explained in Chapter 2, Section 2.2.2.


6.2.3 Function Approximations and Classifiers

In MTL, related tasks are learned simultaneously by extracting and utilizing appropriate shared information across tasks. Implementing multitask speaker profiling necessitates an architecture which involves a shared representation between tasks. Since the hidden layers of MLPs provide this property, MTL in this study is performed with ANNs.

Artificial Neural Networks

To implement an MTL system, a neural network with hidden layers is an appropriate approach, as it provides a shared representation among the tasks. The layered structure of feedforward neural networks provides the ability to share the learned information while the tasks are learned in parallel. In a neural network, there are one or more hidden layers of computational nodes, which provide a shared representation between the tasks. The number of hidden units should be large enough to capture the underlying pattern between the input and output data, and an appropriate number of hidden neurons should be selected by a trial-and-error procedure. Appendix A explains the concept and relations of ANNs in more detail.

In MLPs, each task is associated with a target, which means that implementing MTL with a neural network requires adding extra targets corresponding to the new tasks. By providing additional tasks, the network learns internal representations that minimize the errors of the main and extra tasks. The learned information is thus shared when tasks are learned in parallel, meaning that a network learns several related tasks more efficiently than when each task is learned in isolation [26].
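The shared-hidden-layer idea can be sketched in plain numpy as a stand-in for the Matlab toolbox networks used in this study: one hidden layer feeds two output targets, and backpropagation minimizes the joint error over both tasks. The data, targets and hyperparameters are all synthetic, chosen only to give the two tasks a common latent structure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical utterance vectors with two related targets driven by the
# same latent factor: a main task (age-like) and an auxiliary task
# (height-like), mimicking the task relatedness discussed above
X = rng.normal(size=(300, 8))
latent = X @ rng.normal(size=8)
Y = np.column_stack([latent + 0.1 * rng.normal(size=300),
                     0.5 * latent + 0.1 * rng.normal(size=300)])

# One hidden layer shared by both output targets: the MTL representation
H, lr = 16, 0.01
W1 = rng.normal(scale=0.3, size=(8, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.3, size=(H, 2)); b2 = np.zeros(2)

for _ in range(3000):
    A = np.tanh(X @ W1 + b1)        # shared internal representation
    P = A @ W2 + b2                 # one linear output per task
    E = (P - Y) / len(X)            # joint error over both tasks
    dA = (E @ W2.T) * (1.0 - A ** 2)  # backprop through tanh
    W2 -= lr * A.T @ E;  b2 -= lr * E.sum(axis=0)
    W1 -= lr * X.T @ dA; b1 -= lr * dA.sum(axis=0)

mse = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2))
baseline = float(np.mean((Y - Y.mean(axis=0)) ** 2))  # predict-the-mean error
```

Because both targets share the hidden layer, gradient updates from the auxiliary task shape the same representation the main task uses, which is the mechanism by which MTL can outperform training each task alone.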

To perform MTL, numerous dynamic and static network architectures with different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms are trained. The trained networks are then tested on the validation data, and based on the obtained results the best network architectures are selected for evaluation on the test data. Networks are implemented, trained and tested using the Matlab Neural Network toolbox, version 6.0.2.

6.2.4 Training and Testing

In this study, the experiments are divided into two categories based on the relatedness of the tasks. The first category proposes MTL for speaker age and smoking habit estimation. The second proposes MTL for speaker height, weight and age estimation. These two proposed multitask speaker profiling methods are depicted in Figure 6.2 and Figure 6.3, respectively.

The training and testing procedure is the same for both categories. As illustrated in the figures, each utterance of the training, development and testing sets is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2.

During the training phase, the obtained i-vectors and NFA vectors of the training set are used as features, together with their corresponding labels, to train model-1 and model-2, respectively. Model-1 is therefore an i-vector-based model, model-2 is an NFA-based model, and both models perform MTL.

Figure 6.2: Block diagram of the proposed multitask speaker profiling approach for speaker age estimation and smoking habit detection, in the training, development and testing phases. U.M. stands for utterance modeling. y^A_tr and y^S_tr represent the training labels corresponding to age and smoking habit, respectively, and ŷ^A and ŷ^S represent the estimated age and smoking habit, respectively, after applying a test sample x_tst.

During the development phase, the trained models simultaneously estimate the requested labels for the utterances of the development set. To this end, the obtained i-vectors and NFA vectors of the development set are applied to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are then concatenated and, along with the corresponding labels of the development set, are used to train model-3, which fuses the results.

Finally, during the testing phase, the trained models estimate the labels of utterances of unseen speakers (the utterances of the test set). This is performed by applying the obtained i-vectors and NFA vectors of the test set to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are then concatenated and applied to the trained model-3 to estimate the labels of the test utterances.
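The concatenate-then-fuse step at the heart of this pipeline can be condensed into a small sketch. Everything here is hypothetical scaffolding: `model3` is a stand-in for the trained three-layer fusion NN, and the single-element prediction lists stand in for real model outputs.

```python
def fuse(pred_m1, pred_m2, model3):
    """Score-level fusion as in Section 6.2.4: concatenate the outputs
    of model-1 (i-vector based) and model-2 (NFA based) and pass the
    joint vector through the trained fusion network model-3."""
    joint = list(pred_m1) + list(pred_m2)
    return model3(joint)

# Hypothetical fusion model: a fixed averaging combiner standing in
# for the trained three-layer fusion NN (model-3).
model3 = lambda z: sum(z) / len(z)
fused_age = fuse([42.0], [46.0], model3)   # -> 44.0
```

In the thesis, model-3 is itself trained on the development-set outputs of model-1 and model-2, so the combiner's weights are learned rather than fixed as here.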

Figure 6.3: Block diagram of the proposed multitask speaker profiling approach for speaker age, height and weight estimation, in the training and testing phases. U.M. stands for utterance modeling. y^A_tr, y^H_tr and y^W_tr represent the training labels corresponding to age, height and weight, respectively, and ŷ^A, ŷ^H and ŷ^W represent the estimated age, height and weight, respectively, after applying a test sample x_tst.

6.3 Experimental Setup

6.3.1 Corpus

For the purpose of MTL speaker profiling, telephone recordings from the common protocols of the recent National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) databases are used, due to the large number of speakers and because the total variability subspace requires a large amount of development data for training. The development set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2).

The speakers of the NIST 2008 and 2010 SRE databases are pooled to create a dataset of 1445 speakers. They are then divided into three disjoint parts such that 60%, 20% and 20% of all speakers are used for training, development and testing, respectively. Thus, of all 6080 utterances, 3194 are assigned to the training set, 1692 to the development set, and 1194 to the testing set.
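A speaker-disjoint split of this kind can be sketched as follows. This is a hypothetical helper, not the thesis code; the actual assignment of the 1445 speakers is not specified beyond the percentages. The key property is that the split happens at the speaker level, so no speaker contributes utterances to more than one partition.

```python
import random

def split_by_speaker(utterances, seed=0):
    """Disjoint 60/20/20 split at the SPEAKER level, so no speaker
    appears in more than one of train/dev/test (as in Section 6.3.1).
    `utterances` maps a speaker id to that speaker's utterance list."""
    speakers = sorted(utterances)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    cut1, cut2 = int(0.6 * n), int(0.8 * n)

    def gather(ids):
        return [u for s in ids for u in utterances[s]]

    return (gather(speakers[:cut1]),       # training set
            gather(speakers[cut1:cut2]),   # development set
            gather(speakers[cut2:]))       # testing set
```

Splitting per utterance instead would leak speaker identity across partitions and inflate the apparent performance.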


6.3.2 Performance Metric

To evaluate the effectiveness of the proposed MTL system, and for the purpose of comparison, the systems are evaluated with the same performance metrics used for the single-task systems in the previous chapters. The minimum Log-Likelihood-Ratio Cost (Cllr,min) and the Area Under the ROC Curve (AUC) are employed to evaluate the performance of MTL in smoking habit detection; both measures are described in Chapter 5, Section 5.3.2. To evaluate the performance of MTL in speaker height, weight and age estimation, the Pearson correlation coefficient (CC) is used, which is described in Chapter 2, Section 2.3.2.
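Both metrics are standard and can be computed directly. The following is a plain-Python sketch (not the evaluation code used in the thesis); the AUC uses the rank-sum identity rather than explicit ROC integration.

```python
import math

def pearson_cc(actual, predicted):
    """Pearson correlation coefficient, the metric used for age,
    height and weight estimation."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sa * sp)

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney identity: the
    probability that a random positive outscores a random negative,
    counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Cllr,min is omitted here since it requires the score-calibration machinery described in Chapter 5.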

6.4 Results and discussion

In this section, the proposed multitask speaker profiling approach is evaluated. The acoustic feature vector is a 60-dimensional vector consisting of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained by taking the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Wiener filtering, feature warping [80] and voice activity detection [68] have also been applied in the front-end processing to obtain more reliable features.

In this chapter, model-1 is an i-vector-based model and model-2 is an NFA-based model, both performing MTL, and ANNs were used to train the models. In the first series of experiments, the impact of applying MTL to speaker age, height and weight estimation was evaluated. To this end, in each experiment one task was considered the main task; then, by introducing additional tasks one by one, the impact of MTL on the performance of the speaker profiling system was investigated. In the second series of experiments, the effect of applying MTL to speaker age and smoking habit estimation was examined.

To find the best network architectures for model-1 and model-2, numerous dynamic and static network architectures with different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms were trained. The trained networks were then tested on the validation data, and based on the obtained results the best network architectures were selected.

Therefore, to perform MTL for speaker height, age and weight estimation, a four-layer NN with 100 and 150 neurons in the first and second hidden layers, respectively, was selected for model-1 and model-2. To perform MTL for speaker age and smoking habit estimation, a four-layer NN with 200 neurons in the first hidden layer and 300 neurons in the second hidden layer was trained for model-1 and model-2.

The fusion procedure, as described in Section 6.2.4, was performed by training a three-layer NN on the outputs of model-1 and model-2 on the development dataset. The architecture of the fusion network (model-3) consisted of 10 hidden neurons with a logistic activation function in its hidden layer. The training algorithm used to train the network was "gradient descent with momentum and adaptive learning rate back-propagation". The networks were trained using the Matlab Neural Network toolbox. This toolbox offers only two performance functions to be minimized during training, namely the mean-square error and the mean-absolute error. Based on the results obtained on the development set, the mean-absolute error was selected as the better performance function for this problem. To attenuate the effect of random initialization, the training and testing phases of each experiment were repeated 10 times, and the most frequently observed result was reported.

Table 6.1: Comparison between single-task and multitask speaker profiling for speaker height and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/age.

Method of Estimation  | CC Height (M) | CC Height (F) | CC Age (M) | CC Age (F)
Baselines (STL)       | 0.36          | 0.36          | 0.75       | 0.85
MTL (Height & Age)    | 0.36          | 0.37          | 0.75       | 0.86
Relative improvement  | 0.0%          | 2.7%          | 0.0%       | 1.2%

Table 6.1 shows the results of applying MTL to speaker age and height estimation after the score-level fusion. Although the best results for speaker height estimation in Chapter 3 were 0.41 for males and 0.40 for females using LSSVR, and the best result for male speaker age estimation in Chapter 2 was 0.76 using LSSVR, the results obtained with MLPs were taken as the baselines here, since ANNs were used for MTL. The table shows that MTL slightly improved the performance of age and height estimation for female speakers.

The results of applying MTL to speaker weight and age estimation after the score-level fusion are presented in Table 6.2. Except for male age estimation, a positive impact of MTL on the performance of speaker age and weight estimation is evident.

Unfortunately, applying MTL to speaker height and weight estimation, as reported in Table 6.3, did not improve the performance. This might be due to the fact that related tasks should be correlated at the representation level, not necessarily at the output level [26]. One hypothesis is therefore that the speaker height and weight estimation tasks are not correlated at the level of representation and hence are not related tasks.

Table 6.4 lists the results of applying MTL to speaker height, weight and age estimation after the score-level fusion. Except for male height and male age estimation, MTL had a positive impact on the accuracy of the other tasks. The improvement was most evident for speaker weight estimation.

The correlation coefficients between actual and estimated age, height and weight when the male and female data were pooled together are 0.82, 0.74 and 0.60, respectively, which shows an improvement in performance, since the corresponding values for the single-task age, height and weight estimations were 0.82, 0.60 and 0.59, respectively.

Table 6.2: Comparison between single-task and multitask speaker profiling for speaker weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated weight/age.

Method of Estimation  | CC Weight (M) | CC Weight (F) | CC Age (M) | CC Age (F)
Baselines (STL)       | 0.52          | 0.39          | 0.75       | 0.85
MTL (Weight & Age)    | 0.54          | 0.41          | 0.75       | 0.87
Relative improvement  | 3.7%          | 4.8%          | 0.0%       | 2.3%

Table 6.3: Comparison between single-task and multitask speaker profiling for speaker height and weight estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight.

Method of Estimation    | CC Height (M) | CC Height (F) | CC Weight (M) | CC Weight (F)
Baselines (STL)         | 0.36          | 0.36          | 0.52          | 0.39
MTL (Height & Weight)   | 0.36          | 0.36          | 0.52          | 0.39
Relative improvement    | 0.0%          | 0.0%          | 0.0%          | 0.0%

Table 6.4: Comparison between single-task and multitask speaker profiling for speaker height, weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight/age.

Method of Estimation         | CC Height (M) | CC Height (F) | CC Weight (M) | CC Weight (F) | CC Age (M) | CC Age (F)
Baselines (STL)              | 0.36          | 0.36          | 0.52          | 0.39          | 0.75       | 0.85
MTL (Height, Weight & Age)   | 0.36          | 0.39          | 0.56          | 0.41          | 0.75       | 0.86
Relative improvement         | 0.0%          | 7.7%          | 7.1%          | 4.8%          | 0.0%       | 1.2%

In Table 6.5, the results of applying MTL to speaker age estimation and smoking habit detection are presented. Comparing the obtained results with the baselines, we can conclude that the proposed MTL approach for smoker detection and age estimation improves the results of smoking habit detection; the performance of age estimation, however, was not improved by this MTL approach. The ROC curves of the MTL smoker detection for male and female speakers after the score-level fusion and when age information is considered are illustrated in Figure 6.4(a) and Figure 6.4(b), respectively. By applying the proposed MTL method, the AUC after the score-level fusion is 0.78 for male speakers and 0.79 for female speakers. The obtained Cllr,min values for male and female speakers are 0.80 and 0.83, respectively.

Table 6.5: Comparison between single-task and multitask speaker profiling for speaker age estimation and smoking habit detection. Cllr,min is the minimum log-likelihood-ratio cost, AUC is the area under the ROC curve, and CC is the Pearson correlation coefficient between actual and estimated age.

Method of Estimation         | Cllr,min | AUC  | CC Age (M) | CC Age (F)
Baselines (STL)              | 0.83     | 0.75 | 0.75       | 0.85
MTL (Smoking habits & Age)   | 0.81     | 0.79 | 0.75       | 0.85
Relative improvement         | 2.4%     | 5.1% | 0.0%       | 0.0%

Figure 6.4: The ROC curves of the proposed MTL smoker detection after the score-level fusion and when age information is considered, for (a) male speakers (AUC = 0.78) and (b) female speakers (AUC = 0.795).

6.5 Conclusion

In this chapter, the performance of multitask learning in speaker profiling was evaluated. In the proposed method, each utterance of the NIST 2008 and 2010 SRE databases was modeled using the i-vector and the NFA frameworks, obtained by applying factor analysis to the GMM means and weights, respectively. A hybrid architecture involving the score-level fusion of the i-vector-based and the NFA-based recognizers was proposed.

In each series of experiments, one speaker profiling task was considered the main task and extra tasks were added to it one by one. An MTL approach for speaker age, height and weight estimation and an MTL method for speaker age and smoking habit estimation were performed separately. Comparing the results of MTL with the best results obtained in the previous chapters for each single task shows that related tasks can improve the performance of the main task, whereas unrelated tasks cannot.

The experimental results show that the performance of speaker weight estimation and smoker detection improved when age information was provided; for height estimation, this improvement occurred only for female speakers. On the other hand, performing MTL for speaker height and weight estimation improved neither the height nor the weight estimates.

Furthermore, when MTL was applied to estimate age, height and weight simultaneously, an improvement in performance was observed for speaker weight estimation, as well as for female height and age estimation. The relative improvements in CC for speaker weight estimation compared with the baselines were 7.1% for males and 4.8% for females; for female age and female height estimation, they were about 1.2% and 7.7%, respectively. Cllr,min and AUC for MTL smoking habit detection also showed relative improvements of 2.4% and 5.1%, respectively.


Chapter 7

Conclusion

7.1 Summary and Contributions

In this thesis, novel approaches for four forensically important speaker profiling tasks, namely automatic speaker age, height, weight and smoking habit estimation from spontaneous telephone speech signals, have been proposed. For the speaker height and weight estimation tasks, utterances were modeled using the i-vector framework. The i-vector framework, which is based on factor analysis for GMM mean adaptation and decomposition, provides a compact representation of an utterance in the form of a low-dimensional feature vector, and can effectively substitute for GMM mean supervectors in speaker profiling tasks [9]. For the speaker age estimation and smoking habit detection tasks, utterances were modeled using both the i-vector and the NFA frameworks. The NFA framework provides a low-dimensional utterance representation and is based on a constrained factor analysis on the GMM weights such that the adapted weights are non-negative and sum to unity. This framework was shown to provide less information than the i-vector framework. However, utilizing NFA vectors in conjunction with i-vectors in speaker profiling systems can enhance the performance, thanks to the complementary information provided by the NFA framework.

In Chapter 2, a new approach for age estimation from spontaneous telephone speech signals based on a hybrid architecture of the NFA and the i-vector frameworks was proposed. This architecture consisted of two subsystems based on the i-vectors and the NFA vectors, and ANNs and LSSVR were applied to perform regression. The method is distinguished from previous age estimation methods by exploiting the information available in both the Gaussian means and the Gaussian weights through a score-level fusion of the i-vector-based and the NFA-based subsystems. The effectiveness of the proposed method was investigated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results (presented in Table 7.1) demonstrated that the score-level fusion of the i-vector-based and the NFA-based estimators decreased the mean-absolute error and enhanced the Pearson correlation coefficient between estimated and actual age compared with the i-vector framework. The proposed method improved the MAE of the baseline [9] (evaluated on the same databases) for males and females by 8.6% and 22.2%, respectively, which reflects its effectiveness in automatic speaker age estimation.

Table 7.1: The relative improvements (R.I.) in CC and MAE of the proposed speaker age estimation after score-level fusion compared with the i-vector framework. CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age.

Function Approximation | R.I. CC (M) | R.I. MAE (M) | R.I. CC (F) | R.I. MAE (F)
LSSVR (RBF)            | 4.2%        | 9.9%         | 6.1%        | 17.9%
LSSVR (Linear)         | 6.6%        | 18.2%        | 7.1%        | 21.8%
Three-Layer NN         | 4.1%        | 15.4%        | 3.6%        | -3.2%
Four-Layer NN          | 2.7%        | 12.4%        | 3.5%        | 15.8%

Table 7.2: The relative improvements (R.I.) in Cllr,min and AUC of the proposed smoking habit detection after score-level fusion compared with the i-vector framework.

Classifier | R.I. in Cllr,min | R.I. in AUC
LR         | 2.3%             | 1.3%
MLP        | 3.5%             | 6.6%

In Chapter 3 and Chapter 4, new methods for automatic estimation of speaker body size (height and weight) from spontaneous telephone speech signals were introduced. In this research, the effectiveness of the i-vector framework in estimating speakers' body size was investigated for the first time. In these methods, each utterance was modeled by its corresponding i-vector, and ANNs and LSSVR were then employed to estimate the body size of a speaker from a given utterance. The proposed methods were trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE databases. Evaluation results demonstrated the effectiveness of the proposed methods in automatic speaker height and weight estimation.

Chapter 5, for the first time, introduced automatic smoking habit detection from spontaneous telephone speech signals, based on utterance modeling with the i-vector and the NFA frameworks. For each utterance modeling method, five different classifiers, namely MLP, LR, VMF, GS and NBC, were employed. Furthermore, to improve the classification accuracy, a score-level fusion of the i-vector-based and the NFA-based recognizers was performed. The proposed method was evaluated on telephone speech signals drawn from the NIST 2008 and 2010 SRE databases. The relative improvements in Cllr,min and AUC obtained by the proposed score-level fusion compared with the i-vector framework are presented in Table 7.2; this table reflects the effectiveness of the proposed method in smoking habit detection.

The final and important contribution of this thesis was presented in Chapter 6, where the impact of multitask learning on the performance of speaker profiling systems was investigated. In the proposed method, each utterance of the NIST 2008 and 2010 SRE databases was modeled using the i-vector and the NFA frameworks. Due to task relatedness, an MTL for speaker age, height and weight estimation and an MTL for speaker age and smoking habit estimation were performed in separate experiments.

Table 7.3: The relative improvements in CC for age, height and weight estimation obtained with multitask age, height and weight estimation, compared with the baselines.

Tasks in MTL Speaker Profiling | R.I. in CC (Male) | R.I. in CC (Female)
Age estimation                 | 0.0%              | 1.2%
Height estimation              | 0.0%              | 7.7%
Weight estimation              | 7.1%              | 4.8%

Table 7.4: The relative improvements (R.I.) in Cllr,min and AUC for smoking habit detection and the R.I. in CC for age estimation obtained with multitask smoking habit and age estimation, compared with the baselines.

Tasks in MTL Speaker Profiling | R.I. in CC Age | R.I. in Cllr,min | R.I. in AUC
Age estimation                 | 0.0%           | —                | —
Smoker detection               | —              | 2.4%             | 5.1%

Experimental results show that providing age information improves the performance of speaker weight estimation and smoker detection; for height estimation, however, age information helps only for female speakers. On the other hand, performing MTL for speaker height and weight estimation has no positive impact on the performance of either the height or the weight estimation. Moreover, applying MTL to estimate age, height and weight simultaneously results in performance improvements for speaker weight estimation, female height estimation and female age estimation.

The relative improvements in CC for age, height and weight estimation obtained in multitask age, height and weight estimation compared with the baselines are presented in Table 7.3. Table 7.4 presents the relative improvements in Cllr,min and AUC for smoking habit detection and the relative improvement in CC for age estimation obtained in multitask smoking habit and age estimation compared with the baselines.

The proposed approach differs from previous speaker profiling methods in two major respects. First, the information available in both the GMM means and weights was exploited through a score-level fusion of the i-vector-based and the NFA-based recognizers and estimators. Second, by applying MTL, correlated tasks, which were usually investigated in isolation, were evaluated simultaneously and in interaction with each other.


7.2 Future Directions

Approaches to improve the performance of speaker profiling systems from the machine learning point of view can be categorized into two groups: first, modifying the input data, and second, modifying the methods of inferring paralinguistic information from the speech signals.

Feature-level fusion is an approach which provides complementary information to the recognizers. Providing additional related features which are postulated to correlate with the target attribute can enhance the performance of speaker profiling tasks. Subglottal resonance frequencies, for instance, are known to correlate with physical height [5, 4], so feature-level fusion (and normalization) of the i-vectors and subglottal resonance frequencies might improve the accuracy of height estimation.

Modifying the methods of inferring paralinguistic information from speech signals can be done by using different machine learning algorithms. In this study, several methods such as LSSVR, ANNs, LR, NBC, GS and VMF were employed. For each speaker profiling task, one should look for a classifier or an estimator that matches the utterance modeling scheme. The fusion of different classifiers and estimators has also been shown to improve the performance of speaker recognition systems.

The success of employing MTL for speaker age, height, weight and smoking habit estimation was observed in this thesis. This idea can be extended to other speaker profiling tasks, such as accent, language and dialect recognition, to improve the performance of speaker profiling systems.


Appendices


Appendix A

Artificial Neural Networks (ANNs)

In this appendix, the concepts and equations of artificial neural network theory are reviewed in brief. A more extensive treatment of this topic can be found in [53, 43, 36, 100].

A.1 Multilayer Perceptron Neural Networks

After the introduction of the neural network concept, many researchers successfully applied neural networks to different applications. A significant improvement was obtained after the introduction of the multilayer perceptron in conjunction with the back-propagation technique for learning the interconnection weights from given input/output patterns. While a single perceptron can realize only a linear decision boundary in the input space, a multilayer perceptron is able to construct a nonlinear decision boundary.

A single neuron, as depicted in Figure A.1, can be modeled as a nonlinear element. The output of a perceptron is a nonlinear activation function applied to a weighted sum of the input signals. That is,

o_j = ϕ(∑_{i=1}^{n} w_i x_i + b)   (A.1)

where b is a bias or threshold, and the nonlinear function ϕ is typically of the saturating type, e.g. tanh(·).
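Equation A.1 translates directly into code; a minimal plain-Python sketch with tanh as the saturating nonlinearity:

```python
import math

def perceptron(x, w, b):
    """Single neuron of Equation A.1: o = tanh(sum_i w_i * x_i + b),
    using tanh as the saturating nonlinearity."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The saturation of tanh guarantees the neuron's output stays in (−1, 1) no matter how large the weighted sum becomes.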

A multilayer perceptron (shown in Figure A.2) is constructed by adding one or more hidden layers, and can be modeled as:

y = W tanh(Vx + B)   (A.2)

where x ∈ ℝ^m is the input vector, y ∈ ℝ^{p_y} is the output, B ∈ ℝ^{n_h} is the bias vector, V ∈ ℝ^{n_h×m} is the interconnection matrix of the hidden layer, and W ∈ ℝ^{p_y×n_h} is the interconnection matrix of the output layer.


Figure A.1: The structure of a single neuron.

Figure A.2: The structure of a multilayer perceptron (MLP).

The activation functions commonly used in feedforward neural networks are the logistic, hyperbolic tangent and linear functions. The logistic function takes the following form; its output lies between 0 and 1:

ϕ(x) = 1 / (1 + exp(−ax)),   a > 0   (A.3)

The general form of the hyperbolic tangent function, whose output lies between −1 and 1, is defined as:

ϕ(x) = (1 − exp(−2x)) / (1 + exp(−2x))   (A.4)

The linear function is defined as:

ϕ(x) = x (A.5)

Selecting an appropriate activation function for each layer depends on the application. For regression problems, a linear function is employed in the output layer, while for classification problems the output layer can take a logistic or hyperbolic tangent function. Depending on the degree of nonlinearity of the problem, all types of activation functions can be used in the hidden layers.
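Equations A.3–A.5 written out as code (a plain-Python sketch):

```python
import math

def logistic(x, a=1.0):
    """Equation A.3: output in (0, 1); a > 0 controls the slope."""
    return 1.0 / (1.0 + math.exp(-a * x))

def hyperbolic_tangent(x):
    """Equation A.4: output in (-1, 1); algebraically equal to tanh(x)."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def linear(x):
    """Equation A.5: the identity, used in regression output layers."""
    return x
```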

From the universal approximation theorem [54], a feedforward network with a single hidden layer containing a finite number of neurons is sufficient to compute a uniform approximation for a given training set and its desired outputs. The number of hidden neurons has to be chosen by a trial-and-error procedure, since there is no general rule to calculate the best number of hidden units.

Barron [15] demonstrated that, under certain conditions, MLPs with one hidden layer have an approximation error of order O(1/n_h), independent of the dimension of the input data. Barron also showed that models based on MLPs can handle higher-dimensional inputs better than polynomial expansions, since the approximation error of a polynomial expansion is of order O(1/n_x^{2/m}), where m is the dimension of the input data and n_x is the number of terms in the expansion.

A.2 Regression and Classification using MLPs

The back-propagation (BP) algorithm, which can be considered an extension of the LMS algorithm [52], was the first method for training MLPs. In this algorithm, we are given a set of training data {(x_i, y_i)}_{i=1}^{N}, where N denotes the number of training samples, and the goal is to minimize the residual squared-error cost function over the unknown interconnection weights by means of a steepest-descent local optimization algorithm. That is,

min_{θ ∈ ℝ^p}  J(θ) = (1/N) ∑_{i=1}^{N} ‖y_i − f(x_i; θ)‖²₂   (A.6)

where θ ∈ ℝ^p is the vector of p unknown interconnection weights. The gradient can be computed for any number of hidden layers by means of the recursive equations of the generalized delta rule.
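A steepest-descent update on the cost of Equation A.6 can be illustrated on a one-parameter toy model. This is a sketch of the principle only, not of full back-propagation through an MLP; the model f(x; θ) = θx and the target y = 3x are hypothetical.

```python
def gradient_descent_step(theta, xs, ys, f, grad_f, lr=0.1):
    """One steepest-descent update on J(theta) of Equation A.6,
    J(theta) = (1/N) * sum_i (y_i - f(x_i; theta))^2, for a scalar
    parameter; back-propagation generalizes this to all weights."""
    n = len(xs)
    dJ = sum(-2.0 * (y - f(x, theta)) * grad_f(x, theta)
             for x, y in zip(xs, ys)) / n
    return theta - lr * dJ

# Toy model f(x; theta) = theta * x fitted to data generated by y = 3x.
f = lambda x, t: t * x
grad_f = lambda x, t: x          # df/dtheta
theta = 0.0
for _ in range(100):
    theta = gradient_descent_step(theta, [1.0, 2.0], [3.0, 6.0], f, grad_f)
# theta converges towards 3.0
```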

Different training algorithms have been proposed over the last decades [86, 46, 65] to increase the training speed, provide more memory-efficient methods and obtain better convergence properties. Some commonly used algorithms are described briefly in Section A.2.1.

As illustrated in Figure A.3, the input patterns are propagated along a forward path towards the output layer. The error calculated at the output layer, on the other hand, is back-propagated along a backward path towards the input layer. During this procedure, the values of the interconnection weights are modified.

One important issue to take into consideration during the training phase is avoiding over-fitting. To tackle this problem, one should set aside part of the training data as a validation set, which is used to decide when to stop training. Otherwise, the network starts memorizing the patterns instead of generalizing.
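The validation-based stopping rule can be sketched as follows; `step` and `val_error` are hypothetical callbacks standing in for one training epoch and the current validation error, and the patience threshold is an assumed design choice.

```python
def train_with_early_stopping(step, val_error, max_epochs=100, patience=5):
    """Early stopping on a held-out validation set: stop when the
    validation error has not improved for `patience` epochs, to avoid
    the memorization described above. `step` runs one training epoch;
    `val_error` returns the current validation error."""
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        e = val_error()
        if e < best - 1e-12:
            best, since_best = e, 0      # new best: reset the counter
        else:
            since_best += 1              # no improvement this epoch
            if since_best >= patience:
                break                    # validation error has plateaued
    return best
```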

In the case of function approximation problems, one can utilize the model defined in Equation A.2. This model contains a hidden layer consisting of neurons with tanh activation functions and an output layer with a linear activation function. In the case of classification problems, the output layer can take tanh or logistic sigmoid activation functions. As an alternative approach, the model used in the regression problems (Equation A.2) can be modified in such a way that a sign function is applied to its output. That is,

y = \mathrm{sign}\left[ W \tanh(Vx + B) \right] \qquad (A.7)

Figure A.3: The structure of a single-hidden-layer feedforward neural network with error back-propagation. The solid lines represent the forward paths and the dotted lines indicate the error back-propagation paths.

In binary classification, when the tanh activation function is utilized, the desired output takes either $-1$ or $+1$ for the network to be trained. In the case of the logistic sigmoid activation function, the desired output takes either 0 or 1. In multiclass classification, additional outputs should be considered to represent the various classes.

A.2.1 Training Algorithms

Most training algorithms fall into three classes: steepest descent, quasi-Newton, and conjugate gradient. The method of steepest descent makes a linear approximation of the cost function when updating the weights; in other words, it only uses first-order information about the error surface. Including a momentum term in the update equation is an attempt at using second-order information about the error surface. However, this introduces an additional parameter to be tuned, which makes the training process more complex.

In methods such as the conjugate-gradient and quasi-Newton methods, on the other hand, the supervised training of a multilayer perceptron is treated as a problem in numerical optimization. In these methods, higher-order information about the error surface is used in the training process, which improves the rate of convergence.
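The momentum heuristic mentioned above amounts to one extra state variable (the velocity) and one extra tuning parameter per update. A minimal sketch, using an arbitrary one-dimensional quadratic cost purely for illustration:

```python
def momentum_step(w, grad, velocity, lr=0.05, mu=0.9):
    """One steepest-descent step with a momentum term.

    The velocity accumulates an exponentially decaying sum of past
    gradients; mu is the additional parameter that must be tuned.
    """
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Minimize J(w) = w^2 (gradient 2w), starting from w = 5.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, 2 * w, v)
```

With `mu = 0` this reduces to plain steepest descent; a nonzero `mu` lets past gradients carry the iterate through shallow regions of the error surface.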

Some commonly used training algorithms are described in brief as follows:

• The Levenberg-Marquardt (LM) algorithm uses step-size damping by regularizing the Hessian matrix and exhibits fast training in comparison with BP [50].


• In the Fletcher-Reeves conjugate gradient training algorithm (CGF), the search direction is computed from the new gradient and the previous search direction, based on the Fletcher-Reeves variation of the conjugate gradient method [86].

• BFGS quasi-Newton backpropagation is a quasi-Newton method for back-propagation that converges in few iterations but requires more computation per iteration [46].

• The scaled conjugate gradient backpropagation (SCG) algorithm updates weight and bias values according to the scaled conjugate gradient method. In this algorithm, backpropagation is used to calculate the derivatives of the performance with respect to the weight and bias variables. The scaled conjugate gradient is based on conjugate directions, but this algorithm does not perform a line search at each iteration [71, 36].

• The gradient descent with momentum and adaptive learning rate backpropagation (GDX) algorithm updates weight and bias values according to gradient descent with momentum and an adaptive learning rate. For each iteration, if the performance decreases toward the goal, the learning rate is increased. If the performance worsens by more than a pre-defined factor, the learning rate is adjusted and the weight change that worsened the performance is discarded. This algorithm is usually much slower than the other methods; however, algorithms that converge too fast might overshoot the minimum-error point [36].

• The gradient descent with adaptive learning rate backpropagation (GDA) algorithm updates weights and biases in accordance with gradient descent with an adaptive learning rate. At each iteration, the learning rate is increased if the performance decreases toward the goal [82, 36].

• The one step secant backpropagation (OSS) algorithm updates weights and biases in accordance with the one-step secant method. In each subsequent iteration, the search direction is computed from the new gradient and the previous steps and gradients [17].

For networks with, for instance, more than about one thousand interconnection weights, it might be better to apply conjugate gradient methods, since they do not require storing huge matrices.

A.3 Limitations of ANNs

In the previous section, the ability of ANNs to approximate any linear or nonlinear relation between input and output data, given an appropriate architecture and weights, was introduced as an important property. However, MLPs have limitations in data analysis, which are described as follows:

• the final result depends highly on the initialization,


• the training phase should be repeated several hundred times,

• the training process of an ANN has a stochastic nature,

• the available dataset should be relatively large for effective ANN training,

• the training technique has a "black box" nature,

• there is a risk of overfitting,

• training of an ANN can take many hours or even days of CPU time.

The above-mentioned limitations have encouraged researchers to introduce alternative approaches and techniques, such as SVM or LSSVM, which do not suffer from these limitations in data analysis. The concept of LSSVM is elaborated in Appendix B.


Appendix B

Least Squares Support Vector Machines (LSSVM)

In this appendix, the concepts and equations regarding Least Squares Support Vector Machines (LSSVM) for both classification and function approximation problems are explained in brief. A more extensive explanation of this topic can be found in [100].

The SVM is a machine learning algorithm with potential for classification and regression [16]. SVM was initially developed for classification, but the theory was extended to perform function approximation by Drucker [39]. In [107], a detailed description of the SVM theory for both classification and regression is provided. LSSVM is a simplified version of the standard SVM algorithm for classification and function estimation, which maintains the advantages and the attributes of the original SVM theory [100]. LSSVM exhibits excellent generalization performance and is associated with low computational costs [100].

While SVMs solve nonlinear classification and function approximation problems by means of quadratic programming, which results in high algorithmic complexity and memory requirements, an LSSVM involves solving a set of linear equations [100], which speeds up the calculations. This simplicity is achieved at the expense of a loss of sparseness; therefore, all samples contribute to the model and, consequently, the model often becomes unnecessarily large.

B.1 Least Squares Support Vector Machines for Classification

If the training set is denoted as $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ is the $i$th input vector and $y_i \in \{-1, +1\}$ is the $i$th label, a classifier is constructed as follows:

y(x) = \mathrm{sign}\left[ w^T \varphi(x) + b \right] \qquad (B.1)

where $w$ is the vector of model variables, $b$ is the bias term, and $\varphi(\cdot) : \mathbb{R}^m \to \mathbb{R}^{n_h}$ is the mapping function which maps the input space onto a high-dimensional feature space. The goal of LSSVM is to minimize the cost function defined in Equation B.2 subject to $y_i[w^T \varphi(x_i) + b] = 1 - e_i$.

\min_{w, b, e} \; J_P(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (B.2)

where $\gamma$ is the regularization parameter. In general, $w$ might become infinite-dimensional. The Lagrangian for the above optimization problem is given by the following equation:

\mathcal{L}(w, b, e; \alpha) = J_P(w, e) - \sum_{i=1}^{N} \alpha_i \left\{ y_i [w^T \varphi(x_i) + b] - 1 + e_i \right\} \qquad (B.3)

where the $\alpha_i$ values are the Lagrange multipliers, which, thanks to the equality constraints, can be positive or negative. The conditions for optimality of Equation B.3 are given by Equation B.4.

\begin{cases}
\dfrac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i) \\
\dfrac{\partial \mathcal{L}}{\partial b} = 0 \rightarrow \sum_{i=1}^{N} \alpha_i y_i = 0 \\
\dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow \alpha_i = \gamma e_i \\
\dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \rightarrow y_i [w^T \varphi(x_i) + b] - 1 + e_i = 0
\end{cases} \qquad (B.4)

After eliminating w and e from the above conditions (Equation B.4), the following equation is obtained:

\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \qquad (B.5)

where $1_v = [1, \cdots, 1]^T$, $y = [y_1, \cdots, y_N]^T$, $e = [e_1, \cdots, e_N]^T$, $\alpha = [\alpha_1, \cdots, \alpha_N]^T$, and the kernel function can be defined as:

\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j) \qquad (B.6)

When, for instance, the radial basis function (RBF) is used as the kernel, we have $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$.

Now, the LSSVM classifier in the dual space takes the form

y(x) = \mathrm{sign}\left[ \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right] \qquad (B.7)
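To make the construction concrete, the linear system of Equation B.5 and the dual classifier of Equation B.7 can be sketched in a few lines of NumPy. The toy data and the values $\gamma = 10$, $\sigma = 1$ are arbitrary choices for the example; a full implementation is available in the LS-SVMlab toolbox [31].

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve the linear system of Equation B.5 for (b, alpha)."""
    N = len(y)
    Omega = np.outer(y, y) * rbf(X, X, sigma)
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = y                      # first row:    [0, y^T]
    M[1:, 0] = y                      # first column: [0; y]
    M[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]            # bias b, multipliers alpha

def lssvm_predict(Xnew, X, y, b, alpha, sigma=1.0):
    """Dual-space classifier of Equation B.7."""
    return np.sign(rbf(Xnew, X, sigma) @ (alpha * y) + b)

# Toy problem: two well-separated Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
b, alpha = lssvm_train(X, y)
acc = (lssvm_predict(X, X, y, b, alpha) == y).mean()
```

Note that, in contrast with a standard SVM, training reduces to one call to a linear solver, and every training sample ends up with a nonzero multiplier $\alpha_i$ (no sparseness).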


B.2 Least Squares Support Vector Machines for Regression

The derivation of the LSSVM formulation for the case of nonlinear function approximation is similar to that of the LSSVM classifier. If the model in the primal weight space is considered as

y(x) = w^T \varphi(x) + b \qquad (B.8)

where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}$, the optimization problem in the primal weight space can be formulated as given in Equation B.9, subject to $y_i = w^T \varphi(x_i) + b + e_i$.

\min_{w, b, e} \; J_P(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (B.9)

where $\gamma$ is the regularization parameter. In general, $w$ might become infinite-dimensional. The Lagrangian for the above optimization problem is given by the following equation:

\mathcal{L}(w, b, e; \alpha) = J_P(w, e) - \sum_{i=1}^{N} \alpha_i \left\{ w^T \varphi(x_i) + b + e_i - y_i \right\} \qquad (B.10)

where the $\alpha_i$ values are the Lagrange multipliers. The conditions for optimality of Equation B.10 are given by Equation B.11.

\begin{cases}
\dfrac{\partial \mathcal{L}}{\partial w} = 0 \rightarrow w = \sum_{i=1}^{N} \alpha_i \varphi(x_i) \\
\dfrac{\partial \mathcal{L}}{\partial b} = 0 \rightarrow \sum_{i=1}^{N} \alpha_i = 0 \\
\dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \rightarrow \alpha_i = \gamma e_i \\
\dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \rightarrow w^T \varphi(x_i) + b + e_i - y_i = 0
\end{cases} \qquad (B.11)

After eliminating w and e from the above conditions (Equation B.11), the following equation is obtained:

\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (B.12)

where $1_v = [1, \cdots, 1]^T$, $y = [y_1, \cdots, y_N]^T$, $\alpha = [\alpha_1, \cdots, \alpha_N]^T$, and the kernel function can be defined as:

\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j) \qquad (B.13)

Now, the LSSVM model for function approximation in the dual space takes the form

y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \qquad (B.14)

where $\alpha_i$ and $b$ are the solution to the linear system in Equation B.12. It is worth mentioning that when the radial basis function (RBF) is used as the kernel, there are just two parameters $(\gamma, \sigma)$ to be tuned, which is fewer than for standard SVMs.
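A minimal NumPy sketch of the regression case follows directly from Equations B.12 and B.14; the kernel width, the value of $\gamma$, and the toy sine-curve data are illustrative assumptions, not values from the thesis experiments.

```python
import numpy as np

def rbf(X1, X2, sigma=0.5):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_reg_train(X, y, gamma=100.0, sigma=0.5):
    """Solve the linear system of Equation B.12 for (b, alpha)."""
    N = len(y)
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = 1.0                    # first row:    [0, 1_v^T]
    M[1:, 0] = 1.0                    # first column: [0; 1_v]
    M[1:, 1:] = rbf(X, X, sigma) + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]            # bias b, multipliers alpha

def lssvm_reg_predict(Xnew, X, b, alpha, sigma=0.5):
    """Dual-space model of Equation B.14: y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf(Xnew, X, sigma) @ alpha + b

# Toy function approximation: fit y = sin(x) on 50 sample points.
X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X).ravel()
b, alpha = lssvm_reg_train(X, y)
err = float(np.max(np.abs(lssvm_reg_predict(X, X, b, alpha) - y)))
```

Only $(\gamma, \sigma)$ needed to be chosen, and the fit again comes from a single linear solve rather than a quadratic program.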

Unlike ANNs, the LSSVM has a global and unique solution, which is an important property. As mentioned before, an LSSVM involves solving a set of linear equations, which speeds up the calculations. However, this simplicity is achieved at the expense of a loss of sparseness; therefore, all samples contribute to the model and, consequently, the model often becomes unnecessarily large.


Appendix C

Logistic Regression

In this appendix, concepts and equations related to logistic regression are reviewed in brief. A more extensive explanation of this topic can be found in [18].

C.1 Logistic Regression for Binary Classification

Logistic regression (LR) is a widely used classification method [18], which assumes that

y_i \sim \mathrm{Bernoulli}\left( f(w^T x_i + w_0) \right) \qquad (C.1)

where the $y_i$ are independent, $w$ is a vector with the same dimension as $x$, $w_0$ is a constant, and $f(\bullet)$ is the logistic function, defined as:

f(\bullet) = \frac{1}{1 + e^{-(\bullet)}} \qquad (C.2)

The output of the logistic function, as shown in Figure C.1, takes a value between zero and one.

Figure C.1: Logistic function

In binary classification problems, we intend to model the probability of a certain label given its features. That is, $P(y|x_i) = f(w^T x_i + w_0)$, where $x_i$ is the feature vector corresponding to the $i$th sample. The vector $w$ and the constant $w_0$ are the model parameters, which are found through maximum likelihood estimation (MLE). MLE in the logistic regression model for binary cases is described in the next section.


C.2 MLE in the Logistic Regression model

In order to estimate the parameters of the logistic regression model, maximum likelihood estimation (MLE) is used. Let us consider $\alpha_i = \sigma(w^T x_i)$. The MLE for the parameters $w$ is:

w_{\mathrm{MLE}} = \arg\max_{w} P(D|w) \qquad (C.3)

where

P(D|w) = \prod_{i=1}^{N} P(y_i | x_i, w) = \prod_{i=1}^{N} \alpha_i^{y_i} (1 - \alpha_i)^{1 - y_i} \qquad (C.4)

where each $y_i$ is a Bernoulli random variable. Since there is a series of product operations, it is easier to first take the logarithm of the equation.

L(w) = -\log P(D|w) = -\sum_{i=1}^{N} \left[ y_i \log \alpha_i + (1 - y_i) \log(1 - \alpha_i) \right] \qquad (C.5)

Now, in order to maximize the likelihood, Equation C.5 should be minimized with respect to $w$.

\frac{\partial L(w)}{\partial w_j} = -\frac{\partial}{\partial w_j} \sum_{i=1}^{N} \left[ y_i \log \alpha_i + (1 - y_i) \log(1 - \alpha_i) \right] \qquad (C.6)

\frac{\partial}{\partial w_j} \log \alpha = \frac{\partial}{\partial w_j} \log \frac{1}{1 + e^{-w^T x}} = \frac{x_j \, e^{-w^T x}}{1 + e^{-w^T x}} = x_j (1 - \alpha) \qquad (C.7)

and

\frac{\partial}{\partial w_j} \log(1 - \alpha) = \frac{\partial}{\partial w_j} \left[ -w^T x - \log(1 + e^{-w^T x}) \right] = -x_j + x_j (1 - \alpha) = -\alpha x_j \qquad (C.8)

where $\alpha = (\alpha_1, \ldots, \alpha_n)^T$. By substituting Equations C.7 and C.8 into Equation C.6, we obtain

\frac{\partial L(w)}{\partial w_j} = \sum_{i=1}^{N} (\alpha_i - y_i) x_{ij}, \qquad \nabla_w L = A^T (\alpha - y) \qquad (C.9)

where $x_j = (x_{j1}, \ldots, x_{jd})^T$ and $A$ is the design matrix, defined as:

A = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ x_{2,1} & \cdots & x_{2,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}

Since $\alpha_i$ is a nonlinear function of $w$, the gradient in Equation C.9 cannot simply be set to zero to find the maximum. However, if the Hessian matrix is computed and shown to be positive semidefinite, it can be concluded that the function $L$ in Equation C.5 is convex. Hence, in order to find the MLE, Newton's method can be utilized.

\frac{\partial^2 L(w)}{\partial w_j \partial w_k} = \sum_{i=1}^{N} x_{ij} \frac{\partial \alpha_i}{\partial w_k} = \sum_{i=1}^{N} x_{ij} x_{ik} \, \alpha_i (1 - \alpha_i) = z_j^T B z_k, \qquad \nabla_w^2 L = A^T B A \qquad (C.10)

where $z_j = (x_{1j}, \ldots, x_{nj})^T$, $x_j = (x_{j1}, \ldots, x_{jd})^T$, and $B$ is a diagonal matrix, defined as:

B = \begin{bmatrix} \alpha_1(1 - \alpha_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \alpha_n(1 - \alpha_n) \end{bmatrix}

Since each $\alpha_i$ is positive and smaller than one, the matrix $B$ is positive semidefinite, and

\nabla_w^2 L = A^T B A = A^T B^{1/2} B^{1/2} A = (B^{1/2} A)^T (B^{1/2} A) \qquad (C.11)

Consequently, the Hessian is positive semidefinite and $L$ is a convex function. Now Newton's method, here called iteratively reweighted least squares (IRLS), can be applied. Newton's method for $d$-dimensional data is defined as

w^{t+1} = w^t - H^{-1} g = (A^T B A)^{-1} A^T B \left( A w^t - B^{-1}(\alpha - y) \right) = (A^T B A)^{-1} A^T B z^t \qquad (C.12)

where $H$ and $g$ are the Hessian and the gradient, respectively, evaluated at $w^t$, and $z^t = A w^t - B^{-1}(\alpha - y)$. Equation C.12 is the solution of the weighted least squares problem:

\arg\min_{w} \sum_{i=1}^{N} b_i \left( z_i - w^T x_i \right)^2 \qquad (C.13)

where the weights $b_i$ are defined as $\alpha_i(1 - \alpha_i)$, the diagonal entries of $B$.
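The Newton/IRLS update of Equation C.12 can be sketched directly. The toy one-dimensional data, the iteration count, and the tiny ridge term added before the linear solve (for numerical safety) are illustrative assumptions for the example.

```python
import numpy as np

def irls_logistic(A, y, iters=25):
    """Fit logistic-regression weights by Newton's method (IRLS).

    A is the n-by-d design matrix (a column of ones provides the bias w0),
    y holds 0/1 labels.
    """
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        alpha = 1.0 / (1.0 + np.exp(-A @ w))   # per-sample probabilities
        g = A.T @ (alpha - y)                  # gradient, Equation C.9
        B = np.diag(alpha * (1.0 - alpha))     # IRLS weights b_i on the diagonal
        H = A.T @ B @ A                        # Hessian, Equation C.10
        # Newton step w <- w - H^{-1} g (Equation C.12), with a small ridge.
        w = w - np.linalg.solve(H + 1e-8 * np.eye(len(w)), g)
    return w

# Toy 1-D data: the label 1 becomes likelier as x grows (true slope 2).
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 100)
y = (rng.uniform(size=100) < 1.0 / (1.0 + np.exp(-2 * x))).astype(float)
A = np.column_stack([np.ones_like(x), x])      # rows [1, x_i]
w = irls_logistic(A, y)
```

Because $L$ is convex, the Newton iterations converge to the unique MLE; the recovered slope `w[1]` should be positive, matching the generating model.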


C.3 Advantages and Limitations of Logistic Regression

Logistic regression, which is also a foundation of more complex methods like neural networks, has the advantage of being interpretable. Writing $w^T x_i = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$, the coefficients give information about how the individual variables affect the probability and which ones are more influential in the model. For instance, a positive coefficient $w_i$ means that the probability of the $i$th sample belonging to a certain class increases as the variable $x_i$ increases.

Another advantage of logistic regression is the small number of parameters, which makes parameter estimation easier in a statistical sense. The number of parameters is $(d + 1)$, where $d$ is the dimension of the input data; that is, the number of parameters increases only linearly with the dimension of the data. The method is also computationally efficient in estimating the model parameters.

On the other hand, the performance of this method depends on the problem, which means that it does not necessarily show the best performance.


Bibliography

[1] J. Ajmera and F. Burkhardt. Age and gender classification using modulation cepstrum. In Proc. Odyssey, 2008.

[2] J. D. Amerman and M. M. Parnell. Speech timing strategies in elderly adults. Journal of Voice, 20:65–67, 1992.

[3] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.

[4] H. Arsikere, G. Leung, S. Lulich, and A. Alwan. Automatic height estimation using the second subglottal resonance. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3989–3992, March 2012.

[5] H. Arsikere, G. Leung, S. Lulich, and A. Alwan. Automatic estimation of the first three subglottal resonances from adults' speech signals with application to speaker height estimation. Speech Communication, 55(1):51–70, 2013.

[6] M. H. Bahari. Automatic Speaker Characterization: Automatic Identification of Gender, Age, Language and Accent from Speech Signals. PhD thesis, KU Leuven – Faculty of Engineering Science, Belgium, May 2014.

[7] M. H. Bahari, N. Dehak, and H. Van hamme. Gaussian mixture model weight supervector decomposition and adaptation. In Internal Report. Speech Group, 2013.

[8] M. H. Bahari, N. Dehak, H. Van hamme, L. Burget, A. Ali, and J. Glass. Non-negative factor analysis of gaussian mixture model weight adaptation for language and dialect recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(7):1117–1129, July 2014.

[9] M. H. Bahari, M. McLaren, H. Van hamme, and D. Van Leeuwen. Age estimation from telephone speech using i-vectors. In Proc. Interspeech, pages 506–509, 2012.

[10] M. H. Bahari, M. McLaren, H. Van hamme, and D. Van Leeuwen. Speaker age estimation using i-vectors. Journal of Engineering Applications of Artificial Intelligence, 34:99–108, 2014.


[11] M. H. Bahari, R. Saeidi, H. Van hamme, and D. van Leeuwen. Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. In Proceedings ICASSP 2013, pages 7344–7348, 2013.

[12] M. H. Bahari and H. Van hamme. Speaker age estimation and gender detection based on supervised Non-Negative Matrix Factorization. In Proc. IEEE Workshop on Biometric Measurements and Systems for Security and Medical Applications (BIOMS), pages 1–6, 2011.

[13] M. H. Bahari and H. Van hamme. Speaker age estimation using Hidden Markov Model weight supervectors. In 11th IEEE Int. Conf. Information Science, Signal Processing and their Applications (ISSPA), pages 517–521, 2012.

[14] B. Bakker and T. Heskes. Task clustering and gating for bayesian multi-task learning. Journal of Machine Learning Research, 4:83–99, 2003.

[15] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[16] D. Basak, S. Pal, and D. Patranabis. Support vector regression. Neural Information Processing – Letters and Reviews, 11(10):203–224, 2007.

[17] R. Battiti. First and second order methods for learning: Between steepest descent and Newton's method. Journal of Neural Computation, 4(2):141–166, 1992.

[18] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[19] T. Bocklet, A. Maier, and E. Noth. Age determination of children in preschool and primary school age with GMM-based supervectors and support vector machines regression. In Proc. Text, Speech and Dialogue, pages 253–260, 2008.

[20] T. Bocklet, G. Stemmer, V. Zeissler, and E. Noth. Age and Gender Recognition Based on Multiple Systems Early vs. Late Fusion. In Proc. 11th Annual Conference of the International Speech Communication Association, pages 2830–2833, 2010.

[21] A. Braun and T. Rietveld. The influence of smoking habits on perceived age. In Proc. ICPhS, volume 95, pages 294–297, 1995.

[22] N. Brummer. FoCal Multi-class: Toolkit for Evaluation, Fusion and Calibration of Multi-class Recognition Scores, 2007.

[23] N. Brummer, L. Burget, J. H. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. van Leeuwen, P. Matejka, P. Schwarz, and A. Strasheim. Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Trans. Audio, Speech, and Lang. Process., 15(7):2072–2084, 2007.


[24] N. Brummer and D. Van Leeuwen. On calibration of language recognition scores. In IEEE Odyssey Speaker and Language Recognition Workshop, pages 1–8, 2006.

[25] W. Campbell, D. Sturim, and D. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.

[26] R. Caruana. Multi-task learning. Journal of Machine Learning, 28:41–75, 1997.

[27] R. Caruana. Multitask Learning. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1997.

[28] C.-C. Chen, P.-T. Lu, M.-L. Hsia, J.-Y. Ke, and O.-C. Chen. Gender-to-Age hierarchical recognition for speech. In Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on, pages 1–4, 2011.

[29] J. R. Cohen, T. H. Crystal, A. S. House, and E. P. Neuburg. Weighty voices and shaky evidence: a critique. Journal of the Acoustical Society of America, 68:1884–1886, 1980.

[30] C. Darwin. The Descent of Man and Selection in Relation to Sex. London: Murray, 1871.

[31] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, and J. A. K. Suykens. LS-SVMlab 1.8 toolbox. http://www.esat.kuleuven.be/sista/lssvmlab/.

[32] N. Dehak. Discriminative and Generative Approaches for Long- and Short-Term Speaker Characteristics Modeling: Application to Speaker Verification. PhD thesis, Ecole de Technologie Superieure de Montreal, Montreal, QC, Canada, 2009.

[33] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, and Lang. Process., 19(4):788–798, 2011.

[34] N. Dehak, O. Plchot, M. H. Bahari, and H. Van hamme. GMM weights adaptation based on subspace approaches for speaker verification. In SPEAKER ODYSSEY, the speaker and language recognition workshop. Submitted, 2014.

[35] N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, and R. Dehak. Language recognition via i-vectors and dimensionality reduction. In Proc. Interspeech, pages 857–860, 2011.

[36] H. Demuth, M. Beale, and M. Hagan. Neural Network Toolbox User's Guide, 2009.


[37] G. Dobry, R. Hecht, M. Avigal, and Y. Zigel. Dimension reduction approaches for SVM based speaker age estimation. In Proc. Interspeech, pages 2031–2034, 2009.

[38] G. Dobry, R. M. Hecht, M. Avigal, and Y. Zigel. Supervector Dimension Reduction for Efficient Speaker Age Estimation Based on the Acoustic Speech Signal. IEEE Trans. Audio, Speech, and Lang. Process., 19(7):1975–1985, 2011.

[39] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. Advances in neural information processing systems, pages 155–161, 1997.

[40] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[41] G. Fant. Acoustic Theory of Speech Production. The Hague: Mouton, 1960.

[42] T. W. Fitch. Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. Journal of the Acoustical Society of America, 102:1213–1222, 1997.

[43] J. A. Freeman and D. M. Skapura. Neural Networks Algorithms, Applications and Programming Techniques. Addison-Wesley Publishing Company, 1991.

[44] T. Ganchev, I. Mporas, and N. Fakotakis. Audio features selection for automatic height estimation from speech. Artificial Intelligence: Theories, Models and Applications, Lecture Notes in Computer Science, 6040:81–90, 2010.

[45] H. R. Gilbert and G. G. Weismer. The effects of smoking on the speaking fundamental frequency of adult women. Journal of Psycholinguistic Research, 3(3):225–231, 1974.

[46] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Emerald, 1981.

[47] U. G. Goldstein. An articulatory model for the vocal tracts of growing children. PhD thesis, Massachusetts Institute of Technology, 1980.

[48] J. Gonzalez. Formant frequencies and body size of speaker: a weak relationship in adult humans. Journal of Phonetics, 32:277–287, 2004.

[49] J. Gonzalez and A. Carpi. Early effects of smoking on the voice: A multidimensional study. Med. Sci. Monit., pages 649–656, 2004.

[50] M. T. Hagan and M. Menhaj. Training feed-forward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.

[51] A. Hatch, S. Kajarekar, and A. Stolcke. Within-class covariance normalization for SVM-based speaker recognition. In Proc. Interspeech, volume 4, 2006.


[52] S. Haykin. Adaptive Filter Theory. Prentice Hall, 1996.

[53] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey, 2nd edition, 1999.

[54] K. Hornik. Approximation capabilities of multilayer feedforward networks. Journal of Neural Networks, 4(2):251–257, 1991.

[55] M. Ichino, N. Komatsu, W. Jian-Gang, and Y. W. Yun. Speaker gender recognition using score level fusion by adaboost. In Proc. of 11th International Conference on Control Automation Robotics & Vision (ICARCV), pages 648–653, 2010.

[56] P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice modeling with sparse training data. IEEE Transaction on Speech and Audio Processing, 13(3):345–354, 2005.

[57] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel. A study of interspeaker variability in speaker verification. IEEE Trans. Audio, Speech, and Lang. Process., 16(5):980–988, 2008.

[58] M. Kockmann, L. Burget, and J. Cernocky. Brno University of Technology System for Interspeech 2010. In Proc. 11th Annual Conference of the International Speech Communication Association, pages 2822–2825, 2010.

[59] H. J. Kunzel. How well does average fundamental frequency correlate with speaker height and weight? Journal of Phonetica, 46:117–125, 1989.

[60] N. J. Lass and W. S. Brown. Correlational study of speakers' heights, weights, body surface areas, and speaking fundamental frequencies. Journal of the Acoustical Society of America, 63:1218–1220, 1978.

[61] N. J. Lass and M. Davis. An investigation on speaker height and weight identification. Journal of the Acoustical Society of America, 60:700–703, 1976.

[62] M. Li, K. J. Han, and S. Narayanan. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech and Language, 27(1):151–167, 2013.

[63] Y. Lu, F. Lu, S. Sehgal, S. Gupta, J. Du, C. H. Tham, P. Green, and V. Wan. Multitask learning in connectionist speech recognition. In Proc. of 10th Australian International Conference on Speech Science & Technology, Sydney, Australia, 2004.

[64] B. Ma, H. M. Meng, and M. W. Mak. Effects of device mismatch, language mismatch and environmental mismatch on speaker verification. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal (ICASSP), volume 4, pages 298–301, 2007.


[65] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[66] D. Mahmoodi, A. Soleimani, H. Marvi, F. Razzazi, M. Taghizadeh, and M. Mahmoodi. Age estimation based on speech features and support vector machine. In 3rd Computer Science and Electronic Engineering Conference, pages 60–64, 2011.

[67] D. Martínez, O. Plchot, L. Burget, O. Glembek, and P. Matejka. Language recognition in i-vectors space. Proceedings of Interspeech, Firenze, Italy, pages 861–864, 2011.

[68] M. McLaren and D. van Leeuwen. A simple and effective speech activity detection algorithm for telephone and microphone speech. In Proc. NIST SRE Workshop, 2011.

[69] H. Meinedo and I. Trancoso. Age and gender classification using fusion of acoustic and prosodic features. In Proc. INTERSPEECH, pages 2818–2821, 2010.

[70] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R. Huber, B. Andrassy, J. Bauer, et al. Comparison of four approaches to age and gender recognition for telephone applications. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–1089, 2007.

[71] M. F. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Journal of Neural Networks, 6:525–533, 1993.

[72] E. S. Morton. On the occurrence and significance of motivation-structural rules in some bird and mammal sounds. American Naturalist, 111:855–869, 1977.

[73] C. Müller, editor. Application of Speaker Classification in Human Machine Dialog Systems, volume 4343. Springer-Verlag Berlin Heidelberg, 2007.

[74] C. Müller, editor. Speaker Classification in Forensic Phonetics and Acoustics, volume 4343. Springer-Verlag Berlin Heidelberg, 2007.

[75] C. Muller and F. Burkhardt. Combining short-term cepstral and long-term pitch features for automatic recognition of speaker age. In Proc. INTERSPEECH, pages 2277–2280, 2007.

[76] J. Muller. The physiology of the senses, voice and muscular motion with mental faculties. London: Walton and Maberly, 1848.

[77] Y. Muthusamy, E. Barnard, and R. Cole. Reviewing automatic language identification. Signal Processing Magazine, IEEE, 11(4):33–41, 1994.


[78] V. E. Negus. The Comparative Anatomy and Physiology of the Larynx. Hafner, New York, 1949.

[79] G. Neiman and J. A. Applegate. Accuracy of listener judgments of perceived age relative to chronological age in adults. Folia Phoniatr Logop, 42:327–330, 1990.

[80] J. Pelecanos and S. Sridharan. Feature warping for robust speaker verification. pages 213–218, 2001.

[81] B. L. Pellom and J. H. L. Hansen. Voice analysis in adverse conditions: the Centennial Olympic Park bombing 911 call. In Proc. of the 40th Midwest Symposium on Circuits and Systems, 1997.

[82] V. P. Plagianakos, D. Sotiropoulos, and M. N. Vrahatis. An improved backpropagation method with adaptive learning rate. Technical report, University of Patras, Department of Mathematics, 1998.

[83] A. H. Poorjam, M. H. Bahari, V. Vasilakakis, and H. Van hamme. Height estimation from speech signals using i-vectors and least-squares support vector regression. In Proc. 37th International Conference on Telecommunications and Signal Processing, 2014.

[84] L. A. Ramig and R. Ringel. Effects of physiological aging on selected acoustic characteristics of voice. Journal of Speech, Language and Hearing Research, 26(1):22–30, 1983.

[85] W. J. Ryan. Acoustic aspects of the aging voice. Journal of Gerontology, 27:256–268, 1972.

[86] L. E. Scales. Introduction to Non-Linear Optimization. Springer-Verlag, 1985.

[87] S. Schotz. Perception, Analysis and Synthesis of Speaker Age. PhD thesis, Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, 2006.

[88] S. Schotz. Perception, analysis and synthesis of speaker age, volume 47. Citeseer, 2006.

[89] S. Schotz and C. Muller. A study of acoustic correlates of speaker age. Speaker Classification II, Lecture Notes in Computer Science, 4441:1–9, 2007.

[90] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan. Paralinguistics in speech and language – state-of-the-art and the challenge. Computer Speech & Language, 27(1):4–39, 2013.

[91] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, et al. The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion,


autism. In Proceedings INTERSPEECH 2013, 14th Annual Conference of theInternational Speech Communication Association, 2013.

[92] B. Schuller, M. Wöllmer, F. Eyben, G. Rigoll, and D. Arsic. Semantic speechtagging: Towards combined analysis of speaker traits. In Audio EngineeringSociety Conference: 42nd International Conference: Semantic Audio, Jul 2011.

[93] T. Shipp and H. Hollien. Perception of the aging male voice. Journal of Speech, Language and Hearing Research, 12(4):703–710, 1969.

[94] S. Shum, N. Dehak, R. Dehak, and J. Glass. Unsupervised speaker adaptation based on the cosine similarity for text-independent speaker verification. In Proc. Odyssey, 2010.

[95] E. Singer, P. Torres-Carrasquillo, D. Reynolds, A. McCree, F. Richardson, N. Dehak, and D. Sturim. The MITLL NIST LRE 2011 language recognition system. In Proc. Speaker Odyssey 2012, pages 209–215, 2012.

[96] A. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.

[97] J. A. Snyman. Practical mathematical optimization: an introduction to basic optimization theory and classical and new gradient-based algorithms, volume 97. Springer Science+Business Media, 2005.

[98] D. Sorensen and Y. Horii. Cigarette smoking and voice fundamental frequency. Journal of Communication Disorders, 15(2):135–144, 1982.

[99] W. Spiegl, G. Stemmer, E. Lasarcyk, V. Kolhatkar, A. Cassidy, B. Potard, S. Shum, Y. Song, P. Xu, P. Beyerlein, J. Harnsberger, and E. Nöth. Analyzing features for automatic age estimation on cross-sectional data. In Proc. INTERSPEECH, pages 2923–2926, 2009.

[100] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least squares support vector machines. World Scientific, 2002.

[101] D. C. Tanner and M. E. Tanner. Forensic aspects of speech patterns: voiceprints, speaker profiling, lie and intoxication detection. Lawyers and Judges Publishing, 2004.

[102] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proc. of CVPR, pages 762–769, 2004.

[103] W. A. Van Dommelen. Speaker height and weight identification: a reevaluation of some old data. Journal of Phonetics, 21:337–341, 1993.

[104] W. A. Van Dommelen and B. H. Moxness. Acoustic parameters in speaker height and weight identification: sex-specific behaviour. Language and Speech, 38:267–287, 1995.

[105] C. van Heerden, E. Barnard, M. Davel, C. van der Walt, E. van Dyk, M. Feld, and C. Müller. Combining regression and classification methods for improving automatic speaker age recognition. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5174–5177, 2010.

[106] D. van Leeuwen and M. H. Bahari. Calibration of probabilistic age recognition. In Proc. INTERSPEECH, Portland, USA, 2012.

[107] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 2000.

[108] D. Ververidis and C. Kotropoulos. Automatic speech classification to five emotional states based on gender information. In Proc. of 12th European Signal Processing Conference, pages 341–344, 2004.

[109] I. Vincent and H. Gilbert. The effects of cigarette smoking on the female voice.Logopedics Phoniatrics Vocology, 37(1):22–32, 2012.

[110] T. Vogt and E. Andre. Improving automatic emotion recognition from speech via gender differentiation. In Proc. of Language Resources and Evaluation Conference, 2006.

[111] F. Weninger, E. Marchi, and B. Schuller. Improving recognition of speaker states and traits by cumulative evidence: Intoxication, sleepiness, age and gender. In Proc. INTERSPEECH, 2012.

[112] M. Wolters, R. Vipperla, and S. Renals. Age recognition for spoken dialogue systems: Do we need it? In Proc. INTERSPEECH, pages 1435–1438, 2009.

[113] S. A. Xue and D. Deliyski. Effects of aging on selected acoustic voice parameters: Preliminary normative data and educational implications. Journal of Educational Gerontology, 21:159–168, 2001.

[114] R. Yager. An extension of the naive Bayesian classifier. Information Sciences, 176(5):577–588, 2006.

[115] X. Zhang, K. Demuynck, and H. Van hamme. Rapid Speaker Adaptation in Latent Speaker Space with Non-negative Matrix Factorization. Speech Communication, 2013.

[116] M. A. Zissman and K. M. Berkling. Automatic language identification. Speech Communication, 35(1):115–124, 2001.

[117] M. Zweig and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4):561–577, 1993.
