
Development and Evaluation of Automatic -Speaker based- Audio Identificationand Segmentation for Broadcast News Recordings Indexation

Messaoud Bengherabi, Abdenour Sehad

Centre de Developpement des Technologies Avancees

Algeria

Abstract

In this paper, we describe an automatic speaker-based audio segmentation and identification system for broadcast news indexation purposes. We specifically focus on speaker identification and audio scene detection. Speaker identification (SI) is based on state-of-the-art Gaussian mixture models, whereas the scene change detection process uses the classical Bayesian Information Criterion (BIC) and the recently proposed DISTBIC algorithm. In this work, the effectiveness of Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and Log Area Ratio (LAR) coefficients is compared for the purpose of text-independent speaker identification and speaker-based audio segmentation. Both the Fisher Discrimination Ratio feature analysis and the performance evaluation in terms of correct identification rate on the TIMIT database showed that the LPCC outperforms the other features, especially for low-order coefficients. Our experiments on the audio segmentation module showed that the DISTBIC segmentation technique is more accurate than the BIC procedure, especially in the presence of short segments.

1. Introduction and Motivation

In recent years, video indexing and retrieval has become an active field of research. Many applications, especially in entertainment, education, and medicine, have shown real needs for content-based retrieval of video information. Nevertheless, this task is still a very challenging one: in order to retrieve information from a digital collection, we cannot natively search the raw data as we do for alphanumerical information, but only some kind of descriptions summarizing its contents. Another source for collecting video indexation information, which we consider in this paper, is the audio track attached to the video data. Information about the speaker identity, for instance, is very helpful, especially in broadcast news indexation, as it generally denotes a scene change in the video stream. The motivation for our work comes from two facts:

* The processing time needed to perform video analysis is far larger than that needed to process audio data.

* It is easier to recognize a person by analysing his/her voice than by tracking his/her face in a video stream.

In this work, we focus on two major processes: (a) the speaker identification process and (b) the speaker-based audio segmentation process. Speaker identification (SI) is based on state-of-the-art Gaussian mixture models [1], whereas the scene change detection process uses the classical Bayesian Information Criterion (BIC) [2] and the recently proposed DISTBIC algorithm [3]. The block diagram of the speaker indexing system is shown below in Figure 1.

Figure 1. Block diagram of the speaker indexing system: (1) select an audio file, convert it to a WAVE file, and resample it; (2) pre-processing and feature vector extraction: Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and Log Area Ratio (LAR); (3) audio scene change detection by the Bayesian Information Criterion (BIC) or DISTBIC; (4) Gaussian Mixture Model (GMM) speaker identification.

The performance of a speaker identification system is highly dependent on the quality of the selected speech features. Speaker differences in the acoustic signal are coded in a complex way at the segmental (phoneme) level, the prosodic (suprasegmental) level, and the lexical level. Modelling of prosodic and lexical features has shown great promise in automatic speaker recognition systems lately [4]. However, segmental features are still the most popular approach because of their easy extraction and modelling. The two most popular features are Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC) [5]. In a recent work [6], it was claimed that the LAR coefficients derived from the LPC analysis outperform the MFCC parameters with less computational complexity, especially for clean speech. Our interest in this feature is also justified by its direct use in the bit stream of the GSM speech coder, owing to its robustness to quantization noise.

The rest of the paper is organized as follows: Section 2 presents a detailed description of the identification engine. Section 3 presents the audio segmentation module. Implementation details and comparative study results are given in Section 4. Finally, Section 5 concludes the paper.

2. The GMM Speaker Identification Engine

Speaker identification (SI) is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of the speakers registered in the database. It is a one-to-many comparison. The basic structure of an SI system (SIS) is shown in Figure 2.

Figure 2. Basic structure of speaker identification: the input utterance is scored against each speaker model (Speaker Model 1 through Speaker Model M) built from the speaker database, and the best-scoring model yields the speaker ID.
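The one-to-many comparison sketched in Figure 2 can be illustrated in code. The following is a minimal sketch, not the paper's implementation: it uses scikit-learn's GaussianMixture, toy 2-D features in place of real MFCC/LPCC vectors, and hypothetical speaker names.

```python
# Sketch of closed-set speaker identification with one GMM per registered
# speaker. Speaker names and features are illustrative stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Enrolment: train one GMM per registered speaker on that speaker's features.
train = {
    "spk_a": rng.normal(loc=0.0, scale=1.0, size=(500, 2)),
    "spk_b": rng.normal(loc=4.0, scale=1.0, size=(500, 2)),
}
models = {
    name: GaussianMixture(n_components=2, covariance_type="diag",
                          random_state=0).fit(feats)
    for name, feats in train.items()
}

def identify(test_feats):
    """Return the registered speaker whose model gives the highest
    average log-likelihood for the test utterance (one-to-many search)."""
    scores = {name: gmm.score(test_feats) for name, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(loc=4.0, scale=1.0, size=(80, 2))  # drawn like spk_b
print(identify(test_utt))
```

In a real system the test features would be the MFCC/LPCC/LAR frames of a short utterance, and each model would be trained with the EM procedure described in Section 2.1.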

Different classification paradigms using different modelling techniques for text-independent speaker identification can be found in [7], such as Vector Quantization modelling, neural networks, and the Gaussian Mixture Model. Vector Quantization modelling [8] is a template modelling approach in which the temporal information of the features is averaged out; a codebook is used to represent the features of the speech. However, this method suffers from the fact that the averaging process discards much speaker-dependent information, and it requires a long speech utterance (more than 20 seconds) to derive stable long-term speech statistics. A neural network [9] does not train each individual model separately; it is trained to model the decision function which best discriminates speakers within a known set. The merit of neural networks is that they require a small number of parameters and have better recognition performance compared to Vector Quantization. The drawback, however, is that the complete network must be retrained whenever a new speaker or language is added to the data set, which is very time-consuming and inconvenient. The Gaussian Mixture Model is a statistical model of the underlying sounds of a person's voice: it represents the broad acoustic classes which reflect general speaker-dependent vocal tract configurations [1]. The vocal tract configuration of a person is unique, so it can be used to model the speaker identity. The Gaussian mixture density provides a smooth approximation to the sample distribution of observations obtained from utterances by a given speaker. It is computationally efficient and only short utterances are needed; furthermore, it is text- and language-independent. For these reasons, we retained the Gaussian Mixture Model as the basis for our identification engine.

A Gaussian Mixture Model is a type of density model used to represent the speaker model, and it strictly follows probabilistic rules. Its advantages are that it is text-independent, robust, computationally efficient, and easy to implement. GMMs are commonly used for language identification, gender identification, and speaker identification. A Gaussian mixture model is composed of several Gaussian distributions, each with its own mean, variance, and weight within the mixture. This is shown in Figure 3.

Figure 3. Construction of a GMM

Assume M is the number of Gaussian components used to build the mixture model. The following equation gives the Gaussian mixture density for a D-dimensional random vector $\vec{x}$:

$$p(\vec{x}\,|\,\lambda) = \sum_{i=1}^{M} p_i\, b_i(\vec{x})$$

where $\vec{x}$ is the input D-dimensional feature vector and the component densities are

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}-\mu_i)'\,\Sigma_i^{-1}(\vec{x}-\mu_i)\right)$$

Here $b_i$ is the score of Gaussian component $i$, $\mu_i$ is its mean, $\Sigma_i$ is its covariance matrix, and $p_i$ is its weight, with $\sum_{i=1}^{M} p_i = 1$.
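As a numerical illustration of the mixture density above, the sketch below evaluates p(x | λ) for a hypothetical 2-component, 2-dimensional GMM with SciPy; all parameter values are arbitrary choices, not values from the paper.

```python
# Evaluate p(x | lambda) = sum_i p_i * b_i(x), with b_i a multivariate
# Gaussian density. Weights, means and covariances are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.3, 0.7])                          # p_i, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # mu_i
covs = [np.eye(2), 2.0 * np.eye(2)]                     # Sigma_i

def gmm_pdf(x):
    """p(x | lambda) = sum_i p_i * N(x; mu_i, Sigma_i)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

x = np.array([1.0, 1.0])
print(gmm_pdf(x))
```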

2.1 Training Procedure for Gaussian Mixture Model

Before a target speaker can be identified, an important step must be done: training the models. For every speaker, we need to collect clips that are long enough for training. Training yields a set of parameters: the mean, variance, and weight of each Gaussian component, collected in the model $\lambda = \{p_i, \mu_i, \Sigma_i\}$, where $p_i$ are the Gaussian mixture weights, $\mu_i$ the means, and $\Sigma_i$ the variances. A standard approach to training is Maximum Likelihood parameter estimation, whose aim is to find the model parameters that maximize the likelihood of the training data. Maximum Likelihood parameters can be estimated using a specialized version of the expectation-maximization (EM) algorithm [10]. The basic idea is to begin with an initial model and then estimate a new model such that the new model better represents the data. After training, we obtain the mean, variance, and weight of each Gaussian component.

Steps for training:

1. Collect a set of sound clips that is long enough for each speaker to be identified.

2. Begin with an initial model $\lambda$. Then calculate a new model $\hat\lambda$ (new means, variances, and weights) using the re-estimation formulas below.

3. Re-estimate the Gaussian mixture parameters, using the posterior

$$p(i\,|\,x_t,\lambda) = \frac{p_i\, b_i(x_t)}{\sum_{k=1}^{M} p_k\, b_k(x_t)}$$

Weights:

$$\hat p_i = \frac{1}{T}\sum_{t=1}^{T} p(i\,|\,x_t,\lambda)$$

Means:

$$\hat\mu_i = \frac{\sum_{t=1}^{T} p(i\,|\,x_t,\lambda)\, x_t}{\sum_{t=1}^{T} p(i\,|\,x_t,\lambda)}$$

Variances:

$$\hat\sigma_i^2 = \frac{\sum_{t=1}^{T} p(i\,|\,x_t,\lambda)\, x_t^2}{\sum_{t=1}^{T} p(i\,|\,x_t,\lambda)} - \hat\mu_i^2$$

4. Check whether the newly calculated parameters model the speaker better. We use part of the training data to score both models: if $p(X\,|\,\hat\lambda) > p(X\,|\,\lambda)$, we adopt the new parameters $\hat\lambda$ and continue training with them.

5. Continue training by repeating steps (3) and (4). This method is provably convergent [10]: as we iterate, the parameters come closer to the actual parameters modelling the speaker (we stop when the ratio between the change in log-likelihood across iterations and the previous log-likelihood is on the order of $10^{-4}$).

A critical factor in training is the initialization of the model parameters before applying the EM algorithm. The EM algorithm is only guaranteed to find a local maximum of the likelihood, and the likelihood function of a GMM may have several local maxima; as a result, different initial models may lead to different local maxima. In our case, we used the well-known LBG vector quantization algorithm [11] to initialize the EM, and we noticed that 15 iterations are sufficient for convergence.
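The re-estimation loop in steps 2-5 can be sketched as follows for diagonal covariances. This is an illustrative implementation, not the paper's code: a deterministic quantile-based initialization stands in for the LBG codebook initialization, and the data are synthetic.

```python
# Minimal diagonal-covariance EM for a GMM, following the weight/mean/
# variance re-estimation formulas of step 3 above.
import numpy as np

def em_gmm(X, M=2, n_iter=15):
    T, D = X.shape
    mu = np.quantile(X, np.linspace(0.25, 0.75, M), axis=0)  # simple init
    var = np.tile(X.var(axis=0), (M, 1))
    p = np.ones(M) / M
    for _ in range(n_iter):
        # E-step: posteriors p(i | x_t, lambda), computed in the log domain
        logb = (-0.5 * ((X[:, None, :] - mu) ** 2 / var).sum(-1)
                - 0.5 * np.log(2 * np.pi * var).sum(-1))      # shape (T, M)
        logw = np.log(p) + logb
        logw -= logw.max(axis=1, keepdims=True)
        post = np.exp(logw)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: the three re-estimation formulas
        Ni = post.sum(axis=0)                      # sum_t p(i | x_t, lambda)
        p = Ni / T                                 # new weights
        mu = (post.T @ X) / Ni[:, None]            # new means
        var = (post.T @ X ** 2) / Ni[:, None] - mu ** 2   # new variances
        var = np.maximum(var, 1e-6)                # variance floor
    return p, mu, var

# Two well-separated 1-D clusters; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 1)), rng.normal(5, 1, (300, 1))])
p, mu, var = em_gmm(X)
print(np.sort(mu.ravel()))
```

The variance floor is a common practical safeguard against components collapsing onto single frames; the paper does not state whether one was used.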

3. Audio Scene Change Detection

Speech segmentation is the location of boundaries between speakers in the audio stream. In recent years, many efforts have been devoted to the problem of audio segmentation by the research community, owing to the number of applications of this procedure, ranging from information extraction from audio data (e.g. broadcast news, meeting recordings) to the automatic indexing of multimedia data.

3.1 BIC Algorithm

A widely used approach to audio segmentation is based on the Bayesian Information Criterion (BIC) [2]. BIC is a likelihood criterion penalized by the number of parameters in the model; the BIC procedure chooses the model that maximizes the BIC criterion.


The BIC criterion is widely used for model identification in statistical modelling. Applied to the segmentation of audio streams, it can be explained as follows. Let us define some notation for the detection of an acoustic change in an audio stream using the BIC criterion:

* d is the number of features used to model each frame of the audio stream.
* N is the number of frames in the audio stream.
* $X = \{x_i \in \mathbb{R}^d,\ i = 1, 2, \ldots, N\}$ is the set of feature vectors representing the entire audio stream ($x_1$ is the first feature vector, $x_i$ the i-th).

Assume X is drawn from an independent multivariate Gaussian process, $X \sim N(\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the full covariance matrix describing the Gaussian process. Suppose a single change point occurs at time i. We can then view H0 and H1 as two models describing the data in different ways: H0 models the data as one Gaussian distribution, while H1 models the data as two Gaussian distributions:

$$H_0:\ x_1 \ldots x_N \sim N(\mu, \Sigma)$$
$$H_1:\ x_1 \ldots x_i \sim N(\mu_1, \Sigma_1),\quad x_{i+1} \ldots x_N \sim N(\mu_2, \Sigma_2) \qquad (1)$$

The maximum likelihood ratio statistic is:

$$R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2| \qquad (2)$$

where $\Sigma$, $\Sigma_1$, $\Sigma_2$ are the sample covariance matrices estimated from the data $\{x_1 \ldots x_N\}$, $\{x_1 \ldots x_i\}$, and $\{x_{i+1} \ldots x_N\}$ respectively, with $N_1 = i$ and $N_2 = N - i$. The BIC value can be expressed as:

$$BIC(i) = R(i) - \lambda P \qquad (3)$$

where P is the penalty, given by:

$$P = \frac{1}{2}\left(d + \frac{1}{2}\,d(d+1)\right) \log N \qquad (4)$$

and $\lambda$ is the penalty weight. If BIC(i) is positive, modelling the data as two Gaussians (H1) is better than modelling the data as one Gaussian (H0). The i that maximizes BIC(i) is the time of the change point; that is, the i achieving $\{\max_i BIC(i)\} > 0$ is the change point. The maximum likelihood estimate of the change point can therefore be expressed as:

$$\hat{t} = \arg\max_i BIC(i) \qquad (5)$$

The performance of BIC-based systems is very sensitive to the selection of the penalty weight $\lambda$. Another parameter that requires special attention is N, the size of the analysis window, since the reliability of the Gaussian estimates depends directly on this value. The influence of the parameters $\lambda$ and N, and efficient algorithms for implementing the BIC algorithm for multiple change detection, are given in [12].
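Equations (2)-(5) can be sketched directly in code. The following is an illustrative single-change-point detector, with λ = 1, full-covariance Gaussians, a synthetic 3-dimensional stream, and an arbitrary margin keeping a few frames at each end so the covariance estimates stay well-defined.

```python
# BIC change-point detection implementing Eqs. (2)-(5).
import numpy as np

def logdet_cov(X):
    """log|sample covariance| of the frames in X (rows = frames)."""
    return np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]

def bic_change_point(X, lam=1.0, margin=10):
    """Return (index, BIC value) of the best candidate change point,
    or (None, best_value) if max_i BIC(i) <= 0."""
    N, d = X.shape
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)     # Eq. (4)
    best_i, best_val = None, -np.inf
    for i in range(margin, N - margin):
        R = (N * logdet_cov(X)
             - i * logdet_cov(X[:i])
             - (N - i) * logdet_cov(X[i:]))           # Eq. (2)
        val = R - lam * P                             # Eq. (3)
        if val > best_val:
            best_i, best_val = i, val
    return (best_i, best_val) if best_val > 0 else (None, best_val)

rng = np.random.default_rng(0)
# Two "speakers": a mean shift at frame 120 in a 200-frame, 3-D stream.
X = np.vstack([rng.normal(0, 1, (120, 3)), rng.normal(3, 1, (80, 3))])
i_hat, val = bic_change_point(X)
print(i_hat)
```

The detected index should land close to the true boundary at frame 120; in a multiple-change-point system, this test is applied inside a sliding analysis window, as discussed above.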

3.2 DISTBIC Algorithm

In our software, audio scene change detection can also be performed using the DISTBIC algorithm proposed by Delacourt et al. [3]. DISTBIC is a two-pass change detection technique. In the first pass, the Generalized Likelihood Ratio (GLR) measure [3] is used to determine the speaker-turn candidates: Gaussian probability density function parameters are estimated for two-second-long adjacent windows placed at every point in the audio stream, and a segment boundary candidate is generated wherever the criterion function of the bordering windows reaches a local maximum. In the second pass, hypothesis testing with the BIC algorithm is applied to validate or discard the candidates.
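A hypothetical sketch of DISTBIC's first pass is shown below: a GLR distance curve is computed over two adjacent sliding windows, and its significant local maxima become speaker-turn candidates (the second pass would then validate each candidate with the BIC test of Section 3.1). The window length and synthetic data are illustrative only.

```python
# First pass of a DISTBIC-style segmenter: GLR distance between adjacent
# windows, each modelled by a single full-covariance Gaussian.
import numpy as np

def glr(X1, X2):
    """Generalized likelihood ratio distance between two frame windows."""
    def nld(X):  # n * log|sample covariance|
        return len(X) * np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]
    return nld(np.vstack([X1, X2])) - nld(X1) - nld(X2)

def glr_curve(X, win=50):
    """GLR distance between adjacent windows at every valid frame index."""
    return np.array([glr(X[t - win:t], X[t:t + win])
                     for t in range(win, len(X) - win)])

rng = np.random.default_rng(0)
# Synthetic stream: speaker change at frame 150 in a 300-frame, 3-D stream.
X = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(3, 1, (150, 3))])
curve = glr_curve(X, win=50)
t_peak = int(np.argmax(curve)) + 50        # offset back to frame index
print(t_peak)
```

In the full algorithm, every sufficiently prominent local maximum of the curve (not just the global one) is kept as a candidate before the BIC validation pass.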

4. Implementation and Results

For performance evaluation of the identification and segmentation modules under different parameter set-ups, we have developed a graphical framework under Visual C++ .NET. In the design of the graphical user interface (GUI), we have taken into account the structure of the whole system. Moreover, the aim of our work is to design modular software, in which some parameters of the system can be adjusted by the user, so as to encourage and help experimentation with different feature sets and tests under different conditions. This graphical user interface is a very useful tool, especially for broadcast news indexation, research, and teaching purposes: one can choose between many different parameter adjustments and therefore examine how each parameter actually affects the performance of the system. A screen-shot of the developed software is shown in Figure 4. In order to study the relative effectiveness of the different features for speaker identification, the speech corpus used consists of 462 speakers selected from the TIMIT database [13] (630 speakers). TIMIT is a clean-speech database recorded using a high-quality microphone sampled at 16 kHz. In TIMIT, each speaker produces 10 sentences; the first 8 sentences were used for training and the last 2 for testing. We should mention that silence elimination is a major problem for both speech and speaker recognition, as inaccurate endpoint detection decreases the performance of speaker identification. In our work, we have used the short-term power estimate and the short-term zero-crossing rate to find a threshold for detecting silence segments; this approach is a modified version of Rabiner's algorithm for isolated words [14]. To get an initial idea of the discrimination power of each feature, known figure-of-merit techniques based on analysis-of-variance methods were used. These methods involve the calculation of Fisher's F-ratio tests.
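An illustrative energy-plus-zero-crossing-rate silence detector in the spirit of the Rabiner-style endpointing mentioned above is sketched below; the frame sizes, threshold factor, and synthetic signal are arbitrary choices, not the paper's values.

```python
# Toy short-term power + zero-crossing-rate silence detection.
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def speech_mask(x, frame_len=256, hop=128, e_factor=0.1):
    """Mark a frame as speech when its short-term power exceeds a fraction
    of the peak frame power; the ZCR is returned so noisy low-energy
    frames could additionally be rejected."""
    frames = frame_signal(x, frame_len, hop)
    power = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return power > e_factor * power.max(), power, zcr

# 1 s of near-silence followed by 1 s of a loud 200 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.concatenate([0.001 * np.sin(2 * np.pi * 50 * t),
                    0.5 * np.sin(2 * np.pi * 200 * t)])
mask, power, zcr = speech_mask(x)
print(mask[:5], mask[-5:])
```

A production endpointer would adapt both thresholds to an initial noise estimate, as in [14], rather than use a fixed fraction of the peak power.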


Figure 4. Screen-shot of the developed software

The F-ratio is a figure of merit to evaluate the effectiveness of each feature coefficient. It is given by:

$$FDR = \frac{\text{speaker variance among classes}}{\text{speaker variance within classes}}$$

A high FDR value is desirable for SI. Figure 5 shows the F-ratio scores of the first 20 MFCC, 20 LPCC, and 20 LAR features. It can be clearly seen that the six low-order LPCC coefficients have higher F-ratio scores than their MFCC and LAR counterparts. We also see that the F-ratio scores of the first five low-order LAR coefficients are very slightly higher than the MFCC ones.

Figure 5. F-ratio scores of the MFCC, LPCC and LAR features
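The per-dimension F-ratio used for Figure 5 can be sketched as follows: the between-speaker variance of the class means over the average within-speaker variance. The two "speakers" and 2-D features below are synthetic stand-ins, chosen so that one dimension discriminates well and the other does not.

```python
# Fisher discrimination ratio (F-ratio) per feature dimension.
import numpy as np

def f_ratio(features, labels):
    """F-ratio per dimension for features (n, d) with speaker labels (n,)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = np.stack([features[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / within.mean(axis=0)

rng = np.random.default_rng(0)
# Dimension 0 separates the two "speakers" well; dimension 1 does not.
a = np.column_stack([rng.normal(0, 1, 400), rng.normal(0, 1, 400)])
b = np.column_stack([rng.normal(5, 1, 400), rng.normal(0.2, 1, 400)])
X = np.vstack([a, b])
y = np.array([0] * 400 + [1] * 400)
scores = f_ratio(X, y)
print(scores.round(2))
```

Applied to MFCC, LPCC, and LAR frames labelled by speaker, this per-dimension score is what Figure 5 plots for the first 20 coefficients of each feature set.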

The filterbank employed in the MFCC computation consists of 23 triangular filters equispaced on the mel frequency scale, and the initial coefficient was discarded from the feature vector. To assess the effectiveness of these features for closed-set text-independent speaker identification, models with 16 and 32 Gaussian components were used. The identification tests were conducted with 462 speakers, using approximately 24 seconds of speech for training and 3-4 seconds for testing. In each test, each speaker conducted 2 trials on the system (a total of 924 tests). The correct identification rates obtained for feature vector lengths of 10, 14, and 20 are listed in Tables 1-3.

Table 1. Correct identification rate (%) of the MFCC, LPCC, and LAR on the TIMIT database for d = 10

Model order | MFCC  | LPCC  | LAR
16          | 92.53 | 96.21 | 85.82
32          | 89.50 | 97.29 | 87.01

Table 2. Correct identification rate (%) of the MFCC, LPCC, and LAR on the TIMIT database for d = 14

Model order | MFCC  | LPCC  | LAR
16          | 95.77 | 98.26 | 94.26
32          | 97.29 | 98.59 | 94.58

Another important observation from Figure 5 is that the MFCC coefficients generally have higher Fisher ratio scores than the other parameters from the seventh coefficient onwards. We should mention that the parameters were extracted from Hamming-windowed frames of 256 samples (no zero padding in the FFT calculation), progressing with an overlap of 50%.

Table 3. Correct identification rate (%) of the MFCC, LPCC, and LAR on the TIMIT database for d = 20

Model order | MFCC  | LPCC  | LAR
16          | 98.05 | 98.70 | 97.18
32          | 97.18 | 98.80 | 97.07



The main conclusion from these results is that the LPCC outperforms the other parameters in all cases, especially for low-order coefficients, where the difference in performance is significant, as shown in Table 1; this is expected from the F-ratio score results. The performance of the MFCC is better than that of its LAR counterpart, which can also be expected from the F-ratio analysis.

Our experiments on the audio segmentation module, evaluated in terms of the false alarm rate (a speaker turn is detected although it does not exist) and the missed detection rate (the process does not detect an existing speaker turn), were done on two audio streams. The first was constructed by concatenating 60 wave files with an average length of 3 seconds (short segments) taken from the TIMIT database, and the second by concatenating 22 wave files from the ELSDSR (English Language Speech Database for Speaker Recognition) database [15]. The performance evaluation tests showed that the DISTBIC segmentation technique, which is organized into two passes (first the most likely speaker turns are detected, and then they are validated or discarded), is more accurate than the BIC procedure, especially in the presence of short segments. Finally, we should mention that our software has been tried on real broadcast Arabic news and gives very promising results.

5. Conclusion

In this work, we have presented and developed an automatic speaker-based audio segmentation and identification system. Being language independent makes this software a very useful tool, especially for broadcast news indexation, research, and teaching purposes. The effectiveness of Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and Log Area Ratio (LAR) coefficients was compared for the purpose of text-independent speaker identification and speaker-based audio segmentation. Both the Fisher Discrimination Ratio feature analysis and the performance evaluation in terms of correct identification rate on the TIMIT database showed that the LPCC outperforms the other features, especially for low-order coefficients. Our experiments on the audio segmentation module showed that the DISTBIC segmentation technique is more accurate than the BIC procedure, especially in the presence of short speech segments.

6. References

[1] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, Jan. 1995.

[2] S. S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion", in Proc. of the DARPA Broadcast News Transcription & Understanding Workshop, Lansdowne, VA, 1998.

[3] P. Delacourt and C. J. Wellekens, "DISTBIC: A speaker-based segmentation for audio data indexing", Speech Communication, vol. 32, Sept. 2000, pp. 111-126.

[4] D. A. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and B. Xiang, "The SuperSID project: exploiting high-level information for high-accuracy speaker recognition", in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2003), pp. 784-787, Hong Kong, 2003.

[5] P. Pravinkumar and B. M. Wasfy, "Speaker Verification/Recognition and the Importance of Selective Feature Extraction: Review", Proc. IEEE, 2001.

[6] D. Chow and W. H. Abdulla, "Speaker Identification Based on Log Area Ratio and Gaussian Mixture Models in Narrow-Band Speech", Lecture Notes in Artificial Intelligence LNAI 3157, pp. 901-908, Springer, 2004.

[7] J. P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proc. of the IEEE, vol. 85, no. 9, Sept. 1997.

[8] A. E. Rosenberg and F. K. Soong, "Evaluation of a vector quantization talker recognition system in text-independent and text-dependent modes", in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 873-880, 1986.

[9] L. Rudasi and S. A. Zahorian, "Text-independent talker identification with neural networks", Proc. IEEE ICASSP, pp. 389-392, May 1991.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", J. Royal Statist. Soc. Ser. B, vol. 39, 1977.

[11] Y. Linde, A. Buzo, and R. Gray, "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, vol. 28, no. 1, pp. 84-95, 1980.

[12] M. Cettolo and M. Vescovi, "Efficient audio segmentation algorithms based on the BIC", in Proc. of ICASSP'03, 2003.

[13] NIST (1990), The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.

[14] L. Rabiner and M. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances", The Bell System Technical Journal, vol. 54, pp. 297-315, 1975.

[15] L. Feng and L. K. Hansen, "A New Database for Speaker Recognition", Informatics and Mathematical Modelling, IMM Technical Report 2005-05, Technical University of Denmark, 2005.

0-7803-9521-2/06/$20.00 ©2006 IEEE.