Speaker Diarization of Broadcast News Audios
Submitted in partial fulfillment of the requirements
of the degree of
Bachelor of Technology and Master of Technology
by
Parthe Pandit
(Roll no. 10D070009)
Supervisor:
Prof. Preeti Rao
Department of Electrical Engineering
Indian Institute of Technology Bombay
2015
Dedicated to
George, Elaine, Kramer & Jerry
Parthe Pandit/ Prof. Preeti Rao (Supervisor): “Speaker Diarization of Broadcast News
Audios”, Dual Degree Dissertation, Department of Electrical Engineering, Indian Institute of
Technology Bombay, July 2015.
Abstract
Speaker Diarization is a multimedia indexing technology that makes use of audio information to
answer the question “Who spoke when?” This thesis presents a step-by-step speaker diarization
system implemented in MATLAB that is evaluated using the Diarization Error Rate (DER) metric.
The proposed system, designed for segmenting audio recordings of broadcast news, provides
implementations of state-of-the-art i-vectors as well as the traditional GMM speaker models. A
graphical clustering algorithm introduced by Rouvier et al. in 2013 has also been implemented.
This clustering algorithm offers lower DER as well as a computational advantage compared to
the conventional GMM based hierarchical agglomerative clustering. An unsupervised speech
activity detector (SAD) has also been developed that discards nonspeech in two stages: silence
removal followed by music removal. The music removal subsystem has been adapted to classify
speech segments with background music, e.g. news headlines sections, as speech. The proposed
SAD achieves a favourable performance on the January 2013 subset of the REPERE corpus
compared to the supervised SAD of the LIUM diarization toolkit.
Index terms: unsupervised, speech activity detection, MATLAB, ILP clustering, REPERE
Table 2.3: Annotated shows in REPERE corpus and their respective times
Chapter 3
State of the Art in Speaker Diarization
The problem of speaker diarization involves answering “Who spoke when”. It is generally broken
down into answering “is anyone speaking?” and then answering “which speaker in the audio is
speaking?” The first step is called speech activity detection, which is a pre-processing step
common in speaker recognition, speech recognition, speech coding and speech enhancement
[8]. The latter problem can be approached as finding the change in speaker (called speaker
segmentation) and then combining the contiguous segments belonging to the same speaker under
a unique label (called speaker clustering).
In the late 1990s, when research in diarization was still in its nascent stages, a few systems
attempted to perform speech activity detection as a by-product of the segmentation and
clustering [8]; nonspeech was treated as just another speaker. But owing to the acoustic
variability of nonspeech, systems with explicit speech activity detectors performed much better.
Often, the speaker segmentation and speaker clustering are performed iteratively and hence
shown as a single block [9] as in figure 3.1.
In this chapter, previously used methods in speaker diarization are reviewed, and the
state-of-the-art algorithms implemented by various systems specialized in diarization of broadcast
news, meeting recordings and telephone conversations are compared. In recent years,
the National Institute of Standards and Technology (NIST), USA has organised rich transcription
tasks for broadcast news and telephone diarization (2003-‘04) and for meeting diarization
(2005, ‘07, ‘09). The Albayzin campaign of 2010 and the ESTER (2008) [10] and REPERE (2012-14)
broadcast audio and video diarization campaigns have fueled research in broadcast news diarization
and attracted developers to participate with their diarization engines to set up benchmarks.
Some of these competitor systems have also been reviewed in this chapter.
Figure 3.1: A typical speaker diarization system
3.1 Feature Extraction for Speaker Diarization
For the task of speaker diarization, acoustic features that discriminate speaker information in the
spectrogram but are invariant to the phone sequence being uttered are desired. Mel-frequency
cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients, although not
designed to distinguish between speakers, have been used widely in the areas of speaker verifica-
tion and speaker recognition. Since a similar task of modelling speaker information is tackled in
speaker diarization, MFCCs and other cepstral features are the most commonly used features.
During speaker segmentation 12-19 MFCCs have been used along with the short time energy,
while during clustering usage of higher order derivatives of these MFCCs has been reported
[11]. LFCCs extracted using a linear filter bank instead of the Mel scale filter bank [12] and
Linear Prediction Cepstral Coefficients (LPCCs) [13] have also been tested but no conclusion
has been reached regarding the better performance of either. Typical analysis window sizes
are 25-30 ms, with frame hops of 10 ms.
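The framing parameters above can be illustrated with a short sketch. This is numpy-only and deliberately crude: the Mel filter bank of a true MFCC front end is omitted, so the output is a plain log-spectral cepstrum rather than real MFCCs; the 440 Hz test tone is an assumed stand-in for audio.

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Slice a waveform into overlapping analysis frames (25 ms window, 10 ms hop)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

def cepstral_features(x, sr, n_ceps=13):
    """Crude cepstral features: log power spectrum followed by a DCT.
    (A real MFCC front end would insert a Mel filter bank before the log.)"""
    frames = frame_signal(x, sr)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logspec = np.log(power + 1e-10)
    # Type-II DCT via an explicit matrix keeps the example numpy-only.
    n = logspec.shape[1]
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return logspec @ dct.T

sr = 16000
t = np.arange(sr) / sr               # 1 s of audio
x = np.sin(2 * np.pi * 440 * t)
feats = cepstral_features(x, sr)
print(feats.shape)                   # (98, 13): ~98 frames x 13 cepstra
```

With a 1 s signal at 16 kHz, the 400-sample window and 160-sample hop yield 98 frames, matching the 10 ms frame rate quoted above.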
For speech activity detection, acoustic features that discriminate between speech and non-
speech are sought after. Features such as energy [13], zero-crossing rate, spectral centroid,
spectral roll-off and spectral flux [14] have been used previously in speech activity detection.
However, these features have almost always been used in concatenation with cepstral
features.
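The frame-level acoustic features just listed are cheap to compute. A minimal numpy sketch (the frame matrix here is random noise standing in for real 25 ms frames, and the feature set is only the subset named above):

```python
import numpy as np

def sad_features(frames, sr):
    """Per-frame energy, zero-crossing rate, spectral centroid and spectral flux,
    as used alongside cepstral features in speech activity detection."""
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)
    # Flux of frame 0 is defined as 0 since it has no predecessor.
    flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
    return np.c_[energy, zcr, centroid, flux]

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 400))   # 50 hypothetical 25 ms frames at 16 kHz
feats = sad_features(frames, sr=16000)
print(feats.shape)                        # (50, 4)
```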
Other than the above mentioned short time analysis features, 4Hz modulation frequency
features that convey long term characteristics of the acoustic signal have also been investigated
[15], and have been applied in speaker overlap detection and speech activity detection. A
major challenge with these features, though, is their high dimensionality and
the associated computational cost. Long term cumulative features drawn over texture
windows of 500 ms, such as the median pitch, long time average spectrum, deviation of the 4th
and 5th formants, harmonics-to-noise ratio, formant dispersion etc., have been shown to be useful for
fast cluster initialization [9], while features providing vocal source and vocal tract information
[16] have shown better speaker discrimination when used along with MFCCs.
Recently, Slaney et al. used features derived as activations of the bottleneck layer of a
neural network; the network was trained to discriminate 500 ms segments as
belonging to the same or different speakers [17]. In another work [18], a 50% relative improvement was
reported for speech activity detection on a large Youtube corpus when the two dimensional soft-max
activation of a deep neural network was concatenated with 13 MFCCs. Another interesting
feature space, explored in 2011, sacrificed diarization accuracy only slightly to obtain a 10x speed-up
by using binary valued features for clustering [19]. In this work, acoustic MFCC
features of segments are transformed into a binary feature space using likelihoods obtained from
GMMs.
3.2 Segmentation
In audio segmentation, the task is to create homogeneous and contiguous chunks of audio that
are dissimilar from their neighbouring segments. It is also called acoustic change detection.
We will look at two approaches to audio segmentation with more focus on methods used in
speaker segmentation applied to speaker diarization.
3.2.1 Metric based segmentation
One of the most common audio segmentation methods to date is metric-based segmentation.
These methods are very popular in music segmentation tasks as well. In metric based segmentation,
a distance metric that indicates the similarity of two audio segments is first defined.
Then a change detection strategy is implemented using this metric. Compared to model based
methods, these methods have the great advantage of not needing any a priori information about
the data.
For music segmentation, distances are calculated between the features directly. In speech
processing, however, the features used (generally cepstral features) are not suitable for frame-wise
distance computation for comparing speaker similarity, due to their variability with the
phones uttered. To aggregate speaker information from longer segments, it is assumed that
the features of every segment come from a probability distribution. Distance comparison is
done between these probability distributions using statistical similarity measures such as the
KL divergence, the Cross Likelihood Ratio, the Bayesian Information Criterion etc. The most commonly
used probability distribution for modeling chunks of feature vectors during speaker segmentation
is the full covariance multivariate Gaussian distribution.
Bayesian Information Criterion
The Bayesian Information Criterion (BIC) is a model selection criterion, i.e., a statistical
criterion that compares available models for representing the data while guarding against
over-fitting. For a set of vectors X, the BIC of a model M is one such criterion, defined as:
$$\mathrm{BIC}(X, M) = \log P(X \mid M) - \lambda \cdot \#(M) \cdot \log N$$
The first term is the likelihood of the data given the model, whereas the second term
penalizes it in proportion to the number of parameters #(M) that the model uses and the size
N of the data on which the model was trained. The second term is called the complexity of
the model.
The BIC can be applied to indicate whether the two sets of feature vectors being compared
for similarity are drawn from the same distribution or from different distributions. To measure
similarity between blocks X1 and X2 the following hypotheses need to be compared: H0: The
feature vectors from X1 and X2 are from same distribution and H1: The feature vectors from
X1 and X2 are from separate distributions. Let the model for H0 be M which would be trained
on X i.e., X1 concatenated with X2 and let the models for H1 be M1 and M2 for X1 and X2
respectively. We define
$$\Delta\mathrm{BIC} = \mathrm{BIC}(X_1, M_1) + \mathrm{BIC}(X_2, M_2) - \mathrm{BIC}(X, M)$$
A positive value of ∆BIC suggests dissimilarity between the two blocks X1 and X2 and hence
indicates a speaker change between them.
Chen et al. built a completely unsupervised system using ∆BIC and their method has been
replicated a number of times for speaker/environment change detection [21] in the speaker
diarization domain. Improvements were made by [22] and [23] to make faster implementations
of change detection using the BIC approach by reducing the number of computations with some
compromise on accuracy.
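A minimal sketch of the ∆BIC computation with full-covariance Gaussian models follows (numpy only). The penalty form and λ vary between papers, and the sign convention is chosen here so that positive values indicate a change; the synthetic data are assumed 12-dimensional cepstral-like vectors.

```python
import numpy as np

def gauss_bic(X, lam=1.0):
    """BIC of a full-covariance Gaussian fit to X: maximized log-likelihood
    minus lam * (#parameters / 2) * log N (one common form of the penalty)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True)       # ML covariance estimate
    _, logdet = np.linalg.slogdet(cov)
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)
    n_params = d + d * (d + 1) / 2                  # mean + full covariance
    return loglik - lam * 0.5 * n_params * np.log(n)

def delta_bic(X1, X2, lam=1.0):
    """Positive values favour two separate models, i.e. a speaker change."""
    X = np.vstack([X1, X2])
    return gauss_bic(X1, lam) + gauss_bic(X2, lam) - gauss_bic(X, lam)

rng = np.random.default_rng(1)
same = delta_bic(rng.standard_normal((250, 12)), rng.standard_normal((250, 12)))
diff = delta_bic(rng.standard_normal((250, 12)), 4.0 + rng.standard_normal((250, 12)))
print(same < 0, diff > 0)   # True True
```

Two blocks drawn from the same distribution give a negative ∆BIC (the penalty for a second model outweighs the likelihood gain), while blocks from well-separated distributions give a clearly positive one.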
Metric based segmentation methods are implemented in two strategies, a fixed sliding window
Figure 3.2: BIC for change detection [20]
strategy and a growing window search strategy. In the former, there is a window of fixed size,
the centre of which is being inspected for a change [24]. If the feature vectors on either side of the
midpoint are better modeled by separate distributions, resulting in a higher distance between
distributions, the midpoint is declared as a change point. The size of the sliding window is
typically 5 s, and the two 2.5 s halves are compared for similarity. With a larger window
the two segments would be modeled better; however, the chance of missing a change increases
if multiple change points fall within the window under consideration, since the probability
distribution estimated on a segment may get contaminated.
When implementing the growing window search strategy, a single change is sought from
the start of the recording in a window of a certain initial size, generally about 5 s. If no change
is detected in this window, the window is enlarged and the search repeated in the new window.
Once a change is detected, the search is reset to start from that change point. The growing
window method has been reported with the BIC metric [21]. In recent years, the [25] and [26]
systems have replicated the growing window BIC segmentation followed by a BIC clustering
that merges only consecutive segments to reduce false alarm speaker changes.
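The fixed sliding window strategy can be sketched as follows (numpy only). A GLR-style log-determinant distance stands in for ∆BIC, and the mean-plus-two-sigma threshold is an illustrative choice, not taken from any of the cited systems.

```python
import numpy as np

def gauss_dist(X1, X2):
    """GLR-like distance between full-covariance Gaussians fit to each half:
    how much worse a single pooled Gaussian explains both halves together."""
    def logdet(X):
        return np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]
    n1, n2 = len(X1), len(X2)
    pooled = logdet(np.vstack([X1, X2]))
    return 0.5 * ((n1 + n2) * pooled - n1 * logdet(X1) - n2 * logdet(X2))

def sliding_window_changes(feats, half=250, step=50, thresh=None):
    """Slide a 2*half frame window; score each midpoint by the distance between
    the two halves; return midpoints whose score exceeds a threshold."""
    mids = range(half, len(feats) - half, step)
    scores = np.array([gauss_dist(feats[m - half:m], feats[m:m + half]) for m in mids])
    if thresh is None:
        thresh = scores.mean() + 2 * scores.std()
    peaks = [m for m, s in zip(mids, scores) if s > thresh]
    return peaks, scores

rng = np.random.default_rng(2)
feats = np.vstack([rng.standard_normal((1000, 12)),           # speaker A
                   3.0 + rng.standard_normal((1000, 12))])    # speaker B
peaks, _ = sliding_window_changes(feats)
print(peaks)   # peak(s) near the true change at frame 1000
```

The score rises as the window straddles the change at frame 1000 and drops once both halves are again homogeneous, which is exactly the peak behaviour described above.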
3.2.2 Model based segmentation
Model based segmentation methods train a GMM for every segmentation class. These GMMs
serve as state PDFs in a hidden Markov model (HMM) in which each state is connected to every
other state with equal transition probability. Viterbi decoding with this HMM yields a
segmentation of the audio recording. A major disadvantage of model based segmentation
is that the GMMs need to be known beforehand, and hence require external training data.
Figure 3.3: Sliding window search for speaker change detection. Distance is computed between two halves of the sliding window and plotted with time. Peaks in the distance indicate a change [23].
Figure 3.4: Growing window search for speaker change detection. Every search is for a single change point. Search is reset after the change is found [23].
However, when segmentation and clustering are performed iteratively, the output of the
speaker clustering algorithm provides a set of speaker segments as training data for the GMMs of
the next iteration, which refine the speaker segmentation. A pre-clustering is often performed to get
an initial grouping of audio segments [9], with each group loosely corresponding to one
speaker.
Often, model based segmentation methods have been used only as a post processing step to
achieve a refined segmentation [27]. Such model based techniques are common in speech
activity detection [25], where the acoustic change of interest is between speech
and nonspeech. In some telephone diarization systems [20], a pre-segmentation
of the audio recording based on bandwidth and gender has been performed using GMMs trained
for each of the 4 classes (2 bandwidths x 2 genders).
Model based segmentation methods are more common in meeting diarization systems.
With the advent of better clustering algorithms with i-vector speaker models, the focus has
shifted to performing speaker segmentation and speaker clustering separately for diarization.
However, use of the GMM-HMM framework for refinement is still popular [25].
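The Viterbi decoding step of the GMM-HMM framework can be illustrated with a minimal decoder (numpy only). Single 1-D Gaussians stand in for the per-state GMMs, and the sticky transition matrix is an assumed value that discourages sporadic state switches, as described above.

```python
import numpy as np

def viterbi(loglik, log_trans, log_init):
    """Viterbi decoding: loglik is (T, S) per-frame log-likelihoods of S states."""
    T, S = loglik.shape
    dp = np.full((T, S), -np.inf)
    bp = np.zeros((T, S), int)
    dp[0] = log_init + loglik[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans      # (from-state, to-state)
        bp[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + loglik[t]
    path = [dp[-1].argmax()]                       # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(bp[t, path[-1]])
    return path[::-1]

# Two 1-D Gaussian "class models" (means 0 and 3, unit variance); the sticky
# transitions enforce long runs in a single state.
rng = np.random.default_rng(3)
obs = np.r_[rng.normal(0, 1, 200), rng.normal(3, 1, 200)]
loglik = np.c_[-0.5 * obs ** 2, -0.5 * (obs - 3) ** 2]
log_trans = np.log(np.array([[0.99, 0.01], [0.01, 0.99]]))
path = viterbi(loglik, log_trans, np.log([0.5, 0.5]))
print(path[0], path[-1])   # 0 1
```

The decoded path stays in state 0 for the first half and state 1 for the second, with the transition located near the true boundary despite frame-level likelihood noise.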
3.3 Speech Activity Detection
The task of finding contiguous segments of speech in an audio recording and segregating them
from other types of sounds is called speech activity detection (SAD). It is beneficial for speech
processing systems since it is practical to process only speech segments rather than entire
recordings, making a design more efficient by saving computation time and resources. Apart from
the computational advantages, the absence of an SAD often causes insertion errors in ASR systems.
Hence speech activity detection is a fundamental task in almost all fields of speech processing:
coding, enhancement and recognition [8].
In speaker diarization, the error metric itself highlights the need for a speech activity detec-
tor since missed speech and false alarm speech are included in the diarization error rate metric.
Moreover, with limited speaker data from small speech segments, presence of non-speech con-
taminates the estimated speaker models thereby affecting the performance of the diarization
system. Initial approaches to diarization tried to let SAD be a by-product of the diarization
system [8] by letting nonspeech be a single cluster which would be discarded at the end. However
it was soon noticed that systems having an explicit SAD gave better results.
SAD is often performed using frame-wise classification. Statistical models are trained and
estimated on a feature space suited to discriminating the speech and nonspeech classes.
In most cases the statistical models are Gaussian mixture models and the features are cepstral.
Some works have reported the use of acoustic features such as energy [13], zero-crossing
rate [28] and spectral flux [14]. A few speech activity detectors that were
used in previous diarization competitions and campaigns for both the meeting and broadcast
news domain have been reviewed in subsections 3.3.1 & 3.3.2.
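The frame-wise classification just described can be sketched as follows (numpy only). Single diagonal Gaussians stand in for the class GMMs, the 11-frame median smoothing window is an assumed value, and the synthetic "speech" and "nonspeech" frames are placeholders for real cepstral features.

```python
import numpy as np

def fit_diag_gauss(X):
    """ML mean/variance of a diagonal Gaussian class model (a 1-component
    stand-in for the GMMs typically used)."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik_diag(X, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

def sad_classify(feats, speech_model, nonspeech_model, smooth=11):
    """Frame-wise speech/nonspeech decision followed by median smoothing to
    suppress sporadic label flips."""
    raw = (loglik_diag(feats, *speech_model) >
           loglik_diag(feats, *nonspeech_model)).astype(int)
    pad = smooth // 2
    padded = np.r_[[raw[0]] * pad, raw, [raw[-1]] * pad]
    return np.array([np.median(padded[i:i + smooth]) for i in range(len(raw))])

rng = np.random.default_rng(4)
speech = 2.0 + rng.standard_normal((300, 8))    # hypothetical speech frames
nonspeech = rng.standard_normal((300, 8))       # hypothetical nonspeech frames
models = fit_diag_gauss(speech), fit_diag_gauss(nonspeech)
labels = sad_classify(np.vstack([nonspeech[:150], speech[:150]]), *models)
print(labels[:150].mean() < 0.1, labels[150:].mean() > 0.9)   # True True
```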
3.3.1 Systems participating in NIST-RT evaluations
The NIST organised rich transcription evaluations that are now the current benchmark in
meeting diarization. These benchmarks consist of results obtained by four participant systems
[8]. Typically, 1-3% missed speech and 2-4% false alarm speech rates are the state of the art
in speech activity detection.
The SHoUT diarization toolkit [29] performs a bootstrap segmentation for SAD using
speech and nonspeech models pre-trained on a Dutch broadcast news dataset. It is followed
by an iterative classification using a Viterbi decoder on an HMM with 2 states representing speech
and nonspeech. The HMM allows controlling the minimum durations of speech and
nonspeech, thereby preventing sporadic transitions from one class to the other. The system uses
Figure 3.5: Model based SAD from SHoUT toolkit [29] using a GMM-HMM system
12 MFCCs concatenated with the zero crossing rate, plus their first and second derivatives, in a 39
dimensional feature vector. The system was used by ICSI [9] and LIA-Eurecom [12], although
the feature vectors used by the latter team for the iterative classification consisted of linear
frequency cepstral coefficients (LFCCs).
The UPC system [30] made use of modified support vector machines (Proximal SVMs) with
Gaussian kernels to segregate the speech and nonspeech in the audio. The modification allowed
for faster retraining of SVMs as suited for an iterative classification.
The IIR-NTU [13] system performed a bootstrap segmentation based on an energy derived
confidence score. An iterative classification with GMMs trained for speech and nonspeech, using
high confidence frames from the bootstrap segmentation, followed the initial segmentation to
refine the speech and nonspeech classes. The authors reported the use of Linear Prediction
Cepstrum Coefficients [13] for both the bootstrap segmentation and the iterative classification. This
approach was completely independent of external training data for the speech and nonspeech
models.
3.3.2 Broadcast news systems
In the LIUM diarization toolkit [25], the authors developed a model based segmentation system
for speech activity detection using an 8 state HMM with 2 states of silence (wide and narrow
band), 3 states of wide band speech (clean, over noise or over music), 1 state of narrow band
speech, 1 state of jingles, and 1 state of music. Each state is modeled with a 64-component
GMM over MFCCs, their deltas and delta-deltas. All the models were trained on extensive
per-class data from the ESTER1 dataset. This system resulted in 1.1% false alarm speech and
3.9% missed speech on the dev0 subset of the REPERE corpus. Results on the ESTER2 and
ETAPE databases are reported in [25]. Besides speech activity detection, the LIUM toolkit
also performs gender and bandwidth detection, again via model based segmentation,
with 128-component diagonal GMMs for each of the 4 classes (2 genders x 2 bandwidths) and
feature warping.
The Albayzin 2010 campaign saw five competing systems. The best results for SAD were
reported by [14]. Although the DER was much worse than others (55% DER), their SAD error
stood best at 3.4% (1.1% missed and 2.3% false alarm). The authors reported using multi-layer
perceptrons instead of GMMs to model the emission probabilities of a 5 state hybrid NN-HMM
system. The feature space was also expanded: 16 MFCCs were concatenated with 8 other audio
features - energy, zero-crossing rate, spectral centroid, spectral roll-off, maximum normalized
correlation coefficient and its frequency, harmonicity measure and spectral flux. Information
regarding other participating systems in the Albayzin campaign is mentioned in [10].
In the REPERE 2012-2014 evaluations, three consortia took part - SODA, QCompere and
PERCOL [31]. The SODA consortium used the LIUM toolkit described above. The QCompere system had a
4 state HMM similar to that of the LIUM toolkit, with one state each for speech, silence, noise and music [32],
modelled by GMMs of size 64. The PERCOL system [33] performed a 3 class GMM based SAD.
Interestingly, their 3 classes were non-speech, overlapping speech and non-overlapping speech,
each modeled by 256-component GMMs trained on the ETAPE corpus. The overlap detection
reportedly also improved the DER over the baseline clustering system.
3.4 Clustering
Clustering is a common problem in statistical data analysis. It has been addressed in many
scientific fields right from exploratory data mining to community detection in social networks. It
is the process of grouping a set of objects such that objects in each group, called a cluster, are more
similar to each other than they are to objects in other groups or clusters. The objects could be
points in a vector space or even statistical models. The similarity mentioned above is a distance-like
measure defined between the objects by the user. The word similarity is used because the
measure defined need not satisfy all the properties of a metric, viz. non-negativity, the triangle
inequality and symmetry. The words similarity and distance are used interchangeably
here, with smaller distance meaning greater similarity and vice versa.
The process of clustering is generally translation invariant, and hence the relative position
of the objects in their space is more relevant than the objects themselves. Indeed, this
relative position of the objects is indicative of pairwise similarity. For the problem of speaker
diarization, the aim is to perform clustering of segments of audio based on the active speaker in
each segment. Each cluster should ideally represent a single speaker.
The dimensionality of the spectrogram of a single segment is large, and comparison between
segments based on their spectrograms is not computationally viable. Hence each segment
needs to be represented in a low dimensional space in order to compare the similarity of speaker
information. A few speaker models that have been utilized in the past in the fields of speaker
verification and speaker recognition are reviewed in section 3.4.1. Every speech segment
is given a representative vector or a statistical model that characterizes its speaker
information. Clustering is performed on these speaker models. Section 3.4.2 reviews a traditional
clustering algorithm and a state-of-the-art algorithm based on a graphical approach.
3.4.1 Speaker models for clustering
Since speaker diarization needs to capture speaker information from audio segments, speaker
models commonly used in speaker verification and speaker recognition are adopted. The two
main speaker models, GMMs and i-vectors, are explained below. Of these, i-vectors have
recently become the state of the art in speaker verification tasks.
Gaussian Mixture Models
Gaussian Mixture Models (GMMs) of cepstral features are often used to model speakers. A
Gaussian mixture model is a popular tool for modeling multi-modal data and has the following
form:
$$p(x) = \sum_{i=1}^{N} w_i \,\mathcal{N}(x;\, \mu_i, \Sigma_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1 \tag{3.1}$$
Since segment durations can be small, the number of feature vectors available from
a segment is sometimes insufficient to estimate a full Gaussian mixture model. To overcome
this problem, a pre-trained Universal Background Model (UBM) is adapted to the segment to
obtain its speaker model [34]. The UBM is a comprehensive model of data from multiple
speakers combined, capturing the variability of speech. For GMMs of cepstral features,
different statistical similarity measures have been investigated, such as the symmetric
KL divergence, the normalized cross likelihood ratio (NCLR) etc. [35]. The KL divergence is an
information theoretic measure of how different two probability distributions are from each
other, while the cross likelihood ratio compares P(X1|M2) and P(X2|M1) (equations 3.2 & 3.3).
$$\mathrm{CLR}(X_1, X_2) = \log \frac{P(X_1 \mid M_1)}{P(X_1 \mid M_2)} + \log \frac{P(X_2 \mid M_2)}{P(X_2 \mid M_1)} \tag{3.2}$$
$$\mathrm{NCLR}(X_1, X_2) = \frac{1}{|X_1|} \log \frac{P(X_1 \mid M_1)}{P(X_1 \mid M_2)} + \frac{1}{|X_2|} \log \frac{P(X_2 \mid M_2)}{P(X_2 \mid M_1)} \tag{3.3}$$
where Mi is the model estimated on Xi. As we can see, if the feature vectors of segments X1 and
X2 come from the same speaker, X1 fits the model of segment X2 well, so the cross likelihood
increases, decreasing the distance.
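A sketch of the NCLR of equation 3.3 follows (numpy only). Single full-covariance Gaussians stand in for the UBM-adapted GMMs used in practice, and the synthetic segments are assumed 6-dimensional cepstral-like vectors.

```python
import numpy as np

def gauss_fit(X):
    """ML Gaussian segment model (mean, regularized full covariance)."""
    return X.mean(axis=0), np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])

def gauss_loglik(X, model):
    """Total log-likelihood of all frames in X under a Gaussian model."""
    mu, cov = model
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + quad).sum()

def nclr(X1, X2):
    """Normalized cross likelihood ratio (eq. 3.3);
    larger values indicate less similar speakers."""
    M1, M2 = gauss_fit(X1), gauss_fit(X2)
    return (gauss_loglik(X1, M1) - gauss_loglik(X1, M2)) / len(X1) \
         + (gauss_loglik(X2, M2) - gauss_loglik(X2, M1)) / len(X2)

rng = np.random.default_rng(5)
a1, a2 = rng.standard_normal((200, 6)), rng.standard_normal((200, 6))  # same "speaker"
b = 2.5 + rng.standard_normal((200, 6))                                # different "speaker"
print(nclr(a1, a2) < nclr(a1, b))   # True: the same-speaker pair scores lower
```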
Recent experiments in the speaker verification and speaker recognition fields noted that the means
of the Gaussians in a GMM carry most of the speaker-related information. Owing to the high variability
of the covariance matrices and mixture weights across utterances [34], these are not reliable indicators
of speaker information. Hence, instead of calculating the above likelihood scores, the means of a GMM are
concatenated to get a single vector (called the GMM supervector) in a high dimensional vector
space. Distance measures such as the cosine distance and Mahalanobis distance [36] have been
investigated on this space. To make a comparison between two GMM supervectors they need to
be adapted from the same UBM to make sure that the corresponding mean vectors of the GMM
are being compared between segments. The adaptation algorithm is detailed in [34].
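The supervector comparison can be sketched as follows (numpy only). The UBM means and per-speaker shifts are synthetic stand-ins for MAP-adapted models; sizes (16 components, 12-dimensional means) are assumed for illustration.

```python
import numpy as np

def supervector(means):
    """Concatenate the GMM component means into a single supervector."""
    return np.concatenate(means)

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(6)
ubm = rng.standard_normal((16, 12))                # 16 components, 12-dim means
shift_a, shift_b = rng.standard_normal((2, 16, 12))
sv_a1 = supervector(ubm + 0.1 * shift_a)           # speaker A, utterance 1
sv_a2 = supervector(ubm + 0.1 * shift_a            # speaker A, utterance 2
                    + 0.02 * rng.standard_normal((16, 12)))
sv_b = supervector(ubm + 0.1 * shift_b)            # speaker B
print(cosine_distance(sv_a1, sv_a2) < cosine_distance(sv_a1, sv_b))   # True
```

Because every supervector is adapted from the same UBM, corresponding mean positions align, and the same-speaker pair sits closer in cosine distance than the cross-speaker pair.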
I-vector representation
The concept of i-vectors was first introduced in speaker verification as a feature extraction
from GMMs to reduce the dimensionality of the GMM hyperparameters. With UBM sizes
of the order of 512, 1024 or even 2048 Gaussians in some GMM systems, the supervector
becomes too large for further computation. Instead, using factor analysis to reduce the
dimensionality of the supervector leads to a new representative vector with a few hundred
dimensions. The corresponding subspace, called the total variability subspace, is
hypothesized to contain the spectral information of the speaker and the background.
$$m = M + Tx \tag{3.4}$$
where m is the mean adapted supervector of the utterance for which the i-vector x is sought. M
is the mean supervector of the UBM. The matrix T is a tall low rank matrix representing the
total variability subspace which needs to be learned on a training dataset. Although supervectors
typically have tens of thousands of dimensions, this representation constrains all supervectors
to lie in an affine subspace of the supervector space. The dimension of the affine subspace is at
most a few hundred.
i-vector extraction requires speaker-labeled training data with multiple utterances per speaker,
with possible variations across utterances in phonetic balance and background
noise. The training algorithm for the total variability subspace [36] and the i-vector extraction
from the Baum-Welch statistics of the utterance have been implemented in the MSR Identity
toolkit [37].
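The factor-analysis model of equation 3.4 can be illustrated numerically. Note that the least-squares recovery below is only an illustration of the affine model: real i-vector extraction computes the posterior mean of x given zeroth- and first-order Baum-Welch statistics, not this shortcut, and the dimensions and random T are assumed values.

```python
import numpy as np

rng = np.random.default_rng(7)
D, R = 1000, 50                      # supervector dim, total-variability rank
M = rng.standard_normal(D)           # UBM mean supervector
T = rng.standard_normal((D, R))      # total variability matrix (random here)
x_true = rng.standard_normal(R)      # latent speaker/channel factor
m = M + T @ x_true                   # adapted supervector of an utterance

# Recover x by least squares; this works exactly because the model is
# noiseless, and serves only to show that m - M lies in the span of T.
x_hat, *_ = np.linalg.lstsq(T, m - M, rcond=None)
print(np.allclose(x_hat, x_true))    # True
```

The example makes concrete the claim above: all supervectors generated by the model lie in an affine subspace of at most R dimensions around M.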
3.4.2 Clustering Algorithms
Given the similarity matrix between the speaker GMMs or i-vectors, a clustering algorithm aims
at reaching the best set of clusters, with minimum intra-cluster variance and maximum inter-cluster
variance. We will look at two clustering algorithms used previously in diarization – (i)
hierarchical agglomerative clustering (HAC) and (ii) integer linear program (ILP) clustering,
Figure 3.6: Hierarchical agglomerative clustering
which use different solving techniques and also have different criteria for arriving at the best
set of speaker clusters.
Hierarchical Agglomerative Clustering
HAC is a greedy algorithm, i.e. it makes a locally optimal choice at each stage in the hope of
finding the global optimum. In an iterative process, the 2 most similar clusters are merged into a
single cluster, so the number of clusters reduces by 1 at each step. This continues
until only one cluster remains. While merging 2 clusters, the data from the segments corresponding
to the 2 clusters is concatenated and a single speaker model is re-calculated on it. The distances
of every other cluster to this newly formed cluster are re-calculated to update the similarity
matrix for the next step.
Step 0: Calculate the similarity matrix d(Xi, Xj).
Step 1: Find i* and j* such that i* ≠ j* and d(Xi*, Xj*) = min over i,j of d(Xi, Xj).
Step 2: (Merge step) Replace Xi* and Xj* by a single object Xk*, with k* = min(i*, j*).
Step 3: Update the similarity matrix d(Xi, Xk*).
Step 4: If the number of clusters > 1, go to Step 1.
Step 5: Choose the best set of clusters using an optimality criterion.
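The steps above can be sketched as follows (numpy only). Average linkage and a simple distance-threshold stopping rule are one of several possible choices; the toy 1-D distances are assumed data.

```python
import numpy as np

def hac(dist, merge_thresh):
    """Greedy agglomerative clustering on a precomputed distance matrix.
    Repeatedly merges the closest pair (average linkage) until the smallest
    remaining distance exceeds merge_thresh; returns a label per item."""
    n = dist.shape[0]
    labels = list(range(n))
    clusters = {i: [i] for i in range(n)}
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)
    while len(clusters) > 1:
        keys = list(clusters)
        sub = d[np.ix_(keys, keys)]
        i, j = np.unravel_index(sub.argmin(), sub.shape)
        ci, cj = keys[i], keys[j]
        if d[ci, cj] > merge_thresh:           # optimality/stopping criterion
            break
        clusters[ci] += clusters.pop(cj)       # merge step
        for ck in clusters:                    # average-linkage distance update
            if ck != ci:
                d[ci, ck] = d[ck, ci] = np.mean(
                    [dist[a, b] for a in clusters[ci] for b in clusters[ck]])
    for lab, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = lab
    return labels

# Toy distances: items 0-2 lie close together, 3-4 close together, groups far apart.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
dist = np.abs(pts[:, None] - pts[None, :])
print(hac(dist, merge_thresh=1.0))   # [0, 0, 0, 1, 1]
```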
The optimal set of clusters is chosen based on an optimality criterion. One optimality
criterion is to choose the set of clusters where the minimum inter cluster distance is greater
than a threshold. Another criterion was proposed by Nguyen [38]: among the sets of clusters
from all iterations, the one where the histograms of intra-cluster distances
and inter-cluster distances are farthest from each other is chosen.
$$\arg\max_k \; \min_{i \neq j} d\big(X_i^{(k)}, X_j^{(k)}\big) \;\geq\; \theta \tag{3.5}$$
where d(X_i^{(k)}, X_j^{(k)}) is the similarity matrix in the k-th iteration, and
$$\arg\max_k \; \frac{\big| m_{\mathrm{inter}} - m_{\mathrm{intra}} \big|}{\sqrt{\dfrac{\sigma_{\mathrm{inter}}^2}{n_{\mathrm{inter}}} + \dfrac{\sigma_{\mathrm{intra}}^2}{n_{\mathrm{intra}}}}} \tag{3.6}$$
where m_inter, σ_inter and n_inter are the mean, standard deviation and number of elements of the
inter-cluster distances, and similarly for the intra-cluster distances.
The ICSI system [9] performed an initial clustering of segments of 1s duration using prosodic
long term features. This segmentation was then refined iteratively in a GMM-HMM framework
over MFCC feature vectors derived from segments within each cluster. Each state of the HMM
represented a speaker and was modelled by a GMM. The use of the HMM allowed adding a
constraint for obtaining contiguous speech turns of 2s. In each iteration, the number of clusters
was reduced by merging state GMMs in a HAC using the BIC distance. The IIR-NTU system [13]
used LPCCs to form 30 initial clusters by uniformly dividing the audio, and then iteratively
used the same GMM-HMM framework to perform HAC with the CLR distance. The LIA-Eurecom
system [12] used the same framework, although with a top-down approach
(also called divisive clustering) instead of HAC, splitting states starting from
a single state. Many other systems have been implemented that use the HAC approach to
clustering with different distances [19, 39].
i-vector speaker models adopted from speaker verification were first used in speaker diarization
of telephone conversations, where the number of speakers in the recording is known a
priori; hence k-means clustering of i-vectors was performed to arrive at 2 clusters
[40, 41]. Later, an HAC-like clustering of i-vectors for broadcast news was reported in [42],
which demonstrated better performance over the traditional BIC based GMM HAC architecture.
ILP based clustering
In an effort by Rouvier and Meignier in 2012, a global optimization approach was proposed to
perform speaker clustering [43] using an integer linear program (ILP). Clustering is posed as a
combinatorial optimization problem on a complete graph (each node connected to every other
node). The speaker segments are considered as nodes of the graph and the incidence matrix is
23
Chapter 3. State of the Art in Speaker Diarization 24
the similarity matrix. The integer linear program to find the optimal clustering is a variation
of the k-centres problem. In simple words, the k-centres problem is to choose K cities out of N
for building warehouses so that the worst case distance between a city and its closest warehouse
is minimized. The ILP is adapted for unsupervised speaker diarization since the number of
speakers (K) in not known a priori.
With the introduction of ILP clustering in the broadcast news domain, diarization systems
now typically perform segmentation and clustering separately [10], followed by a
post-processing step of Viterbi decoding in the GMM-HMM framework. In 2014, Meignier et al.
[44] improved the ILP formulation by removing redundant constraints, making the clustering
extremely fast. In recent years, ILP-based methods have therefore overtaken traditional
HAC-based clustering in both speed and accuracy. The LIUM toolkit reported a 17.19% DER
with GMM-based CLR clustering and 15.46% DER with i-vector-based ILP clustering on the
REPERE corpus [44].
Summary
In this chapter, background concepts used in segmentation, clustering and speech activity
detection were presented, and previously implemented techniques from the speaker
diarization literature were reviewed. Other than the speaker diarization task, where
neither the number of speakers nor the presence of specific speakers is known a priori,
many specialized audio indexing tasks have been investigated in the past: explicitly
detecting the presence of music [45], finding the structure of a broadcast program [46], or
locating commercials to eliminate unwanted audio [47]. In another work, fast speaker change
detection was applied using speech transcriptions [48]. More generic systems specialized
for different domains include the Alize speaker recognition toolkit, which has a
diarization sub-block, and the SHoUT toolkit [29], which was designed for meeting
diarization. INRIA, IDIAP, and DiarTK are some other diarization toolkits still under
development.
Chapter 4
System Description and Evaluation
A complete end-to-end system has been developed in MATLAB that performs speaker diarization
of audio recordings. The system has been tested and evaluated on broadcast news recordings
and debate audios from two corpora, using the evaluation metrics described in Chapter 2.
This chapter gives a complete description of the system, from feature extraction to error
calculation. The choice of models and parameters is explained in the evaluation sections
for each subsystem. The parameters were optimized for broadcast news using the NDTV dataset
as the development set. The system has also been tested on the REPERE corpus for comparison
with broadcast diarization systems previously reported in the literature.
Meignier et al. [49] compared two approaches to diarization and discussed the pros and cons
of step-by-step and integrated systems. In the step-by-step approach, diarization is
performed in a single pass through the speech activity detector, the speaker segmenter and
the speaker clustering subsystems. In the integrated approach, information from the
clustering algorithm is used to refine the speaker segmentation, and clustering is
performed iteratively. With the advent of better clustering algorithms using i-vector
speaker models, the focus has shifted to performing speaker segmentation and speaker
clustering separately. However, use of the GMM-HMM framework for refinement is still
popular [25]. Figure 4.1 shows a block diagram of the proposed system, which follows the
step-by-step approach.
Figure 4.1: Block diagram of proposed MATLAB system
4.1 Feature Extraction
MFCC features have been used repeatedly in diarization research for all subsystems,
including speech activity detection, speaker segmentation and speaker clustering. The
proposed system uses the first 19 MFCCs in all three subsystems. In addition, short time
energy and zero crossing rate, with their first and second order derivatives, are used in
the SAD. The speaker segmentation subsystem uses only the 19 MFCCs and short time energy,
whereas the clustering also uses their first and second order derivatives. The analysis
windows are 30 ms long with a 20 ms overlap between consecutive frames (i.e. a 10 ms hop).
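As a concrete illustration of this frame arithmetic, the following NumPy sketch slices a signal into 30 ms frames with a 10 ms hop and computes the short time energy and zero crossing rate used later by the SAD. The thesis system itself is in MATLAB; the function names here are illustrative only.

```python
import numpy as np

def frame_signal(x, fs, win_ms=30, hop_ms=10):
    """Slice a signal into overlapping analysis frames.

    A 30 ms window with a 20 ms overlap corresponds to a 10 ms hop,
    i.e. 100 frames per second of audio.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

def short_time_energy(frames):
    # Mean squared amplitude per frame.
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # Fraction of consecutive samples that change sign.
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
```

At 16 kHz, one second of audio yields 98 complete 480-sample frames on this grid.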
4.2 Speech Activity Detection
Speech activity detection is the task of separating speech from nonspeech in an audio
recording. When designing a speech activity detector as a precursor to diarization, the
challenge is two-fold: achieving (i) minimum missed speech and (ii) minimum false alarm
speech. The percentage of speech misclassified as nonspeech by the SAD is called the missed
speech rate (MSR), whereas the percentage of nonspeech misclassified as speech is called
the false alarm speech rate (FASR); these two are the evaluation metrics for SAD.
State-of-the-art systems typically achieve 1-3% missed speech and 2-4% false alarm speech.
If the diarization system acts as a precursor to systems such as ASR or keyword spotting,
missed speech leads to deletion errors in the ASR system, while false alarm speech can lead
to insertion errors. For SAD as a block within diarization, false alarm speech often
contaminates the speaker models and affects both segmentation and clustering.
A common approach in speech activity detection is to attempt to classify all types of sounds
that are present in the recording. If the data being diarized is known beforehand and has some
peculiarities, such as audio indicators for marking sections in an episode, it becomes possible
to train statistical models for these markers in the data and classification is straightforward [2].
Unfortunately while developing generic systems, we do not have the luxury of possessing such
markers or sound effects to be expected in the input audio. In this section, the problem of
estimating non-speech models for SAD with no prior information about the data is addressed.
Sounds other than speech that often occur in audio databases are either human-produced,
such as fillers, lip-smacks, laughter and clapping, or instrumental, such as music and
jingles. Silence regions and pauses taken by speakers also form significant portions of
most audio databases. Together these sounds form the nonspeech category. Model-based
approaches are popular in SAD, where statistical models are trained for speech and
nonspeech from external
data. The drawback of such systems, however, is their sensitivity to the acoustic
conditions of out-of-sample data. Hybrid systems [9, 12] use a classifier trained on
external data to obtain an initial bootstrap segmentation; the speech and nonspeech models
are then refined iteratively over the audio being segmented, adapting to the acoustic
variations in nonspeech.
The bootstrap classification provides a subset of feature vectors of the recording being pro-
cessed that best represent the speech and nonspeech. Class models are initialized over these
token subsets on the feature space and frame-wise iterative classification is performed to refine
the classes. The bootstrap segmentation is generally performed on a smaller set of features which
are chosen based on heuristic information known about the speech and nonspeech classes.
4.2.1 Speech Activity Detection Algorithm
The speech activity detector in the proposed system is a model based classifier. It is independent
of external training data for modeling the nonspeech and speech classes. The approach to such
a model based speech activity detector is inspired by the SAD in the IIR-NTU submission to the
NIST RT2009 evaluations [13]. In our system speech activity detection is done in two decoupled
steps. First, silence is removed from the whole recording using an energy based bootstrapping
followed by iterative classification. In the second step, music and other audible nonspeech are
identified from the recording. For music removal the silence removed audio is fed to a music vs.
speech bootstrap discriminator. The frames of the audio which are music with a high confidence
are used to train a music model, which is iteratively refined. In both steps, only segments
of duration 1 s or longer are labeled as nonspeech, in order to avoid sporadic
nonspeech-to-speech transitions. This constraint is incorporated in [25] and [29] using a
GMM-HMM framework.
Silence Removal
Silence removal in the proposed system uses the 19 MFCC features concatenated with short
time energy (STE) and their first and second derivatives. A bootstrap segmentation assigns
each frame a confidence value for both the silence and speech classes. The bootstrap
silence model is a Gaussian mixture of size 4 trained over the 60-dimensional feature
space; a speech model of the same size is trained from high-confidence speech frames.
In an iterative classification step, each frame is classified into two classes viz. speech and
silence. The high confidence speech and silence frames from these are used to train the speech
and silence models for the next iteration. As the iterations proceed, the number of
60-dimensional Gaussians used to model the speech and silence GMMs is increased up to a
maximum.

Figure 4.2: Silence removal using energy based bootstrapping and iterative classification

The best results were obtained when the size of the Gaussian mixtures was capped at 32
for speech and 16 for nonspeech. This removes silences and pauses, but high-energy
nonspeech (also called audible nonspeech), such as jingles and music, is classified as
speech, since the MFCCs and frame energy of music resemble speech more than silence.
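The iterative refinement described above can be sketched as follows. This Python stand-in models each class with a single full-covariance Gaussian instead of the growing GMMs of the actual MATLAB system, and assumes boolean bootstrap masks are already available; all names are illustrative.

```python
import numpy as np

def iterative_two_class(features, boot_a, boot_b, n_iter=5):
    """Iteratively refine a two-class frame labelling (0 = class a, 1 = class b).

    boot_a / boot_b are boolean masks of high-confidence bootstrap frames
    for the two classes (e.g. silence and speech).  Each class is modelled
    by a single full-covariance Gaussian -- a simplified stand-in for the
    GMMs of growing size used in the thesis system.
    """
    def fit(X):
        # Maximum-likelihood Gaussian with a small ridge for stability.
        return X.mean(axis=0), np.cov(X.T) + 1e-6 * np.eye(X.shape[1])

    def loglik(X, mu, cov):
        d = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (np.einsum('ij,jk,ik->i', d, inv, d) + logdet)

    Xa, Xb = features[boot_a], features[boot_b]
    labels = boot_b.astype(int)
    for _ in range(n_iter):
        (ma, ca), (mb, cb) = fit(Xa), fit(Xb)
        labels = (loglik(features, mb, cb) > loglik(features, ma, ca)).astype(int)
        Xa, Xb = features[labels == 0], features[labels == 1]
        if len(Xa) < 2 or len(Xb) < 2:   # one class collapsed; stop early
            break
    return labels
```

Starting from a handful of bootstrap frames per class, each pass relabels every frame by likelihood and retrains both models on the enlarged subsets.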
Music removal
In 2005, [28] introduced a model-fitting based music vs. speech classifier with a reported
classification accuracy of 95%. The authors pre-segmented the audio into chunks of 1 s and
extracted 50 feature vectors over 20 ms windows. These feature vectors were
two-dimensional: (i) short time energy (STE) and (ii) zero crossing rate (ZCR) of the
windowed signal. A histogram of STE and ZCR is computed for each 1 s chunk and compared
with model histograms of speech and music derived from a large database of music and speech
data; these ideal histograms were modeled with χ² distributions. The chunk is labeled music
or speech after comparing its histogram with the χ² models.
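A minimal sketch of the histogram comparison, assuming the model histograms for music and speech are given. Here a symmetric χ² histogram distance stands in for the χ²-distribution model fitting of [28]; function names are illustrative.

```python
import numpy as np

def chi2_distance(h, g, eps=1e-12):
    """Symmetric chi-squared distance between two normalised histograms."""
    h = h / (h.sum() + eps)
    g = g / (g.sum() + eps)
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def classify_chunk(zcr_values, music_hist, speech_hist, bins):
    """Label a 1 s chunk by comparing its ZCR histogram to the class models."""
    h, _ = np.histogram(zcr_values, bins=bins)
    d_music = chi2_distance(h, music_hist)
    d_speech = chi2_distance(h, speech_hist)
    return 'music' if d_music < d_speech else 'speech'
```

The same comparison can be applied to the STE histogram; the chunk label follows the closer of the two class models.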
The music speech discriminator [28] fails when speech and music are present together. In
news broadcast archives, it is often the case that the most information dense parts of the archive
such as episode headlines have a characteristic background music that is specific to the show.
Taking this into consideration, porting the system as is would not just result in missed speech,
but would cause loss of highly informative speech data. Hence we use the output of the classifier
as a bootstrap segmentation. Initial estimate models for music and speech are trained from
high confidence frames of both classes. An iterative classification similar to the silence removal
system is done to refine the speech and music classes so as to discard music only segments. The
features used are the 19 MFCCs concatenated with the zero crossing rate and their first and
second derivatives; short time energy is not used during the iterative classification step.
It was observed that only after neglecting short time energy was the speech with background
music, initially classified as music, recovered to the speech class.
4.2.2 Confidence measures for Speech Activity Detection
During silence removal, a histogram of frame energies is used to rank all frames. The
frames with the 20% lowest energies are taken as high-confidence silence frames, whereas
the frames with the 10% highest energies are high-confidence speech. In every iteration,
only these frames are used to train the GMMs. For music removal, the aim is to exclude
frames that contain speech with background music but were classified as nonspeech; hence
only the 40% of frames with the highest zero crossing rate are taken from the ZCR histogram
as high-confidence music frames to train the music model.

Figure 4.3: Music removal using a music-speech discriminator for bootstrapping
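The percentile selection can be sketched directly with NumPy; the function name and packaging are illustrative, not the thesis code.

```python
import numpy as np

def confidence_masks(energy, zcr, low_pct=20, high_pct=10, zcr_pct=40):
    """Pick high-confidence bootstrap frames from feature histograms.

    Frames in the lowest 20% of energy are confident silence, the highest
    10% confident speech; for music removal, the top 40% of ZCR frames
    seed the music model.
    """
    sil = energy <= np.percentile(energy, low_pct)
    sp = energy >= np.percentile(energy, 100 - high_pct)
    mus = zcr >= np.percentile(zcr, 100 - zcr_pct)
    return sil, sp, mus
```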
4.3 Evaluation of Speech Activity Detection
The NDTV dataset was used as a development set to tune parameters for both silence and music
removal. These parameters were used to obtain results on the REPERE dataset as well.
This section presents results from three experiments: the effect of the GMM size in the
iterative classification step of the SAD, the shape of the GMM covariance matrix, and the
effect of cascading the two subsystems. Errors were computed using the pyannote library [5]
in Python. The missed speech rate (MSR) is the percentage of the audio duration for which
speech was misclassified as nonspeech, while the false alarm speech rate (FASR) is the
percentage misclassified as speech. The SAD error is the sum of MSR and FASR.
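On a common frame grid, the two rates can be computed as below; this is an illustrative frame-level sketch, not the pyannote implementation used in the evaluation.

```python
import numpy as np

def sad_error(ref, hyp):
    """Frame-level SAD metrics.

    ref and hyp are boolean arrays (True = speech) on a common 10 ms
    frame grid.  Rates are percentages of the total audio duration, so
    SAD error = MSR + FASR.
    """
    ref = np.asarray(ref, bool)
    hyp = np.asarray(hyp, bool)
    total = len(ref)
    msr = 100.0 * np.sum(ref & ~hyp) / total   # speech called nonspeech
    fasr = 100.0 * np.sum(~ref & hyp) / total  # nonspeech called speech
    return msr, fasr, msr + fasr
```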
4.3.1 Evaluation on the NDTV dataset
Since the silence and music removal blocks are decoupled, the system parameters are first
chosen for the former, and its output is fed to the music removal. The evaluation of the
music removal is therefore done separately.
Size of GMM in Iterative Clustering during Silence Removal
After the bootstrap segmentation, the models trained from the limited speech and nonspeech
frames are GMMs of size 4. As the amount of data available to each class increases, the
size of the GMMs modeling speech and silence is doubled in every iteration until it reaches
a maximum. In this experiment, the best-suited GMM size is chosen for each class. The table
shows % SAD error over 22 episodes of the NDTV dataset. The best results were obtained with
a 32-component GMM for speech and a 16-component GMM for silence. For the combination
(32, 64) the iterations did not converge and gave a high MSR in every iteration; a possible
explanation is that the silence model became a mixed model, with some Gaussian components
representing speech frames and others silence frames. For the combination (4, 64) the FASR
was very high, since of the two competing models the speech model always yielded a higher
likelihood for most frames and hence captured silence through some of its components.
Table 4.5: Results for REPERE with 60 hours annotated
4.4 Speaker Segmentation
The speaker segmentation algorithm in the proposed system is a growing window search [21]
using the ∆BIC distance, as shown in Figure 3.4. Starting from the beginning of the audio,
a search is made for a single speaker change; at every change found, the search is
restarted from the next frame. The search window is initialized to 5 s and a ∆BIC value is
computed for each frame in the window. If the maximum of this array exceeds a threshold θ, a
change is declared at that point. If no such maximum is found in the window, the window is
enlarged by 2 s and the procedure repeats until a change is detected. Only speech frames
are processed, after discarding the nonspeech indicated by the speech activity detector;
the change points found in the speech frames are then mapped back to their corresponding
locations in the original audio.
In two previous broadcast diarization toolkits, [25] and [29], segmentation is carried out
in two steps: first, ∆BIC-based change detection is performed as above with a threshold of
0, and then consecutive segments with a positive ∆BIC score are merged. The two steps are
needed because the zero-threshold ∆BIC segmentation oversegments the audio. To avoid the
two-step process, only maxima greater than a threshold θ are accepted, which significantly
reduces oversegmentation.
∆BIC(x_i) = N log|Σ| − N₁ log|Σ₁| − N₂ log|Σ₂| − (λ/2)(d + d(d+1)/2) log N    (4.1)
For speaker segmentation, the 19 MFCCs with short time energy have been used. The
segmentation algorithm scales as O(d^6), where d is the number of feature dimensions; hence
most speaker segmentation systems [21, 25, 29] do not use derivatives of cepstral features
during segmentation.
4.5 Choice of Segmentation parameters
The segmentation parameters were tuned on the NDTV dataset to minimize the diarization
error rate. The parameters tuned were θ and λ in equation 4.1, and their effect on the DER
was studied in two experiments. First, the DER was calculated in an ORACLE experiment [29],
in which the system is given the ground-truth speech activity detection and clustering, and
only the speaker segmentation output of the system is used. These experiments were inspired
by those performed for the SHoUT toolkit. This tests the system for missed speaker changes
only, since the short segments caused by false alarm speaker changes get labeled correctly
in the ORACLE.
            λ = 1    λ = 10
θ = 0       0.89     1.24
θ = 1000    1.40     2.37
θ = 2000    2.55     3.75

Table 4.6: DER (%) with ORACLE experiment
False alarm speaker changes increase the number of segments and hence reduce the average
segment size. To test their effect, the best clustering algorithm (ILP clustering with
i-vectors) was used and the DER was calculated. It was observed that low values of θ and λ
result in oversegmentation. As θ is increased, the average segment duration increases,
which enables better speaker modeling for the segments and results in a lower DER when
combined with the best clustering algorithm.
            λ = 1    λ = 10
θ = 0       31.52    33.41
θ = 1000    23.67    16.54
θ = 2000    12.35    16.59

Table 4.7: DER (%) with best clustering algorithm
Hence the combination of θ = 2000 and λ = 1 is the default in the proposed system.
4.6 Speaker Clustering
After the speaker changes have been detected by the speaker segmentation, the speaker
clustering subsystem gathers together segments from the same speakers. Each segment is
represented by a speaker model, pairwise similarities are computed between all speaker
models, and a clustering algorithm groups the segments.
4.6.1 Choice of speaker model
The proposed system implements two speaker models widely studied in speaker verification
and speaker recognition: (i) Gaussian mixture models and (ii) i-vector models. The GMM
(equation 3.1) is a probabilistic model on the feature space; the features used here are
short time energy concatenated with the 19 MFCCs and their first and second derivatives,
giving a 60-dimensional feature space. The similarity between GMMs is based on the cross
likelihood of the model of one segment fitting the data of the other. The GMM for a segment
is trained on its feature vectors using the Expectation-Maximization algorithm, yielding a
diagonal-covariance GMM of size 32. When evaluating the system with GMM speaker models, the
CLR and NCLR distances (equations 3.2 and 3.3) were tested with both the HAC and ILP
clustering algorithms.
i-vectors are vectors in Rⁿ, where n is of the order of 100. The similarity measures are
those used for Euclidean vectors, viz. the cosine distance, the Mahalanobis distance, etc.
The i-vector is derived from the GMM supervector¹ of the segment after a dimensionality
reduction using factor analysis [36]. The components of the i-vector are the speaker
factors of the eigenvoice vectors of the Total Variability space. The i-vector extraction
process is explained below.
i-vector extraction
To obtain the i-vectors, a speech Universal Background Model (UBM) is first trained on
training data. The UBM is a GMM with a large number of Gaussians, so that it captures all
possible variabilities of speech in the feature space. In the proposed system, the TIMIT
and TIFR datasets have been used for UBM training: the TIMIT set consists of 168 speakers
uttering 10 English sentences each, while the TIFR set consists of 100 speakers uttering 10
Hindi sentences each, both from native speakers of the respective languages. The UBM is a
diagonal-covariance GMM of size 512, and its training is a one-time computation. The UBM is
mean-adapted to the feature vectors of the segment in question to obtain a GMM for the
segment. The mean vectors of the UBM and of the adapted segment GMM are each concatenated
to form 30720-dimensional supervectors (60 × 512).
The Total Variability space is a subspace of the GMM supervector space that captures all
the speaker- and channel-related information. T is the low-rank matrix whose columns span
the Total Variability subspace. For the proposed system, T is trained on the same
speaker-labeled dataset used for UBM training; this too is a one-time computation. The
i-vector of a segment is the projection of its GMM supervector onto the Total Variability
subspace:
m = M + Tx (4.2)
where M is the UBM supervector and m is the mean-adapted GMM supervector of the segment.
Thus, for every segment, extraction of the i-vector x involves two steps: adapting the UBM
to obtain the segment's GMM supervector, and extracting the factors of the total
variability eigenvectors to obtain x. The algorithm for training T from speaker-labeled
training data is detailed in [36]. The proposed system uses the MSR Identity Toolbox [37]
for UBM training, training of the TV subspace, and i-vector extraction.
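As a geometric illustration of equation 4.2 only: the sketch below recovers x by least squares. This is not the posterior-statistics extractor of [36] used by the MSR Identity Toolbox, merely a hypothetical stand-in showing the subspace projection, with tiny toy dimensions in place of the 30720-dimensional supervector.

```python
import numpy as np

def ivector_leastsq(m, M, T):
    """Illustrative i-vector as the least-squares solution of m ≈ M + Tx.

    The real extractor in [36] computes the posterior mean of x from
    zeroth- and first-order statistics; this closed-form projection only
    illustrates the geometry of equation 4.2.
    """
    x, *_ = np.linalg.lstsq(T, m - M, rcond=None)
    return x
```

In the actual system, a 30720-dimensional supervector (512 mixtures × 60 features) is reduced to an i-vector of dimension of the order of 100.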
While evaluating the system with i-vectors, the dimension of the TV subspace and the choice
of distance between i-vectors were examined. Two distance metrics were tested for measuring
the similarity between i-vectors: the cosine similarity metric (eq. 4.3) and the
Mahalanobis distance metric (eq. 4.4), where W is the within-class covariance matrix
determined from the n training i-vectors of S speakers as in equation 4.5. The Mahalanobis
distance is hence also called within-class covariance normalization (WCCN). In equation
4.5, the vector w̄_s is the mean of the n_s i-vectors of speaker s.

¹ A supervector is the vector obtained by concatenating all the mean vectors of a GMM.

Figure 4.4: Extraction of i-vectors
D(x, y) = 1 − xᵀy / (‖x‖ ‖y‖)    (4.3)

D(x, y) = (x − y)ᵀ W⁻¹ (x − y)    (4.4)

W = (1/n) Σ_{s=1..S} Σ_{i=1..n_s} (w_i^s − w̄_s)(w_i^s − w̄_s)ᵀ    (4.5)
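Equations 4.3-4.5 translate directly into NumPy; the function names below are illustrative, not the thesis code.

```python
import numpy as np

def cosine_distance(x, y):
    """Equation 4.3: one minus the cosine of the angle between i-vectors."""
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def wccn_matrix(ivectors, speaker_ids):
    """Equation 4.5: within-class covariance over training i-vectors."""
    n, dim = ivectors.shape
    W = np.zeros((dim, dim))
    ids = np.asarray(speaker_ids)
    for s in set(speaker_ids):
        Xs = ivectors[ids == s]
        d = Xs - Xs.mean(axis=0)     # deviations from the speaker mean
        W += d.T @ d
    return W / n

def mahalanobis_distance(x, y, W):
    """Equation 4.4: WCCN-normalised distance between i-vectors."""
    d = x - y
    return d @ np.linalg.inv(W) @ d
```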
4.6.2 Choice of clustering algorithm
The proposed system is equipped with two clustering algorithms – the traditional hierarchical
agglomerative clustering algorithm and a graphical clustering algorithm called the ILP algorithm
that was recently introduced to speaker diarization in 2013 by Rouvier et al. [43].
HAC
In the HAC system, the most similar pair of speaker models is merged in every iteration. In
the merging step, a new speaker model is estimated from the data of all segments of both
merged models, and the similarity matrix is updated with the similarities of the new model
to the remaining models. Each iteration reduces the number of clusters by one, and the
process continues until a single cluster remains. The optimal set of clusters is chosen
from among the outputs of all iterations using an optimality criterion. HAC can be
implemented with either speaker model; each merge requires extra computation, since the
model must be retrained and the distance matrix updated with entries for the merged speaker
model.

Figure 4.5: ILP clustering on a complete graph of speaker models [44].
Two optimal-cluster criteria have been implemented for the HAC. With the distance threshold
criterion, the iteration at which the minimum pairwise similarity exceeds a threshold is
declared the optimal set of clusters (equation 3.5). With the Ts criterion introduced by
Nguyen [38], the set of clusters with minimum intra-cluster similarity and maximum
inter-cluster similarity is declared optimal using equation 3.6.
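A minimal HAC loop over a distance matrix might look as follows. Average linkage stands in for the model retraining the thesis system performs at each merge, and the threshold plays the role of the distance stopping criterion; all names are illustrative.

```python
import numpy as np

def hac(dist, threshold):
    """Greedy agglomerative clustering on a pairwise distance matrix.

    Repeatedly merges the closest pair of clusters until the smallest
    pairwise distance exceeds the stopping threshold.  Merged-cluster
    distances use average linkage as a stand-in for retraining the
    speaker model.
    """
    clusters = [[i] for i in range(len(dist))]
    D = np.array(dist, float)
    np.fill_diagonal(D, np.inf)
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)   # closest pair, i < j
        if D[i, j] > threshold:
            break
        # Average-linkage update of row/column i, then drop row/column j.
        ni, nj = len(clusters[i]), len(clusters[j])
        D[i, :] = (ni * D[i, :] + nj * D[j, :]) / (ni + nj)
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        clusters[i] += clusters[j]
        clusters.pop(j)
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return clusters
```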
ILP clustering
In ILP clustering, the k-centres problem is modified to obtain a set of clusters. The
original k-centres problem is to identify K cities out of N for building warehouses such
that the longest distance between a city and its nearest warehouse is minimized. In the ILP
formulation, the N segments play the role of the N cities, and K of them are to be chosen
as leader segments representing the K speakers. Since in diarization the number of speakers
K is unknown, the modification appears in the objective of optimization problem 4.6.
Consider the set of binary decision variables X_ij:

X_ii = 1 indicates that segment i is a leader (cluster centre).
X_ij = 1 indicates that segment j is assigned to leader i (which requires X_ii = 1).

Note that X_ji = 1 and X_ij = 1 have different meanings, although both indicate that the
i-vectors of segments i and j belong to the same cluster. Now consider the optimization
problem 4.6.
min  Σ_{i=1..N} X_ii + (1/δ) Σ_{i=1..N} Σ_{j=1..N} d_ij X_ij

s.t.  Σ_{i=1..N} X_ij = 1    ∀ j
      X_ij ≤ X_ii            ∀ i, j
      d_ij X_ij ≤ δ          ∀ i, j
      X_ij ∈ {0, 1}          ∀ i, j                                    (4.6)
The objective consists of two terms: the first is the number of leader clusters (the number
of speakers), and the second is the total dispersion over all K clusters. The first
constraint ensures that each segment is assigned to exactly one cluster; the second ensures
that a cluster centre is assigned to its own cluster; the third prevents assigning a
segment to a leader farther away than the threshold δ.
Note that the ILP clustering algorithm does not require any information about the objects
being clustered; it depends only on the similarity matrix. The integer program is converted
to a one-dimensional ILP so that the intlinprog solver in MATLAB can generate the set of
clusters. The only disadvantage of ILP clustering is that the speaker models are not
refined iteratively as in HAC; hence, if the segments are small, the i-vectors chosen as
leaders may not represent the speaker information completely. The algorithm can be used
with either speaker model, GMMs or i-vectors, since only the similarity matrix is needed to
obtain the optimal set of clusters.
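For intuition, problem 4.6 can be solved exactly for small N by enumerating leader sets. This brute-force sketch mirrors the objective and constraints but is not the one-dimensional intlinprog formulation used in the system; names are illustrative.

```python
import numpy as np
from itertools import product

def ilp_cluster_bruteforce(d, delta):
    """Exhaustive solver for the clustering ILP (problem 4.6).

    Enumerates every possible set of leader segments and assigns each
    segment to its nearest leader, keeping assignments within the
    dispersion bound delta.  Only feasible for small N.
    """
    N = len(d)
    best, best_cost = None, np.inf
    for leaders in product([0, 1], repeat=N):
        idx = [i for i in range(N) if leaders[i]]
        if not idx:
            continue
        assign, disp, feasible = [], 0.0, True
        for j in range(N):
            i = min(idx, key=lambda k: d[k][j])   # nearest leader for j
            if d[i][j] > delta:                   # constraint 3 violated
                feasible = False
                break
            assign.append(i)
            disp += d[i][j]
        if feasible:
            cost = len(idx) + disp / delta        # objective of 4.6
            if cost < best_cost:
                best, best_cost = assign, cost
    return best
```

Assigning each segment to its nearest feasible leader is optimal once the leader set is fixed, so the enumeration covers the whole solution space of 4.6.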
4.7 Evaluation of Speaker Clustering
This section describes clustering experiments performed using the best outputs of the
preceding SAD and segmentation stages. Sections 4.7.1 and 4.7.2 present experiments with
traditional hierarchical clustering using GMM and i-vector speaker models respectively;
Sections 4.7.3 and 4.7.4 present experiments with the ILP clustering algorithm on GMM and
i-vector speaker models, in that order.
The experiments presented below were performed on the NDTV dataset. The DER values
presented are the overall diarization error rates, which are averages of the individual DER per
episode weighted with the duration of the episode.
Figure 4.6: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.1 HAC experiments with GMM speaker models
Hierarchical clustering is performed on a similarity matrix between clusters, and requires a
stopping criterion which decides the optimal set of clusters. Two stopping criteria have been
implemented – (i) distance threshold criterion and (ii) Ts optimal stopping criterion.
Distance threshold criterion
During HAC, as the iterations proceed, underclustering decreases until the minimum-DER
clustering is reached; in the iterations that follow the optimal set of clusters,
overclustering sets in. This is demonstrated in the graphs below.
Ts optimality criterion
Using the optimality criterion of Nguyen [38] given by equation 3.6, the set of clusters
whose inter-cluster and intra-cluster distance histograms are farthest apart was chosen.
The NCLR distance attained a DER of 22.15%, whereas the CLR resulted in a DER of 19.83%.
4.7.2 HAC with i-vector speaker models
HAC was also performed with i-vector speaker models; new i-vectors were extracted for every
segment obtained in the cluster merging step. The best result was 16.69% DER, for a
75-dimensional TV space with the Mahalanobis distance.
Figure 4.7: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.3 ILP based experiments with GMM speaker models
ILP clustering was performed using the CLR and NCLR distances to construct the distance
matrices. With the CLR distance the best result was 19.03% DER, whereas with the NCLR
distance it was 17.27%. The x-axis denotes the threshold δ present in the constraints of
the ILP optimization problem. The better performance of ILP over the Ts optimality
criterion (equation 3.6) concurs with [50]. The NCLR represents the distance better than
the CLR; however, it is not suitable for use in HAC since, as the size of the merged
cluster increases, the segment size acts to decrease its NCLR distance from other segments
(equations 3.2 and 3.3).
The integer linear programming formulation, on the other hand, offers a holistic path to
the optimal clustering. To verify this, the ILP formulation was applied to the CLR and NCLR
similarity matrices generated using the GMM speaker models, and it gives an 11% relative
improvement in error over the best GMM-HAC clustering result. In the literature, the ILP
has only been tried with i-vectors.
Figure 4.8: DER on NDTV dataset: ILP with distance threshold for GMM speaker models
4.7.4 ILP clustering with i-vector speaker models
The ILP clustering was implemented with i-vectors trained on the TIMIT+TIFR dataset. The
following experiments determine the best dimension for the Total Variability subspace and
the best choice of distance. The Mahalanobis distance offers a background compensation
method that enhances the similarity between segments from the same speaker recorded over
different backgrounds.
GMM v/s i-vector
GMM-based speaker modeling gives a very high dimensional representation of the segment and
hence also captures background information. Similarity in background can induce similarity
between segments of different speakers, so background compensation schemes must be applied
on the feature space; i-vectors, on the other hand, allow background compensation through
WCCN. Another issue with GMM speaker models is the high computation time for segment
similarity, due to the cross-likelihood terms in equations 3.2 and 3.3.
HAC v/s ILP
The HAC, though a greedy clustering algorithm, works as a good approximation. However, if
an erroneous merge occurs at some step of the clustering, it significantly affects the performance
of the later steps. Since a re-estimation of the cluster models needs to be done at each step, HAC
is more expensive than ILP. The ILP does a more thorough search than the HAC by exploring all
$\sum_{K=1}^{N} \binom{N}{K} = 2^N$ possible cluster combinations.
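The greedy behaviour of HAC can be made concrete with a small sketch. This is an illustration only: it uses average linkage on a precomputed distance matrix as a stand-in for the thesis's re-estimation of cluster GMMs and CLR/NCLR distances at each merge step.

```python
import numpy as np

def greedy_hac(dist, stop_thresh):
    """Greedy agglomerative clustering over a symmetric distance matrix.
    Repeatedly merges the closest pair of clusters until no pair is
    closer than stop_thresh; an early bad merge is never undone, which
    is the weakness the ILP formulation avoids."""
    clusters = [[i] for i in range(len(dist))]

    def cdist(a, b):
        # average linkage between two clusters of segment indices
        return np.mean([dist[i, j] for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(cdist(a, b), ia, ib)
                 for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        best, ia, ib = min(pairs)
        if best > stop_thresh:
            break                      # stopping criterion reached
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters
```

Each iteration commits to the single locally best merge, so the search visits only O(N) of the exponentially many clusterings the ILP considers implicitly.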
Figure 4.9: Performance of ILP clustering with i-vector speaker models with varying dimensions of the Total Variability subspace. The red plot is for the Mahalanobis similarity; the blue plot for the Cosine similarity.
Table 4.8: Best DER (%) from the 2 speaker models and 2 clustering algorithms

              HAC     ILP
  GMM        19.45   17.27
  i-vector   17.11   16.18
4.7.5 Results on REPERE corpus
Previously reported results for the dev0 subset of REPERE show a 17.19% DER with
GMM speaker models and 15.46% DER with i-vector speaker models. For the dev0 subset,
we achieved a 23.19% DER with the HAC-GMM clustering and a 21.02% DER with the ILP-i-vector
clustering. The poorer performance compared to the previously reported results could be
because of smaller-sized UBM models (2048 components are used by LIUM [25]).
The overall DER for the 60-hour REPERE corpus is best for the ILP-i-vector clustering
combination, i.e., 24.4%.
Summary
In this chapter the proposed system and its components were described. The system has been
equipped with state-of-the-art clustering algorithms and speaker models. It has been
built using MFCCs as the primary feature vectors in every component, although other feature
vectors may be explored. A completely unsupervised speech activity detection algorithm
has been implemented in the system that can be ported to other speech processing tasks.
The speech activity detection uses an existing music-vs-speech discriminator for building the
nonspeech models from the recording.
Chapter 5
Conclusion and Future work
5.1 Conclusion
The aim of this thesis was to study the state-of-the-art techniques in speaker diarization for
specific application to broadcast news audio recordings and to develop a MATLAB-based system
for the same. The proposed system has been evaluated using the diarization error rate metric
(detailed in Chapter 2) and presented with new additions in unsupervised speech activity
detection. The system has three main components, viz. a speech activity detector, a ∆BIC-based
speaker change detector, and a state-of-the-art speaker clustering block. The system has been
evaluated on two news databases - the NDTV dataset and the REPERE dataset.
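The ∆BIC criterion used by the speaker change detector admits a compact sketch. This follows the standard formulation (full-covariance Gaussians over two adjacent analysis windows, with the usual model-complexity penalty); the window handling and λ = 1 default are illustrative, not a transcript of the thesis implementation.

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC between adjacent feature windows X and Y (frames x dims).
    Compares one full-covariance Gaussian over the merged window against
    one Gaussian per window; positive values favour a speaker change."""
    Z = np.vstack([X, Y])
    n, d = Z.shape

    def logdet(A):
        # log-determinant of the ML (biased) covariance estimate
        return np.linalg.slogdet(np.cov(A, rowvar=False, bias=True))[1]

    # penalty for the extra mean + covariance parameters of the 2-model fit
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(Z)
                  - len(X) * logdet(X)
                  - len(Y) * logdet(Y)) - lam * penalty
```

Sliding this statistic along the recording and keeping local maxima above zero yields the candidate speaker change points.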
The general-purpose speech activity detector is capable of removing silences as well as audible
nonspeech such as music from a recording. The speaker clustering block supports state-of-the-art
i-vector speaker models for representing segments, which can facilitate further work in
fast cross-show diarization.
Experiments were performed on two broadcast news corpora – Indian news dataset from
NDTV and the French REPERE corpus. The NDTV corpus is a 4h15m dataset from one news
show. This dataset was manually annotated for the diarization experiments. The REPERE
dataset of 60h04m was obtained from the French ELDA.
The system is capable of performing speech activity detection without dependence on ex-
ternal training data for nonspeech and speech models. Frame energy and zero crossing rate
have been used as bootstrapping features to construct silence and music models from the audio
recording being processed. A competitive speech activity detection has been achieved with a
two-stage SAD system – a silence detection, followed by a music detection. The results are
comparable to a state-of-the-art GMM-HMM based speech activity detector which uses external
training data from a large dataset for creating nonspeech models.
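The bootstrapping step described above can be sketched in a few lines. The frame sizes, the log-energy/ZCR definitions, and the low-energy quantile used to seed the silence model are illustrative assumptions, not the thesis's exact parameters:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time log-energy and zero-crossing rate per frame - the
    bootstrapping features used to seed silence/music models."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.log(np.sum(frame ** 2) + 1e-12))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

def bootstrap_silence_mask(feats, quantile=0.2):
    """Label the lowest-energy frames as candidate silence; these frames
    seed the silence model without any external training data."""
    thresh = np.quantile(feats[:, 0], quantile)
    return feats[:, 0] <= thresh
```

The frames selected this way train the nonspeech model from the recording itself, which is what makes the SAD fully unsupervised.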
The i-vector speaker models, which are now state-of-the-art in speaker verification, provide
a low dimensional representation of the speaker information compared to traditional GMM
speaker models. They also offer a computational advantage since distance computation between
i-vectors is much faster compared to cross-likelihood based similarity computation on GMM
speaker models. Hence for real-time diarization systems, i-vectors seem more appealing.
It has been verified in this thesis, as indicated in [43], that speaker clustering is better
achieved using a global optimization approach to reach the optimum set of speaker clusters rather
than the traditional greedy optimization approach of the hierarchical agglomerative clustering
(HAC) algorithm. HAC is computationally very expensive, and an erroneous merge step during
the clustering significantly affects the later iterations, i.e., the error gets propagated. The integer
linear programming (ILP) clustering formulation, on the other hand, offers a holistic trajectory
to reach the optimum clustering. It is a graphical approach to clustering adapted from the
well-known k-centres problem in combinatorial optimization. To verify the better performance of
ILP compared to HAC, the ILP formulation was implemented for the CLR and NCLR similarity
matrices generated using the GMM speaker models, and it gives an 11% relative improvement in
error compared to the best error from the GMM-HAC clustering algorithm. In the literature,
the ILP had previously been tried only with i-vectors.
5.2 Future Work
Future work on the system development should focus on the following aspects of speaker diariza-
tion:
Refinement of the diarization output by passing it through a Viterbi decoder should be
attempted.
Cross-show diarization is the task of performing speaker clustering across different recordings
to identify segments of the same speakers in different shows. The current momentum of diarization
research is towards solving this problem for large databases. Cross-show diarization should be
attempted using the proposed MATLAB system.
Improvements to the ILP have yielded substantially faster implementations by reducing the
redundancies in the original formulation, although MATLAB does not natively support solving these
optimization problems. Solvers such as Gurobi provide support for solving advanced integer linear
programs.
It was observed during the speaker clustering that segments having background music failed
to show similarity with segments having a clean background when using the MFCC-GMM
speaker models, owing to the low SNR. Even after using background variability compensation
techniques on i-vector speaker models, the problem persists. Speech enhancement and singing
voice separation prior to parameterising the audio recording should be attempted so that music
in the background of a speaker is suppressed.
Bibliography
[1] Inside the secret technology that makes ‘the daily show’ and ‘last week tonight’ work,