Speaker Diarization of Broadcast News Audios
Submitted in partial fulfillment of the requirements
of the degree of
Bachelor of Technology and Master of Technology
by
Parthe Pandit
(Roll no. 10D070009)
Supervisor:
Prof. Preeti Rao
Department of Electrical Engineering
Indian Institute of Technology Bombay
2015
Dedicated to
George, Elaine, Kramer & Jerry
Parthe Pandit/ Prof. Preeti Rao (Supervisor): “Speaker Diarization of Broadcast News
Audios”, Dual Degree Dissertation, Department of Electrical Engineering, Indian Institute of
Technology Bombay, July 2015.
Abstract
Speaker Diarization is a multimedia indexing technology that makes use of audio information to
answer the question “Who spoke when?” This thesis presents a step-by-step speaker diarization
system implemented in MATLAB that is evaluated using the Diarization Error Rate (DER) metric.
The proposed system, designed for segmenting audio recordings of broadcast news, provides
implementations of state-of-the-art i-vectors as well as the traditional GMM speaker models. A
graphical clustering algorithm introduced by Rouvier et al. in 2013 has also been implemented.
This clustering algorithm offers lower DER as well as a computational advantage compared to
the conventional GMM based hierarchical agglomerative clustering. An unsupervised speech
activity detector (SAD) has also been developed that discards nonspeech in two stages: silence
removal followed by music removal. The music removal subsystem has been adapted to classify
speech segments with background music, e.g. news headlines sections, as speech. The proposed
SAD achieves a favourable performance on the January 2013 subset of the REPERE corpus
compared to the supervised SAD of the LIUM diarization toolkit.
Index terms: unsupervised, speech activity detection, MATLAB, ILP clustering, REPERE
Table 2.3: Annotated shows in REPERE corpus and their respective times
Chapter 3
State of the Art in Speaker Diarization
The problem of speaker diarization involves answering “Who spoke when”. It is generally broken
down into answering “is anyone speaking?” and then answering “which speaker in the audio is
speaking?” The first step is called speech activity detection, which is a pre-processing step
common in speaker recognition, speech recognition, speech coding and speech enhancement
[8]. The latter problem can be approached as finding the change in speaker (called speaker
segmentation) and then combining the contiguous segments belonging to the same speaker under
a unique label (called speaker clustering).
In the late 1990s, when research in diarization was still in its nascent stages, a few systems
attempted to perform speech activity detection as a by-product of the segmentation and
clustering [8]; nonspeech was treated as just another speaker. But owing to the acoustic
variability of nonspeech, systems with explicit speech activity detectors performed much better.
Often, the speaker segmentation and speaker clustering are performed iteratively and hence
shown as a single block [9] as in figure 3.1.
In this chapter, previously used methods in speaker diarization are reviewed, and the
state-of-the-art algorithms implemented by various systems specialized in diarization of broadcast
news, meeting recordings and telephone conversations are compared. In recent years,
the National Institute of Standards and Technology (NIST), USA has organised rich transcription
tasks for broadcast news and telephone diarization (2003-‘04) and for meeting diarization
(2005, ‘07, ‘09). The Albayzin campaign of 2010 and the ESTER (2008) [10] and REPERE (2012-14)
broadcast audio and video diarization campaigns have fueled research in broadcast news diarization
and attracted developers to participate with their diarization engines to set up benchmarks.
Some of these competitor systems have also been reviewed in this chapter.
Figure 3.1: A typical speaker diarization system
3.1 Feature Extraction for Speaker Diarization
For the task of speaker diarization, acoustic features that discriminate speaker information in the
spectrogram but are invariant to the phone sequence being uttered are desired. Mel-frequency
cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients, although not
designed to distinguish between speakers, have been used widely in the areas of speaker verifica-
tion and speaker recognition. Since a similar task of modelling speaker information is tackled in
speaker diarization, MFCCs and other cepstral features are the most commonly used features.
During speaker segmentation 12-19 MFCCs have been used along with the short time energy,
while during clustering usage of higher order derivatives of these MFCCs has been reported
[11]. LFCCs extracted using a linear filter bank instead of the Mel scale filter bank [12] and
Linear Prediction Cepstral Coefficients (LPCCs) [13] have also been tested but no conclusion
has been reached regarding the better performance of either. Typical analysis window sizes
are 25-30 ms, with frame hops of 10 ms.
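The framing parameters above can be illustrated with a short sketch. This is numpy-only and deliberately crude: the Mel filter bank of a true MFCC front end is omitted, so the output is a plain log-spectral cepstrum rather than real MFCCs; the 440 Hz test tone is an assumed stand-in for audio.

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Slice a waveform into overlapping analysis frames (25 ms window, 10 ms hop)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

def cepstral_features(x, sr, n_ceps=13):
    """Crude cepstral features: log power spectrum followed by a DCT.
    (A real MFCC front end would insert a Mel filter bank before the log.)"""
    frames = frame_signal(x, sr)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logspec = np.log(power + 1e-10)
    # Type-II DCT via an explicit matrix keeps the example numpy-only.
    n = logspec.shape[1]
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return logspec @ dct.T

sr = 16000
t = np.arange(sr) / sr               # 1 s of audio
x = np.sin(2 * np.pi * 440 * t)
feats = cepstral_features(x, sr)
print(feats.shape)                   # (98, 13): ~98 frames x 13 cepstra
```

With a 1 s signal at 16 kHz, the 400-sample window and 160-sample hop yield 98 frames, matching the 10 ms frame rate quoted above.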
For speech activity detection, acoustic features that discriminate between speech and non-
speech are sought after. Features such as energy [13], zero-crossing rate, spectral centroid,
spectral roll-off and spectral flux [14] have been used previously in speech activity detection.
However, these features have almost always been used in concatenation with cepstral
features.
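The frame-level acoustic features just listed are cheap to compute. A minimal numpy sketch (the frame matrix here is random noise standing in for real 25 ms frames, and the feature set is only the subset named above):

```python
import numpy as np

def sad_features(frames, sr):
    """Per-frame energy, zero-crossing rate, spectral centroid and spectral flux,
    as used alongside cepstral features in speech activity detection."""
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)
    # Flux of frame 0 is defined as 0 since it has no predecessor.
    flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
    return np.c_[energy, zcr, centroid, flux]

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 400))   # 50 hypothetical 25 ms frames at 16 kHz
feats = sad_features(frames, sr=16000)
print(feats.shape)                        # (50, 4)
```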
Other than the above mentioned short time analysis features, 4Hz modulation frequency
features that convey long term characteristics of the acoustic signal have also been investigated
[15], and have been applied in speaker overlap detection and speech activity detection. A
major challenge with these features, though, is their high dimensionality and
the associated computational cost. Long term cumulative features drawn over texture
windows of 500 ms, such as the median pitch, long time average spectrum, deviation of the 4th
and 5th formants, harmonics-to-noise ratio, formant dispersion etc., have been shown to be useful for
fast cluster initialization [9], while features providing vocal source and vocal tract information
[16] have shown better speaker discrimination when used along with MFCCs.
Recently, Slaney et al. used features derived as activations of the bottleneck layer of a
neural network; the network was trained to discriminate 500 ms segments as
belonging to the same or different speakers [17]. In another work [18], a 50% relative improvement was
reported for speech activity detection on a large Youtube corpus when the two dimensional soft-max
activation of a deep neural network was concatenated with 13 MFCCs. Another interesting
feature space, explored in 2011, sacrificed diarization accuracy only slightly to obtain a 10x speed-up
by using binary valued features for clustering [19]. In this work, acoustic MFCC
features of segments are transformed into a binary feature space using likelihoods obtained from
GMMs.
3.2 Segmentation
In audio segmentation, the task is to create homogeneous and contiguous chunks of audio that
are dissimilar from their neighbouring segments. It is also called acoustic change detection.
We will look at two approaches to audio segmentation with more focus on methods used in
speaker segmentation applied to speaker diarization.
3.2.1 Metric based segmentation
One of the most common audio segmentation methods to date is metric-based segmentation.
These methods are very popular in music segmentation tasks as well. In metric based segmentation,
a distance metric that indicates the similarity of two audio segments is first defined.
Then a change detection strategy is implemented using this metric. Compared to model based
methods, these methods have the great advantage of not needing any a priori information about
the data.
For music segmentation, distances are calculated between the features directly. In speech
processing, however, the features used (generally cepstral features) are not suitable for frame-wise
distance computation for comparing speaker similarity, due to their variability with the
phones uttered. To aggregate speaker information from longer segments, it is assumed that
the features of every segment come from a probability distribution. Distance comparison is
done between these probability distributions using statistical similarity measures such as the
KL divergence, the Cross Likelihood Ratio, the Bayesian Information Criterion etc. The most commonly
used probability distribution for modeling chunks of feature vectors during speaker segmentation
is the full covariance multivariate Gaussian distribution.
Bayesian Information Criterion
The Bayesian Information Criterion (BIC) is a model selection criterion, i.e., a statistical
criterion that compares available models for representing the data while guarding against
over-fitting. For a set of vectors X, the BIC of a model M is one such criterion, defined as:
$$\mathrm{BIC}(X, M) = \log P(X \mid M) - \lambda \cdot \#(M) \cdot \log N$$
The first term is the likelihood of the data given the model, whereas the second term
penalizes it in proportion to the number of parameters #(M) that the model uses and the size
N of the data on which the model was trained. The second term is called the complexity of
the model.
The BIC can be applied to indicate whether the two sets of feature vectors being compared
for similarity are drawn from the same distribution or from different distributions. To measure
similarity between blocks X1 and X2 the following hypotheses need to be compared: H0: The
feature vectors from X1 and X2 are from same distribution and H1: The feature vectors from
X1 and X2 are from separate distributions. Let the model for H0 be M which would be trained
on X i.e., X1 concatenated with X2 and let the models for H1 be M1 and M2 for X1 and X2
respectively. We define
$$\Delta\mathrm{BIC} = \mathrm{BIC}(X_1, M_1) + \mathrm{BIC}(X_2, M_2) - \mathrm{BIC}(X, M)$$
A positive value of ∆BIC suggests dissimilarity between the two blocks X1 and X2 and hence
indicates a speaker change between them.
Chen et al. built a completely unsupervised system using ∆BIC and their method has been
replicated a number of times for speaker/environment change detection [21] in the speaker
diarization domain. Improvements were made by [22] and [23] to make faster implementations
of change detection using the BIC approach by reducing the number of computations with some
compromise on accuracy.
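A minimal sketch of the ∆BIC computation with full-covariance Gaussian models follows (numpy only). The penalty form and λ vary between papers, and the sign convention is chosen here so that positive values indicate a change; the synthetic data are assumed 12-dimensional cepstral-like vectors.

```python
import numpy as np

def gauss_bic(X, lam=1.0):
    """BIC of a full-covariance Gaussian fit to X: maximized log-likelihood
    minus lam * (#parameters / 2) * log N (one common form of the penalty)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True)       # ML covariance estimate
    _, logdet = np.linalg.slogdet(cov)
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)
    n_params = d + d * (d + 1) / 2                  # mean + full covariance
    return loglik - lam * 0.5 * n_params * np.log(n)

def delta_bic(X1, X2, lam=1.0):
    """Positive values favour two separate models, i.e. a speaker change."""
    X = np.vstack([X1, X2])
    return gauss_bic(X1, lam) + gauss_bic(X2, lam) - gauss_bic(X, lam)

rng = np.random.default_rng(1)
same = delta_bic(rng.standard_normal((250, 12)), rng.standard_normal((250, 12)))
diff = delta_bic(rng.standard_normal((250, 12)), 4.0 + rng.standard_normal((250, 12)))
print(same < 0, diff > 0)   # True True
```

Two blocks drawn from the same distribution give a negative ∆BIC (the penalty for a second model outweighs the likelihood gain), while blocks from well-separated distributions give a clearly positive one.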
Metric based segmentation methods are implemented in two strategies, a fixed sliding window
Figure 3.2: BIC for change detection [20]
strategy and a growing window search strategy. In the former, there is a window of fixed size,
the centre of which is being inspected for a change [24]. If the feature vectors on either side of the
midpoint are better modeled by separate distributions, resulting in a higher distance between
distributions, the midpoint is declared as a change point. The size of the sliding window is
typically 5 s, and the two 2.5 s halves are compared for similarity. With a larger window
the two segments would be modeled better; however, the chance of missing a change increases
if multiple change points fall within the window under consideration, since the probability
distribution estimated on a segment may get contaminated.
When implementing the growing window search strategy, a single change is sought from
the start of the recording in a window of a certain initial size, generally about 5 s. If no change
is detected in this window, the window is enlarged and the search repeated in the new window.
Once a change is detected, the search is reset to start from that change point. The growing
window method has been reported with the BIC metric [21]. In recent years, the [25] and [26]
systems have replicated the growing window BIC segmentation followed by a BIC clustering
that merges only consecutive segments to reduce false alarm speaker changes.
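The fixed sliding window strategy can be sketched as follows (numpy only). A GLR-style log-determinant distance stands in for ∆BIC, and the mean-plus-two-sigma threshold is an illustrative choice, not taken from any of the cited systems.

```python
import numpy as np

def gauss_dist(X1, X2):
    """GLR-like distance between full-covariance Gaussians fit to each half:
    how much worse a single pooled Gaussian explains both halves together."""
    def logdet(X):
        return np.linalg.slogdet(np.cov(X, rowvar=False, bias=True))[1]
    n1, n2 = len(X1), len(X2)
    pooled = logdet(np.vstack([X1, X2]))
    return 0.5 * ((n1 + n2) * pooled - n1 * logdet(X1) - n2 * logdet(X2))

def sliding_window_changes(feats, half=250, step=50, thresh=None):
    """Slide a 2*half frame window; score each midpoint by the distance between
    the two halves; return midpoints whose score exceeds a threshold."""
    mids = range(half, len(feats) - half, step)
    scores = np.array([gauss_dist(feats[m - half:m], feats[m:m + half]) for m in mids])
    if thresh is None:
        thresh = scores.mean() + 2 * scores.std()
    peaks = [m for m, s in zip(mids, scores) if s > thresh]
    return peaks, scores

rng = np.random.default_rng(2)
feats = np.vstack([rng.standard_normal((1000, 12)),           # speaker A
                   3.0 + rng.standard_normal((1000, 12))])    # speaker B
peaks, _ = sliding_window_changes(feats)
print(peaks)   # peak(s) near the true change at frame 1000
```

The score rises as the window straddles the change at frame 1000 and drops once both halves are again homogeneous, which is exactly the peak behaviour described above.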
3.2.2 Model based segmentation
Model based segmentation methods train a GMM for every segmentation class. These GMMs
serve as state PDFs in a hidden Markov model (HMM) in which each state is connected to every
other state with equal transition probability. Viterbi decoding with this HMM yields a
segmentation of the audio recording. A major disadvantage of model based segmentation
is that the GMMs need to be known beforehand, and hence require external training data.
Figure 3.3: Sliding window search for speaker change detection. Distance is computed between two halves of the sliding window and plotted with time. Peaks in the distance indicate a change [23].
Figure 3.4: Growing window search for speaker change detection. Every search is for a single change point. Search is reset after the change is found [23].
However, when segmentation and clustering are performed iteratively, the output of the
speaker clustering algorithm provides a set of speaker segments as training data for the GMMs of
the next iteration, which refine the speaker segmentation. A pre-clustering is often performed to get
an initial grouping of audio segments [9], with each group loosely corresponding to one
speaker.
Often, model based segmentation methods have been used only as a post processing step to
achieve a refined segmentation [27]. Such model based techniques are common in speech
activity detection [25], where the acoustic change of interest is between speech
and nonspeech. In some telephone diarization systems [20], a pre-segmentation
of the audio recording based on bandwidth and gender has been performed using GMMs trained
for each of the 4 classes (2 bandwidths x 2 genders).
Model based segmentation methods are more common in meeting diarization systems.
With the advent of better clustering algorithms with i-vector speaker models, the focus has
shifted to performing speaker segmentation and speaker clustering separately for diarization.
However, use of the GMM-HMM framework for refinement is still popular [25].
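The Viterbi decoding step of the GMM-HMM framework can be illustrated with a minimal decoder (numpy only). Single 1-D Gaussians stand in for the per-state GMMs, and the sticky transition matrix is an assumed value that discourages sporadic state switches, as described above.

```python
import numpy as np

def viterbi(loglik, log_trans, log_init):
    """Viterbi decoding: loglik is (T, S) per-frame log-likelihoods of S states."""
    T, S = loglik.shape
    dp = np.full((T, S), -np.inf)
    bp = np.zeros((T, S), int)
    dp[0] = log_init + loglik[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans      # (from-state, to-state)
        bp[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + loglik[t]
    path = [dp[-1].argmax()]                       # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(bp[t, path[-1]])
    return path[::-1]

# Two 1-D Gaussian "class models" (means 0 and 3, unit variance); the sticky
# transitions enforce long runs in a single state.
rng = np.random.default_rng(3)
obs = np.r_[rng.normal(0, 1, 200), rng.normal(3, 1, 200)]
loglik = np.c_[-0.5 * obs ** 2, -0.5 * (obs - 3) ** 2]
log_trans = np.log(np.array([[0.99, 0.01], [0.01, 0.99]]))
path = viterbi(loglik, log_trans, np.log([0.5, 0.5]))
print(path[0], path[-1])   # 0 1
```

The decoded path stays in state 0 for the first half and state 1 for the second, with the transition located near the true boundary despite frame-level likelihood noise.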
3.3 Speech Activity Detection
The task of finding contiguous segments of speech in an audio recording and segregating them
from other types of sounds is called speech activity detection (SAD). It is beneficial for speech
processing systems since it is practical to process only speech segments rather than entire
recordings, making a design more efficient by saving computation time and resources. Apart from
the computational advantages, the absence of an SAD often causes insertion errors in ASR systems.
Hence speech activity detection is a fundamental task in almost all fields of speech processing:
coding, enhancement and recognition [8].
In speaker diarization, the error metric itself highlights the need for a speech activity detec-
tor since missed speech and false alarm speech are included in the diarization error rate metric.
Moreover, with limited speaker data from small speech segments, presence of non-speech con-
taminates the estimated speaker models thereby affecting the performance of the diarization
system. Initial approaches to diarization tried to let SAD be a by-product of the diarization
system [8] by letting nonspeech be a single cluster which would be discarded at the end. However
it was soon noticed that systems having an explicit SAD gave better results.
SAD is often performed using frame-wise classification. Statistical models are trained and
estimated on a feature space suited to discriminating the speech and nonspeech classes.
In most cases the statistical models are Gaussian mixture models and the features are cepstral.
Some works have reported the use of acoustic features such as energy [13], zero-crossing
rate [28] and spectral flux [14]. A few speech activity detectors that were
used in previous diarization competitions and campaigns for both the meeting and broadcast
news domain have been reviewed in subsections 3.3.1 & 3.3.2.
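The frame-wise classification just described can be sketched as follows (numpy only). Single diagonal Gaussians stand in for the class GMMs, the 11-frame median smoothing window is an assumed value, and the synthetic "speech" and "nonspeech" frames are placeholders for real cepstral features.

```python
import numpy as np

def fit_diag_gauss(X):
    """ML mean/variance of a diagonal Gaussian class model (a 1-component
    stand-in for the GMMs typically used)."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik_diag(X, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

def sad_classify(feats, speech_model, nonspeech_model, smooth=11):
    """Frame-wise speech/nonspeech decision followed by median smoothing to
    suppress sporadic label flips."""
    raw = (loglik_diag(feats, *speech_model) >
           loglik_diag(feats, *nonspeech_model)).astype(int)
    pad = smooth // 2
    padded = np.r_[[raw[0]] * pad, raw, [raw[-1]] * pad]
    return np.array([np.median(padded[i:i + smooth]) for i in range(len(raw))])

rng = np.random.default_rng(4)
speech = 2.0 + rng.standard_normal((300, 8))    # hypothetical speech frames
nonspeech = rng.standard_normal((300, 8))       # hypothetical nonspeech frames
models = fit_diag_gauss(speech), fit_diag_gauss(nonspeech)
labels = sad_classify(np.vstack([nonspeech[:150], speech[:150]]), *models)
print(labels[:150].mean() < 0.1, labels[150:].mean() > 0.9)   # True True
```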
3.3.1 Systems participating in NIST-RT evaluations
The NIST organised rich transcription evaluations that are now the current benchmark in
meeting diarization. These benchmarks consist of results obtained by four participant systems
[8]. Typically, 1-3% missed speech and 2-4% false alarm speech rates are the state of the art
in speech activity detection.
The SHoUT diarization toolkit [29] performs a bootstrap segmentation for SAD using
speech and nonspeech models pre-trained on a Dutch broadcast news dataset. It is followed
by an iterative classification using a Viterbi decoder on an HMM with 2 states representing speech
and nonspeech. The HMM allows controlling the minimum durations of speech and
nonspeech, thereby preventing sporadic transitions from one class to the other. The system uses
Figure 3.5: Model based SAD from SHoUT toolkit [29] using a GMM-HMM system
12 MFCCs concatenated with the zero crossing rate, plus their first and second derivatives, in a 39
dimensional feature vector. The system was used by ICSI [9] and LIA-Eurecom [12], although
the feature vectors used by the latter team for the iterative classification consisted of linear
frequency cepstral coefficients (LFCCs).
The UPC system [30] made use of modified support vector machines (Proximal SVMs) with
Gaussian kernels to segregate the speech and nonspeech in the audio. The modification allowed
for faster retraining of SVMs as suited for an iterative classification.
The IIR-NTU [13] system performed a bootstrap segmentation based on an energy derived
confidence score. An iterative classification with GMMs trained for speech and nonspeech, using
high confidence frames from the bootstrap segmentation, followed the initial segmentation to
refine the speech and nonspeech classes. The authors reported the use of Linear Prediction
Cepstrum Coefficients [13] for both the bootstrap segmentation and the iterative classification. This
approach was completely independent of external training data for the speech and nonspeech
models.
3.3.2 Broadcast news systems
In the LIUM diarization toolkit [25], the authors developed a model based segmentation system
for speech activity detection using an 8 state HMM with 2 states of silence (wide and narrow
band), 3 states of wide band speech (clean, over noise or over music), 1 state of narrow band
speech, 1 state of jingles, and 1 state of music. Each state is modeled with a 64-component
GMM over MFCCs, their deltas and delta-deltas. All the models were trained on extensive
per-class data from the ESTER1 dataset. This system resulted in 1.1% false alarm speech and
3.9% missed speech on the dev0 subset of the REPERE corpus. Results on the ESTER2 and
ETAPE databases are reported in [25]. Besides speech activity detection, the LIUM toolkit
also performs gender and bandwidth detection, again via model based segmentation,
with 128-component diagonal GMMs for each of the 4 classes (2 genders x 2 bandwidths) and
feature warping.
The Albayzin 2010 campaign saw five competing systems. The best results for SAD were
reported by [14]. Although the DER was much worse than others (55% DER), their SAD error
stood best at 3.4% (1.1% missed and 2.3% false alarm). The authors reported using multi-layer
perceptrons instead of GMMs to model the emission probabilities of a 5 state hybrid NN-HMM
system. The feature space was also expanded: 16 MFCCs were concatenated with 8 other audio
features - energy, zero-crossing rate, spectral centroid, spectral roll-off, maximum normalized
correlation coefficient and its frequency, harmonicity measure and spectral flux. Information
regarding other participating systems in the Albayzin campaign is mentioned in [10].
In the REPERE 2012-2014 evaluations, three consortia took part - SODA, QCompere and
PERCOL [31]. The SODA consortium used the LIUM toolkit described above. The QCompere system had a
4 state HMM similar to that of the LIUM toolkit, with one state each for speech, silence, noise and music [32],
modelled by GMMs of size 64. The PERCOL system [33] performed a 3 class GMM based SAD.
Interestingly, their 3 classes were non-speech, overlapping speech and non-overlapping speech,
each modeled by 256-component GMMs trained on the ETAPE corpus. The overlap detection
reportedly also improved the DER over the baseline clustering system.
3.4 Clustering
Clustering is a common problem in statistical data analysis. It has been addressed in many
scientific fields right from exploratory data mining to community detection in social networks. It
is the process of grouping a set of objects such that objects in each group, called a cluster, are more
similar to each other than they are to objects in other groups or clusters. The objects could be
points in a vector space or even statistical models. The similarity mentioned above is a distance-like
measure defined between the objects by the user. The word similarity is used because the
measure defined need not satisfy all the properties of a metric, viz. non-negativity, the triangle
inequality and symmetry. The words similarity and distance are used interchangeably
here, with smaller distance meaning greater similarity and vice versa.
The process of clustering is generally translation invariant, and hence the relative position
of the objects in their space is more relevant than the objects themselves. Indeed, this
relative position of the objects is indicative of pairwise similarity. For the problem of speaker
diarization, the aim is to perform clustering of segments of audio based on the active speaker in
each segment. Each cluster should ideally represent a single speaker.
The dimensionality of the spectrogram of a single segment is large, and comparison between
segments based on their spectrograms is not computationally viable. Hence each segment
needs to be represented in a low dimensional space in order to compare the similarity of speaker
information. A few speaker models that have been utilized in the past in the fields of speaker
verification and speaker recognition are reviewed in section 3.4.1. Every speech segment
is given a representative vector or a statistical model that characterizes its speaker
information. Clustering is performed on these speaker models. Section 3.4.2 reviews a traditional
clustering algorithm and a state-of-the-art algorithm based on a graphical approach.
3.4.1 Speaker models for clustering
Since speaker diarization needs to capture speaker information from audio segments, speaker
models commonly used in speaker verification and speaker recognition are adopted. The two
main speaker models, GMMs and i-vectors, are explained below. Of these, i-vectors have
recently become the state of the art in speaker verification tasks.
Gaussian Mixture Models
Gaussian Mixture Models (GMMs) of cepstral features are often used to model speakers. A
Gaussian mixture model is a popular tool for modeling multi-modal data and has the following
form:
$$p(x) = \sum_{i=1}^{N} w_i \,\mathcal{N}(x;\, \mu_i, \Sigma_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1 \tag{3.1}$$
Since segment durations can be small, the number of feature vectors available from
a segment is sometimes insufficient to estimate a full Gaussian mixture model. To overcome
this problem, a pre-trained Universal Background Model (UBM) is adapted to the segment to
obtain its speaker model [34]. The UBM is a comprehensive model of data from multiple
speakers combined, capturing the variability of speech. For GMMs of cepstral features,
different statistical similarity measures have been investigated, such as the symmetric
KL divergence, the normalized cross likelihood ratio (NCLR) etc. [35]. The KL divergence is an
information theoretic measure of how different two probability distributions are from each
other, while the cross likelihood ratio compares P(X1|M2) and P(X2|M1) (equations 3.2 & 3.3).
$$\mathrm{CLR}(X_1, X_2) = \log \frac{P(X_1 \mid M_1)}{P(X_1 \mid M_2)} + \log \frac{P(X_2 \mid M_2)}{P(X_2 \mid M_1)} \tag{3.2}$$
$$\mathrm{NCLR}(X_1, X_2) = \frac{1}{|X_1|} \log \frac{P(X_1 \mid M_1)}{P(X_1 \mid M_2)} + \frac{1}{|X_2|} \log \frac{P(X_2 \mid M_2)}{P(X_2 \mid M_1)} \tag{3.3}$$
where Mi is the model estimated on Xi. As we can see, if the feature vectors of segments X1 and
X2 come from the same speaker, X1 fits the model of segment X2 well, so the cross likelihood
increases, decreasing the distance.
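A sketch of the NCLR of equation 3.3 follows (numpy only). Single full-covariance Gaussians stand in for the UBM-adapted GMMs used in practice, and the synthetic segments are assumed 6-dimensional cepstral-like vectors.

```python
import numpy as np

def gauss_fit(X):
    """ML Gaussian segment model (mean, regularized full covariance)."""
    return X.mean(axis=0), np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])

def gauss_loglik(X, model):
    """Total log-likelihood of all frames in X under a Gaussian model."""
    mu, cov = model
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + quad).sum()

def nclr(X1, X2):
    """Normalized cross likelihood ratio (eq. 3.3);
    larger values indicate less similar speakers."""
    M1, M2 = gauss_fit(X1), gauss_fit(X2)
    return (gauss_loglik(X1, M1) - gauss_loglik(X1, M2)) / len(X1) \
         + (gauss_loglik(X2, M2) - gauss_loglik(X2, M1)) / len(X2)

rng = np.random.default_rng(5)
a1, a2 = rng.standard_normal((200, 6)), rng.standard_normal((200, 6))  # same "speaker"
b = 2.5 + rng.standard_normal((200, 6))                                # different "speaker"
print(nclr(a1, a2) < nclr(a1, b))   # True: the same-speaker pair scores lower
```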
Recent experiments in the speaker verification and speaker recognition fields noted that the means
of the Gaussians in a GMM carry most of the speaker-related information. Owing to the high variability
of the covariance matrices and mixture weights across utterances [34], these are not reliable indicators
of speaker information. Hence, instead of calculating the above likelihood scores, the means of a GMM are
concatenated to get a single vector (called the GMM supervector) in a high dimensional vector
space. Distance measures such as the cosine distance and Mahalanobis distance [36] have been
investigated on this space. To make a comparison between two GMM supervectors they need to
be adapted from the same UBM to make sure that the corresponding mean vectors of the GMM
are being compared between segments. The adaptation algorithm is detailed in [34].
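The supervector comparison can be sketched as follows (numpy only). The UBM means and per-speaker shifts are synthetic stand-ins for MAP-adapted models; sizes (16 components, 12-dimensional means) are assumed for illustration.

```python
import numpy as np

def supervector(means):
    """Concatenate the GMM component means into a single supervector."""
    return np.concatenate(means)

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(6)
ubm = rng.standard_normal((16, 12))                # 16 components, 12-dim means
shift_a, shift_b = rng.standard_normal((2, 16, 12))
sv_a1 = supervector(ubm + 0.1 * shift_a)           # speaker A, utterance 1
sv_a2 = supervector(ubm + 0.1 * shift_a            # speaker A, utterance 2
                    + 0.02 * rng.standard_normal((16, 12)))
sv_b = supervector(ubm + 0.1 * shift_b)            # speaker B
print(cosine_distance(sv_a1, sv_a2) < cosine_distance(sv_a1, sv_b))   # True
```

Because every supervector is adapted from the same UBM, corresponding mean positions align, and the same-speaker pair sits closer in cosine distance than the cross-speaker pair.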
I-vector representation
The concept of i-vectors was first introduced in speaker verification as a feature extraction
from GMMs to reduce the dimensionality of the GMM hyperparameters. With UBM sizes
of the order of 512, 1024 or even 2048 Gaussians in some GMM systems, the supervector
becomes too large for further computation. Instead, using factor analysis to reduce the
dimensionality of the supervector leads to a new representative vector with a few hundred
dimensions. The corresponding subspace, called the total variability subspace, is
hypothesized to contain the spectral information of the speaker and the background.
$$m = M + Tx \tag{3.4}$$
where m is the mean adapted supervector of the utterance for which the i-vector x is sought. M
is the mean supervector of the UBM. The matrix T is a tall low rank matrix representing the
total variability subspace which needs to be learned on a training dataset. Although supervectors
typically have tens of thousands of dimensions, this representation constrains all supervectors
to lie in an affine subspace of the supervector space. The dimension of the affine subspace is at
most a few hundred.
i-vector extraction requires speaker-labeled training data with multiple utterances per speaker,
with possible variations across utterances in phonetic balance and background
noise. The training algorithm for the total variability subspace [36] and the i-vector extraction
from the Baum-Welch statistics of the utterance have been implemented in the MSR Identity
toolkit [37].
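The factor-analysis model of equation 3.4 can be illustrated numerically. Note that the least-squares recovery below is only an illustration of the affine model: real i-vector extraction computes the posterior mean of x given zeroth- and first-order Baum-Welch statistics, not this shortcut, and the dimensions and random T are assumed values.

```python
import numpy as np

rng = np.random.default_rng(7)
D, R = 1000, 50                      # supervector dim, total-variability rank
M = rng.standard_normal(D)           # UBM mean supervector
T = rng.standard_normal((D, R))      # total variability matrix (random here)
x_true = rng.standard_normal(R)      # latent speaker/channel factor
m = M + T @ x_true                   # adapted supervector of an utterance

# Recover x by least squares; this works exactly because the model is
# noiseless, and serves only to show that m - M lies in the span of T.
x_hat, *_ = np.linalg.lstsq(T, m - M, rcond=None)
print(np.allclose(x_hat, x_true))    # True
```

The example makes concrete the claim above: all supervectors generated by the model lie in an affine subspace of at most R dimensions around M.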
3.4.2 Clustering Algorithms
Given the similarity matrix between the speaker GMMs or i-vectors, a clustering algorithm aims
at reaching the best set of clusters, with minimum intra-cluster variance and maximum inter-cluster
variance. We will look at two clustering algorithms used previously in diarization – (i)
hierarchical agglomerative clustering (HAC) and (ii) integer linear program (ILP) clustering,
Figure 3.6: Hierarchical agglomerative clustering
which use different solving techniques and also have different criteria for arriving at the best
set of speaker clusters.
Hierarchical Agglomerative Clustering
HAC is a greedy algorithm, i.e. it makes a locally optimal choice at each stage in the hope of
finding the global optimum. In an iterative process, the 2 most similar clusters are merged into a
single cluster, so the number of clusters reduces by 1 at each step. This continues
until only one cluster remains. While merging 2 clusters, the data from the segments corresponding
to the 2 clusters is concatenated and a single speaker model is re-calculated on it. The distances
of every other cluster to this newly formed cluster are re-calculated to update the similarity
matrix for the next step.
Step 0: Calculate the similarity matrix d(Xi, Xj).
Step 1: Find i* and j* such that i* ≠ j* and d(Xi*, Xj*) = min over i,j of d(Xi, Xj).
Step 2: (Merge step) Replace Xi* and Xj* by a single object Xk*, with k* = min(i*, j*).
Step 3: Update the similarity matrix d(Xi, Xk*).
Step 4: If the number of clusters > 1, go to Step 1.
Step 5: Choose the best set of clusters using an optimality criterion.
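The steps above can be sketched as follows (numpy only). Average linkage and a simple distance-threshold stopping rule are one of several possible choices; the toy 1-D distances are assumed data.

```python
import numpy as np

def hac(dist, merge_thresh):
    """Greedy agglomerative clustering on a precomputed distance matrix.
    Repeatedly merges the closest pair (average linkage) until the smallest
    remaining distance exceeds merge_thresh; returns a label per item."""
    n = dist.shape[0]
    labels = list(range(n))
    clusters = {i: [i] for i in range(n)}
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)
    while len(clusters) > 1:
        keys = list(clusters)
        sub = d[np.ix_(keys, keys)]
        i, j = np.unravel_index(sub.argmin(), sub.shape)
        ci, cj = keys[i], keys[j]
        if d[ci, cj] > merge_thresh:           # optimality/stopping criterion
            break
        clusters[ci] += clusters.pop(cj)       # merge step
        for ck in clusters:                    # average-linkage distance update
            if ck != ci:
                d[ci, ck] = d[ck, ci] = np.mean(
                    [dist[a, b] for a in clusters[ci] for b in clusters[ck]])
    for lab, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = lab
    return labels

# Toy distances: items 0-2 lie close together, 3-4 close together, groups far apart.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
dist = np.abs(pts[:, None] - pts[None, :])
print(hac(dist, merge_thresh=1.0))   # [0, 0, 0, 1, 1]
```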
The optimal set of clusters is chosen based on an optimality criterion. One optimality
criterion is to choose the set of clusters where the minimum inter cluster distance is greater
than a threshold. Another criterion was proposed by Nguyen [38]: among the sets of clusters
from all iterations, the one where the histograms of intra-cluster distances
and inter-cluster distances are farthest from each other is chosen.
$$\arg\max_k \; \min_{i \neq j} d\big(X_i^{(k)}, X_j^{(k)}\big) \;\geq\; \theta \tag{3.5}$$
where d(X_i^{(k)}, X_j^{(k)}) is the similarity matrix in the k-th iteration, and
$$\arg\max_k \; \frac{\big| m_{\mathrm{inter}} - m_{\mathrm{intra}} \big|}{\sqrt{\dfrac{\sigma_{\mathrm{inter}}^2}{n_{\mathrm{inter}}} + \dfrac{\sigma_{\mathrm{intra}}^2}{n_{\mathrm{intra}}}}} \tag{3.6}$$
where m_inter, σ_inter and n_inter are the mean, standard deviation and number of elements of the
inter-cluster distances, and similarly for the intra-cluster distances.
The ICSI system [9] performed an initial clustering of segments of 1s duration using prosodic
long term features. This segmentation was then refined iteratively in a GMM-HMM framework
over MFCC feature vectors derived from segments within each cluster. Each state of the HMM
represented a speaker and was modelled by a GMM. The use of the HMM allowed adding a
constraint for obtaining contiguous speech turns of 2s. In each iteration, the number of clusters
was reduced by merging state GMMs in a HAC using the BIC distance. The IIR-NTU system [13]
used LPCCs to form 30 initial clusters by uniformly dividing the audio, and then iteratively
used the same GMM-HMM framework to perform HAC with the CLR distance. The LIA-Eurecom
system [12] used the same framework, although with a top-down approach
(also called divisive clustering) instead of HAC, splitting states starting from
a single state. Many other systems have been implemented that use the HAC approach to
clustering with different distances [19, 39].
i-vector speaker models adopted from speaker verification were first used in speaker diarization
of telephone conversations, where the number of speakers in the recording is known a
priori; hence k-means clustering of i-vectors was performed to arrive at 2 clusters
[40, 41]. Later, an HAC-like clustering of i-vectors for broadcast news was reported in [42],
which demonstrated better performance over the traditional BIC based GMM HAC architecture.
ILP based clustering
In an effort by Rouvier and Meignier in 2012, a global optimization approach was proposed to
perform speaker clustering [43] using an integer linear program (ILP). Clustering is posed as a
combinatorial optimization problem on a complete graph (each node connected to every other
node). The speaker segments are considered as nodes of the graph and the incidence matrix is
23
Chapter 3. State of the Art in Speaker Diarization 24
the similarity matrix. The integer linear program to find the optimal clustering is a variation
of the k-centres problem. In simple words, the k-centres problem is to choose K cities out of N
for building warehouses so that the worst case distance between a city and its closest warehouse
is minimized. The ILP is adapted for unsupervised speaker diarization since the number of
speakers (K) in not known a priori.
With the introduction of ILP clustering in the broadcast news domain, diarization systems
now typically perform segmentation and clustering separately [10], followed by a
post-processing step of Viterbi decoding in the GMM-HMM framework. In 2014, Meignier et al.
[44] improved the ILP formulation by removing redundant constraints, making the clustering
extremely fast. In recent years, ILP-based methods have therefore overtaken traditional
HAC-based clustering in both speed and accuracy. The LIUM toolkit reported a 17.19% DER
with GMM-based CLR clustering and 15.46% DER with i-vector-based ILP clustering on the
REPERE corpus [44].
Summary
In this chapter, background concepts used in segmentation, clustering and speech activity
detection were presented, and previously implemented techniques from the speaker
diarization literature were reviewed. Other than the speaker diarization task, where
neither the number of speakers nor the presence of specific speakers is known a priori,
many specialized audio indexing tasks have been investigated in the past: explicitly
detecting the presence of music [45], finding the structure of a broadcast program [46], or
locating commercials to eliminate unwanted audio [47]. In another work, fast speaker change
detection was applied using speech transcriptions [48]. More generic systems specialized
for different domains include the Alize speaker recognition toolkit, which has a
diarization sub-block, and the SHoUT toolkit [29], which was designed for meeting
diarization. INRIA, IDIAP, and DiarTK are some other diarization toolkits still under
development.
Chapter 4
System Description and Evaluation
A complete end-to-end system has been developed in MATLAB that performs speaker diarization
of audio recordings. The system has been tested and evaluated on broadcast news recordings
and debate audios from two corpora, using the evaluation metrics described in Chapter 2.
This chapter gives a complete description of the system, from feature extraction to error
calculation. The choice of models and parameters is explained in the evaluation sections
for each subsystem. The parameters were optimized for broadcast news using the NDTV dataset
as the development set. The system has also been tested on the REPERE corpus for comparison
with broadcast diarization systems previously reported in the literature.
Meignier et al. [49] compared two approaches to diarization and discussed the pros and cons
of step-by-step and integrated systems. In the step-by-step approach, diarization is
performed in a single pass through the speech activity detector, the speaker segmenter and
the speaker clustering subsystems. In the integrated approach, information from the
clustering algorithm is used to refine the speaker segmentation, and clustering is
performed iteratively. With the advent of better clustering algorithms using i-vector
speaker models, the focus has shifted to performing speaker segmentation and speaker
clustering separately. However, use of the GMM-HMM framework for refinement is still
popular [25]. Figure 4.1 shows a block diagram of the proposed system, which follows the
step-by-step approach.
Figure 4.1: Block diagram of proposed MATLAB system
4.1 Feature Extraction
MFCC features have been used repeatedly in diarization research for all subsystems,
including speech activity detection, speaker segmentation and speaker clustering. The
proposed system uses the first 19 MFCCs in all three subsystems. In addition, short time
energy and zero crossing rate, with their first and second order derivatives, are used in
the SAD. The speaker segmentation subsystem uses only the 19 MFCCs and short time energy,
whereas the clustering also uses their first and second order derivatives. The analysis
windows are 30 ms long with a 20 ms overlap between consecutive frames (i.e. a 10 ms hop).
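As a concrete illustration of this frame arithmetic, the following NumPy sketch slices a signal into 30 ms frames with a 10 ms hop and computes the short time energy and zero crossing rate used later by the SAD. The thesis system itself is in MATLAB; the function names here are illustrative only.

```python
import numpy as np

def frame_signal(x, fs, win_ms=30, hop_ms=10):
    """Slice a signal into overlapping analysis frames.

    A 30 ms window with a 20 ms overlap corresponds to a 10 ms hop,
    i.e. 100 frames per second of audio.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

def short_time_energy(frames):
    # Mean squared amplitude per frame.
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # Fraction of consecutive samples that change sign.
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
```

At 16 kHz, one second of audio yields 98 complete 480-sample frames on this grid.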
4.2 Speech Activity Detection
Speech activity detection is the task of separating speech from nonspeech in an audio
recording. When designing a speech activity detector as a precursor to diarization, the
challenge is two-fold: achieving (i) minimum missed speech and (ii) minimum false alarm
speech. The percentage of speech misclassified as nonspeech by the SAD is called the missed
speech rate (MSR), whereas the percentage of nonspeech misclassified as speech is called
the false alarm speech rate (FASR); these two are the evaluation metrics for SAD.
State-of-the-art systems typically achieve 1-3% missed speech and 2-4% false alarm speech.
If the diarization system acts as a precursor to systems such as ASR or keyword spotting,
missed speech leads to deletion errors in the ASR system, while false alarm speech can lead
to insertion errors. For SAD as a block within diarization, false alarm speech often
contaminates the speaker models and affects both segmentation and clustering.
A common approach in speech activity detection is to attempt to classify all types of sounds
that are present in the recording. If the data being diarized is known beforehand and has some
peculiarities, such as audio indicators for marking sections in an episode, it becomes possible
to train statistical models for these markers in the data and classification is straightforward [2].
Unfortunately while developing generic systems, we do not have the luxury of possessing such
markers or sound effects to be expected in the input audio. In this section, the problem of
estimating non-speech models for SAD with no prior information about the data is addressed.
Sounds other than speech that often occur in audio databases are either human-produced,
such as fillers, lip-smacks, laughter and clapping, or instrumental, such as music and
jingles. Silence regions and pauses taken by speakers also form significant portions of
most audio databases. Together these sounds form the nonspeech category. Model-based
approaches are popular in SAD, where statistical models are trained for speech and
nonspeech from external
data. The drawback of such systems, however, is their sensitivity to the acoustic
conditions of out-of-sample data. Hybrid systems [9, 12] use a classifier trained on
external data to obtain an initial bootstrap segmentation; the speech and nonspeech models
are then refined iteratively over the audio being segmented, adapting to the acoustic
variations in nonspeech.
The bootstrap classification provides a subset of feature vectors of the recording being pro-
cessed that best represent the speech and nonspeech. Class models are initialized over these
token subsets on the feature space and frame-wise iterative classification is performed to refine
the classes. The bootstrap segmentation is generally performed on a smaller set of features which
are chosen based on heuristic information known about the speech and nonspeech classes.
4.2.1 Speech Activity Detection Algorithm
The speech activity detector in the proposed system is a model based classifier. It is independent
of external training data for modeling the nonspeech and speech classes. The approach to such
a model based speech activity detector is inspired by the SAD in the IIR-NTU submission to the
NIST RT2009 evaluations [13]. In our system speech activity detection is done in two decoupled
steps. First, silence is removed from the whole recording using an energy based bootstrapping
followed by iterative classification. In the second step, music and other audible nonspeech are
identified from the recording. For music removal the silence removed audio is fed to a music vs.
speech bootstrap discriminator. The frames of the audio which are music with a high confidence
are used to train a music model, which is iteratively refined. In both steps, only segments
of duration 1 s or longer are labeled as nonspeech, in order to avoid sporadic
nonspeech-to-speech transitions. This constraint is incorporated in [25] and [29] using a
GMM-HMM framework.
Silence Removal
Silence removal in the proposed system uses the 19 MFCC features concatenated with short
time energy (STE) and their first and second derivatives. A bootstrap segmentation assigns
each frame a confidence value for both the silence and speech classes. The bootstrap
silence model is a Gaussian mixture of size 4 trained over the 60-dimensional feature
space; a speech model of the same size is trained from high-confidence speech frames.
In an iterative classification step, each frame is classified into two classes viz. speech and
silence. The high confidence speech and silence frames from these are used to train the speech
and silence models for the next iteration. As the iterations proceed, the number of
60-dimensional Gaussians used to model the speech and silence GMMs is increased up to a
maximum.

Figure 4.2: Silence removal using energy based bootstrapping and iterative classification

The best results were obtained when the size of the Gaussian mixtures was capped at 32
for speech and 16 for nonspeech. This removes silences and pauses, but high-energy
nonspeech (also called audible nonspeech), such as jingles and music, is classified as
speech, since the MFCCs and frame energy of music resemble speech more than silence.
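The iterative refinement described above can be sketched as follows. This Python stand-in models each class with a single full-covariance Gaussian instead of the growing GMMs of the actual MATLAB system, and assumes boolean bootstrap masks are already available; all names are illustrative.

```python
import numpy as np

def iterative_two_class(features, boot_a, boot_b, n_iter=5):
    """Iteratively refine a two-class frame labelling (0 = class a, 1 = class b).

    boot_a / boot_b are boolean masks of high-confidence bootstrap frames
    for the two classes (e.g. silence and speech).  Each class is modelled
    by a single full-covariance Gaussian -- a simplified stand-in for the
    GMMs of growing size used in the thesis system.
    """
    def fit(X):
        # Maximum-likelihood Gaussian with a small ridge for stability.
        return X.mean(axis=0), np.cov(X.T) + 1e-6 * np.eye(X.shape[1])

    def loglik(X, mu, cov):
        d = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (np.einsum('ij,jk,ik->i', d, inv, d) + logdet)

    Xa, Xb = features[boot_a], features[boot_b]
    labels = boot_b.astype(int)
    for _ in range(n_iter):
        (ma, ca), (mb, cb) = fit(Xa), fit(Xb)
        labels = (loglik(features, mb, cb) > loglik(features, ma, ca)).astype(int)
        Xa, Xb = features[labels == 0], features[labels == 1]
        if len(Xa) < 2 or len(Xb) < 2:   # one class collapsed; stop early
            break
    return labels
```

Starting from a handful of bootstrap frames per class, each pass relabels every frame by likelihood and retrains both models on the enlarged subsets.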
Music removal
In 2005, [28] introduced a model-fitting based music vs. speech classifier with a reported
classification accuracy of 95%. The authors pre-segmented the audio into chunks of 1 s and
extracted 50 feature vectors over 20 ms windows. These feature vectors were
two-dimensional: (i) short time energy (STE) and (ii) zero crossing rate (ZCR) of the
windowed signal. A histogram of STE and ZCR is computed for each 1 s chunk and compared
with model histograms of speech and music derived from a large database of music and speech
data; these ideal histograms were modeled with χ² distributions. The chunk is labeled music
or speech after comparing its histogram with the χ² models.
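A minimal sketch of the histogram comparison, assuming the model histograms for music and speech are given. Here a symmetric χ² histogram distance stands in for the χ²-distribution model fitting of [28]; function names are illustrative.

```python
import numpy as np

def chi2_distance(h, g, eps=1e-12):
    """Symmetric chi-squared distance between two normalised histograms."""
    h = h / (h.sum() + eps)
    g = g / (g.sum() + eps)
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def classify_chunk(zcr_values, music_hist, speech_hist, bins):
    """Label a 1 s chunk by comparing its ZCR histogram to the class models."""
    h, _ = np.histogram(zcr_values, bins=bins)
    d_music = chi2_distance(h, music_hist)
    d_speech = chi2_distance(h, speech_hist)
    return 'music' if d_music < d_speech else 'speech'
```

The same comparison can be applied to the STE histogram; the chunk label follows the closer of the two class models.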
The music speech discriminator [28] fails when speech and music are present together. In
news broadcast archives, it is often the case that the most information dense parts of the archive
such as episode headlines have a characteristic background music that is specific to the show.
Taking this into consideration, porting the system as is would not just result in missed speech,
but would cause loss of highly informative speech data. Hence we use the output of the classifier
as a bootstrap segmentation. Initial estimate models for music and speech are trained from
high confidence frames of both classes. An iterative classification similar to the silence removal
system is done to refine the speech and music classes so as to discard music only segments. The
features used are the 19 MFCCs concatenated with the zero crossing rate and their first and
second derivatives; short time energy is not used during the iterative classification step.
It was observed that only after neglecting short time energy was the speech with background
music, initially classified as music, recovered to the speech class.
4.2.2 Confidence measures for Speech Activity Detection
During silence removal, a histogram of frame energies is used to rank all frames. The
frames with the 20% lowest energies are taken as high-confidence silence frames, whereas
the frames with the 10% highest energies are high-confidence speech. In every iteration,
only these frames are used to train the GMMs. For music removal, the aim is to exclude
frames that contain speech with background music but were classified as nonspeech; hence
only the 40% of frames with the highest zero crossing rate are taken from the ZCR histogram
as high-confidence music frames to train the music model.

Figure 4.3: Music removal using a music-speech discriminator for bootstrapping
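The percentile selection can be sketched directly with NumPy; the function name and packaging are illustrative, not the thesis code.

```python
import numpy as np

def confidence_masks(energy, zcr, low_pct=20, high_pct=10, zcr_pct=40):
    """Pick high-confidence bootstrap frames from feature histograms.

    Frames in the lowest 20% of energy are confident silence, the highest
    10% confident speech; for music removal, the top 40% of ZCR frames
    seed the music model.
    """
    sil = energy <= np.percentile(energy, low_pct)
    sp = energy >= np.percentile(energy, 100 - high_pct)
    mus = zcr >= np.percentile(zcr, 100 - zcr_pct)
    return sil, sp, mus
```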
4.3 Evaluation of Speech Activity Detection
The NDTV dataset was used as a development set to tune parameters for both silence and music
removal. These parameters were used to obtain results on the REPERE dataset as well.
This section presents results from three experiments: the effect of the GMM size in the
iterative classification step of the SAD, the shape of the GMM covariance matrix, and the
effect of cascading the two subsystems. Errors were computed using the pyannote library [5]
in Python. The missed speech rate (MSR) is the percentage of the audio duration for which
speech was misclassified as nonspeech, while the false alarm speech rate (FASR) is the
percentage misclassified as speech. The SAD error is the sum of MSR and FASR.
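On a common frame grid, the two rates can be computed as below; this is an illustrative frame-level sketch, not the pyannote implementation used in the evaluation.

```python
import numpy as np

def sad_error(ref, hyp):
    """Frame-level SAD metrics.

    ref and hyp are boolean arrays (True = speech) on a common 10 ms
    frame grid.  Rates are percentages of the total audio duration, so
    SAD error = MSR + FASR.
    """
    ref = np.asarray(ref, bool)
    hyp = np.asarray(hyp, bool)
    total = len(ref)
    msr = 100.0 * np.sum(ref & ~hyp) / total   # speech called nonspeech
    fasr = 100.0 * np.sum(~ref & hyp) / total  # nonspeech called speech
    return msr, fasr, msr + fasr
```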
4.3.1 Evaluation on the NDTV dataset
Since the silence and music removal blocks are decoupled, the system parameters are first
chosen for the former, and its output is fed to the music removal. The evaluation of the
music removal is therefore done separately.
Size of GMM in Iterative Clustering during Silence Removal
After the bootstrap segmentation, the models trained from the limited speech and nonspeech
frames are GMMs of size 4. As the amount of data available to each class increases, the
size of the GMMs modeling speech and silence is doubled in every iteration until it reaches
a maximum. In this experiment, the best-suited GMM size is chosen for each class. The table
shows % SAD error over 22 episodes of the NDTV dataset. The best results were obtained with
a 32-component GMM for speech and a 16-component GMM for silence. For the combination
(32, 64) the iterations did not converge and gave a high MSR in every iteration; a possible
explanation is that the silence model became a mixed model, with some Gaussian components
representing speech frames and others silence frames. For the combination (4, 64) the FASR
was very high, since of the two competing models the speech model always yielded a higher
likelihood for most frames and hence captured silence through some of its components.
Table 4.5: Results for REPERE with 60 hours annotated
4.4 Speaker Segmentation
The speaker segmentation algorithm in the proposed system is a growing window search [21]
using the ∆BIC distance, as shown in Figure 3.4. Starting from the beginning of the audio,
a search is made for a single speaker change; at every change found, the search is
restarted from the next frame. The search window is initialized to 5 s and a ∆BIC value is
computed for each frame in the window. If the maximum of this array exceeds a threshold θ, a
change is declared at that point. If no such maximum is found in the window, the window is
enlarged by 2 s and the procedure repeats until a change is detected. Only speech frames
are processed, after discarding the nonspeech indicated by the speech activity detector;
the change points found in the speech frames are then mapped back to their corresponding
locations in the original audio.
In two previous broadcast diarization toolkits, [25] and [29], segmentation is carried out
in two steps: first, ∆BIC-based change detection is performed as above with a threshold of
0, and then consecutive segments with a positive ∆BIC score are merged. The two steps are
needed because the zero-threshold ∆BIC segmentation oversegments the audio. To avoid the
two-step process, only maxima greater than a threshold θ are accepted, which significantly
reduces oversegmentation.
∆BIC(x_i) = N log|Σ| − N₁ log|Σ₁| − N₂ log|Σ₂| − (λ/2)(d + d(d+1)/2) log N    (4.1)
For speaker segmentation, the 19 MFCCs with short time energy have been used. The
segmentation algorithm scales as O(d^6), where d is the number of feature dimensions; hence
most speaker segmentation systems [21, 25, 29] do not use derivatives of cepstral features
during segmentation.
4.5 Choice of Segmentation parameters
The segmentation parameters were tuned on the NDTV dataset to minimize the diarization
error rate. The parameters tuned were θ and λ in equation 4.1, and their effect on the DER
was studied in two experiments. First, the DER was calculated in an ORACLE experiment [29],
in which the system is given the ground-truth speech activity detection and clustering, and
only the speaker segmentation output of the system is used. These experiments were inspired
by those performed for the SHoUT toolkit. This tests the system for missed speaker changes
only, since the short segments caused by false alarm speaker changes get labeled correctly
in the ORACLE.
            λ = 1    λ = 10
θ = 0       0.89     1.24
θ = 1000    1.40     2.37
θ = 2000    2.55     3.75

Table 4.6: DER (%) with ORACLE experiment
False alarm speaker changes increase the number of segments and hence reduce the average
segment size. To test their effect, the best clustering algorithm (ILP clustering with
i-vectors) was used and the DER was calculated. It was observed that low values of θ and λ
result in oversegmentation. As θ is increased, the average segment duration increases,
which enables better speaker modeling for the segments and results in a lower DER when
combined with the best clustering algorithm.
            λ = 1    λ = 10
θ = 0       31.52    33.41
θ = 1000    23.67    16.54
θ = 2000    12.35    16.59

Table 4.7: DER (%) with best clustering algorithm
Hence the combination of θ = 2000 and λ = 1 is the default in the proposed system.
4.6 Speaker Clustering
After the speaker changes have been detected by the speaker segmentation, the speaker
clustering subsystem gathers together segments from the same speakers. Each segment is
represented by a speaker model, pairwise similarities are computed between all speaker
models, and a clustering algorithm groups the segments.
4.6.1 Choice of speaker model
The proposed system implements two speaker models widely studied in speaker verification
and speaker recognition: (i) Gaussian mixture models and (ii) i-vector models. The GMM
(equation 3.1) is a probabilistic model on the feature space; the features used here are
short time energy concatenated with the 19 MFCCs and their first and second derivatives,
giving a 60-dimensional feature space. The similarity between GMMs is based on the cross
likelihood of the model of one segment fitting the data of the other. The GMM for a segment
is trained on its feature vectors using the Expectation-Maximization algorithm, yielding a
diagonal-covariance GMM of size 32. When evaluating the system with GMM speaker models, the
CLR and NCLR distances (equations 3.2 and 3.3) were tested with both the HAC and ILP
clustering algorithms.
i-vectors are vectors in Rⁿ, where n is of the order of 100. The similarity measures are
those used for Euclidean vectors, viz. the cosine distance, the Mahalanobis distance, etc.
The i-vector is derived from the GMM supervector¹ of the segment after a dimensionality
reduction using factor analysis [36]. The components of the i-vector are the speaker
factors of the eigenvoice vectors of the Total Variability space. The i-vector extraction
process is explained below.
i-vector extraction
To obtain the i-vectors, a speech Universal Background Model (UBM) is first trained on
training data. The UBM is a GMM with a large number of Gaussians, so that it captures all
possible variabilities of speech in the feature space. In the proposed system, the TIMIT
and TIFR datasets have been used for UBM training: the TIMIT set consists of 168 speakers
uttering 10 English sentences each, while the TIFR set consists of 100 speakers uttering 10
Hindi sentences each, both from native speakers of the respective languages. The UBM is a
diagonal-covariance GMM of size 512, and its training is a one-time computation. The UBM is
mean-adapted to the feature vectors of the segment in question to obtain a GMM for the
segment. The mean vectors of the UBM and of the adapted segment GMM are each concatenated
to form 30720-dimensional supervectors (60 × 512).
The Total Variability space is a subspace of the GMM supervector space that captures all
the speaker- and channel-related information. T is the low-rank matrix whose columns span
the Total Variability subspace. For the proposed system, T is trained on the same
speaker-labeled dataset used for UBM training; this too is a one-time computation. The
i-vector of a segment is the projection of its GMM supervector onto the Total Variability
subspace:
m = M + Tx (4.2)
where M is the UBM supervector and m is the mean-adapted GMM supervector of the segment.
Thus, for every segment, extraction of the i-vector x involves two steps: adapting the UBM
to obtain the segment's GMM supervector, and extracting the factors of the total
variability eigenvectors to obtain x. The algorithm for training T from speaker-labeled
training data is detailed in [36]. The proposed system uses the MSR Identity Toolbox [37]
for UBM training, training of the TV subspace, and i-vector extraction.
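As a geometric illustration of equation 4.2 only: the sketch below recovers x by least squares. This is not the posterior-statistics extractor of [36] used by the MSR Identity Toolbox, merely a hypothetical stand-in showing the subspace projection, with tiny toy dimensions in place of the 30720-dimensional supervector.

```python
import numpy as np

def ivector_leastsq(m, M, T):
    """Illustrative i-vector as the least-squares solution of m ≈ M + Tx.

    The real extractor in [36] computes the posterior mean of x from
    zeroth- and first-order statistics; this closed-form projection only
    illustrates the geometry of equation 4.2.
    """
    x, *_ = np.linalg.lstsq(T, m - M, rcond=None)
    return x
```

In the actual system, a 30720-dimensional supervector (512 mixtures × 60 features) is reduced to an i-vector of dimension of the order of 100.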
While evaluating the system with i-vectors, the dimension of the TV subspace and the choice
of distance between i-vectors were examined. Two distance metrics were tested for measuring
the similarity between i-vectors: the cosine similarity metric (eq. 4.3) and the
Mahalanobis distance metric (eq. 4.4), where W is the within-class covariance matrix
determined from the n training i-vectors of S speakers as in equation 4.5. The Mahalanobis
distance is hence also called within-class covariance normalization (WCCN). In equation
4.5, the vector w̄_s is the mean of the n_s i-vectors of speaker s.

¹ A supervector is the vector obtained by concatenating all the mean vectors of a GMM.

Figure 4.4: Extraction of i-vectors
D(x, y) = 1 − xᵀy / (‖x‖ ‖y‖)    (4.3)

D(x, y) = (x − y)ᵀ W⁻¹ (x − y)    (4.4)

W = (1/n) Σ_{s=1..S} Σ_{i=1..n_s} (w_i^s − w̄_s)(w_i^s − w̄_s)ᵀ    (4.5)
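Equations 4.3-4.5 translate directly into NumPy; the function names below are illustrative, not the thesis code.

```python
import numpy as np

def cosine_distance(x, y):
    """Equation 4.3: one minus the cosine of the angle between i-vectors."""
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def wccn_matrix(ivectors, speaker_ids):
    """Equation 4.5: within-class covariance over training i-vectors."""
    n, dim = ivectors.shape
    W = np.zeros((dim, dim))
    ids = np.asarray(speaker_ids)
    for s in set(speaker_ids):
        Xs = ivectors[ids == s]
        d = Xs - Xs.mean(axis=0)     # deviations from the speaker mean
        W += d.T @ d
    return W / n

def mahalanobis_distance(x, y, W):
    """Equation 4.4: WCCN-normalised distance between i-vectors."""
    d = x - y
    return d @ np.linalg.inv(W) @ d
```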
4.6.2 Choice of clustering algorithm
The proposed system is equipped with two clustering algorithms – the traditional hierarchical
agglomerative clustering algorithm and a graphical clustering algorithm called the ILP algorithm
that was recently introduced to speaker diarization in 2013 by Rouvier et al. [43].
HAC
In the HAC system, the most similar pair of speaker models is merged in every iteration. In
the merging step, a new speaker model is estimated from the data of all segments of both
merged models, and the similarity matrix is updated with the similarities of the new model
to the remaining models. Each iteration reduces the number of clusters by one, and the
process continues until a single cluster remains. The optimal set of clusters is chosen
from among the outputs of all iterations using an optimality criterion. HAC can be
implemented with either speaker model; each merge requires extra computation, since the
model must be retrained and the distance matrix updated with entries for the merged speaker
model.

Figure 4.5: ILP clustering on a complete graph of speaker models [44].
Two optimal-cluster criteria have been implemented for the HAC. With the distance threshold
criterion, the iteration at which the minimum pairwise similarity exceeds a threshold is
declared the optimal set of clusters (equation 3.5). With the Ts criterion introduced by
Nguyen [38], the set of clusters with minimum intra-cluster similarity and maximum
inter-cluster similarity is declared optimal using equation 3.6.
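A minimal HAC loop over a distance matrix might look as follows. Average linkage stands in for the model retraining the thesis system performs at each merge, and the threshold plays the role of the distance stopping criterion; all names are illustrative.

```python
import numpy as np

def hac(dist, threshold):
    """Greedy agglomerative clustering on a pairwise distance matrix.

    Repeatedly merges the closest pair of clusters until the smallest
    pairwise distance exceeds the stopping threshold.  Merged-cluster
    distances use average linkage as a stand-in for retraining the
    speaker model.
    """
    clusters = [[i] for i in range(len(dist))]
    D = np.array(dist, float)
    np.fill_diagonal(D, np.inf)
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)   # closest pair, i < j
        if D[i, j] > threshold:
            break
        # Average-linkage update of row/column i, then drop row/column j.
        ni, nj = len(clusters[i]), len(clusters[j])
        D[i, :] = (ni * D[i, :] + nj * D[j, :]) / (ni + nj)
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        clusters[i] += clusters[j]
        clusters.pop(j)
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return clusters
```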
ILP clustering
In ILP clustering, the k-centres problem is modified to obtain a set of clusters. The
original k-centres problem is to identify K cities out of N for building warehouses such
that the longest distance between a city and its nearest warehouse is minimized. In the ILP
formulation, the N segments play the role of the N cities, and K of them are to be chosen
as leader segments representing the K speakers. Since in diarization the number of speakers
K is unknown, the modification appears in the objective of optimization problem 4.6.
Consider the set of binary decision variables X_ij:

X_ii = 1 indicates that segment i is a leader (cluster centre).
X_ij = 1 indicates that segment j is assigned to leader i (which requires X_ii = 1).

Note that X_ji = 1 and X_ij = 1 have different meanings, although both indicate that the
i-vectors of segments i and j belong to the same cluster. Now consider the optimization
problem 4.6.
min  Σ_{i=1..N} X_ii + (1/δ) Σ_{i=1..N} Σ_{j=1..N} d_ij X_ij

s.t.  Σ_{i=1..N} X_ij = 1    ∀ j
      X_ij ≤ X_ii            ∀ i, j
      d_ij X_ij ≤ δ          ∀ i, j
      X_ij ∈ {0, 1}          ∀ i, j                                    (4.6)
The objective consists of two terms: the first is the number of leader clusters (the number
of speakers), and the second is the total dispersion over all K clusters. The first
constraint ensures that each segment is assigned to exactly one cluster; the second ensures
that a cluster centre is assigned to its own cluster; the third prevents assigning a
segment to a leader farther away than the threshold δ.
Note that the ILP clustering algorithm does not require any information about the objects
being clustered; it depends only on the similarity matrix. The integer program is converted
to a one-dimensional ILP so that the intlinprog solver in MATLAB can generate the set of
clusters. The only disadvantage of ILP clustering is that the speaker models are not
refined iteratively as in HAC; hence, if the segments are small, the i-vectors chosen as
leaders may not represent the speaker information completely. The algorithm can be used
with either speaker model, GMMs or i-vectors, since only the similarity matrix is needed to
obtain the optimal set of clusters.
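For intuition, problem 4.6 can be solved exactly for small N by enumerating leader sets. This brute-force sketch mirrors the objective and constraints but is not the one-dimensional intlinprog formulation used in the system; names are illustrative.

```python
import numpy as np
from itertools import product

def ilp_cluster_bruteforce(d, delta):
    """Exhaustive solver for the clustering ILP (problem 4.6).

    Enumerates every possible set of leader segments and assigns each
    segment to its nearest leader, keeping assignments within the
    dispersion bound delta.  Only feasible for small N.
    """
    N = len(d)
    best, best_cost = None, np.inf
    for leaders in product([0, 1], repeat=N):
        idx = [i for i in range(N) if leaders[i]]
        if not idx:
            continue
        assign, disp, feasible = [], 0.0, True
        for j in range(N):
            i = min(idx, key=lambda k: d[k][j])   # nearest leader for j
            if d[i][j] > delta:                   # constraint 3 violated
                feasible = False
                break
            assign.append(i)
            disp += d[i][j]
        if feasible:
            cost = len(idx) + disp / delta        # objective of 4.6
            if cost < best_cost:
                best, best_cost = assign, cost
    return best
```

Assigning each segment to its nearest feasible leader is optimal once the leader set is fixed, so the enumeration covers the whole solution space of 4.6.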
4.7 Evaluation of Speaker Clustering
This section describes clustering experiments performed using the best outputs of the
preceding SAD and segmentation stages. Sections 4.7.1 and 4.7.2 present experiments with
traditional hierarchical clustering using GMM and i-vector speaker models respectively;
Sections 4.7.3 and 4.7.4 present experiments with the ILP clustering algorithm on GMM and
i-vector speaker models, in that order.
The experiments presented below were performed on the NDTV dataset. The DER values
presented are the overall diarization error rates, which are averages of the individual DER per
episode weighted with the duration of the episode.
Figure 4.6: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.1 HAC experiments with GMM speaker models
Hierarchical clustering is performed on a similarity matrix between clusters, and requires a
stopping criterion which decides the optimal set of clusters. Two stopping criteria have been
implemented – (i) distance threshold criterion and (ii) Ts optimal stopping criterion.
Distance threshold criterion
During HAC, as the iterations proceed, underclustering decreases until the minimum-DER
clustering is reached; in the iterations that follow the optimal set of clusters,
overclustering sets in. This is demonstrated in the graphs below.
Ts optimality criterion
Using the optimality criterion of Nguyen [38] given by equation 3.6, the set of clusters
whose inter-cluster and intra-cluster distance histograms are farthest apart was chosen.
The NCLR distance attained a DER of 22.15%, whereas the CLR resulted in a DER of 19.83%.
4.7.2 HAC with i-vector speaker models
HAC was also performed with i-vector speaker models; new i-vectors were extracted for every
segment obtained in the cluster merging step. The best result was 16.69% DER, for a
75-dimensional TV space with the Mahalanobis distance.
Figure 4.7: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.3 ILP based experiments with GMM speaker models
ILP clustering was performed using the CLR and NCLR distances to construct the distance
matrices. With the CLR distance the best result was 19.03% DER, whereas with the NCLR
distance it was 17.27%. The x-axis denotes the threshold δ present in the constraints of
the ILP optimization problem. The better performance of ILP over the Ts optimality
criterion (equation 3.6) concurs with [50]. The NCLR represents the distance better than
the CLR; however, it is not suitable for use in HAC since, as the size of the merged
cluster increases, the segment size acts to decrease its NCLR distance from other segments
(equations 3.2 and 3.3).
The integer linear programming formulation, on the other hand, offers a holistic path to
the optimal clustering. To verify this, the ILP formulation was applied to the CLR and NCLR
similarity matrices generated using the GMM speaker models, and it gives an 11% relative
improvement in error over the best GMM-HAC clustering result. In the literature, the ILP
has only been tried with i-vectors.
Figure 4.8: DER on NDTV dataset: ILP with distance threshold for GMM speaker models
4.7.4 ILP clustering with i-vector speaker models
The ILP clustering was implemented with i-vectors trained on the TIMIT+TIFR dataset. The
following experiments determine the best dimension for the Total Variability subspace and
the best choice of distance. The Mahalanobis distance offers a background compensation
method that enhances the similarity between segments from the same speaker recorded over
different backgrounds.
GMM v/s i-vector
GMM-based speaker modeling gives a very high dimensional representation of the segment and
hence also captures background information. Similarity in background can induce similarity
between segments of different speakers, so background compensation schemes must be applied
on the feature space; i-vectors, on the other hand, allow background compensation through
WCCN. Another issue with GMM speaker models is the high computation time for segment
similarity, due to the cross-likelihood terms in equations 3.2 and 3.3.
HAC v/s ILP
The HAC, though a greedy clustering algorithm, works as a good approximation. However, if
an erroneous merge occurs at some step of the clustering, it significantly affects the performance
of the later steps. Since a re-estimation of the cluster models needs to be done at each step, HAC
is more expensive than ILP. The ILP does a more thorough search than the HAC by exploring all
$\sum_{K=1}^{N} \binom{N}{K} = 2^N$ possible cluster combinations.
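The greedy behaviour of HAC can be made concrete with a small sketch. This is an illustration only: it uses average linkage on a precomputed distance matrix as a stand-in for the thesis's re-estimation of cluster GMMs and CLR/NCLR distances at each merge step.

```python
import numpy as np

def greedy_hac(dist, stop_thresh):
    """Greedy agglomerative clustering over a symmetric distance matrix.
    Repeatedly merges the closest pair of clusters until no pair is
    closer than stop_thresh; an early bad merge is never undone, which
    is the weakness the ILP formulation avoids."""
    clusters = [[i] for i in range(len(dist))]

    def cdist(a, b):
        # average linkage between two clusters of segment indices
        return np.mean([dist[i, j] for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(cdist(a, b), ia, ib)
                 for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        best, ia, ib = min(pairs)
        if best > stop_thresh:
            break                      # stopping criterion reached
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters
```

Each iteration commits to the single locally best merge, so the search visits only O(N) of the exponentially many clusterings the ILP considers implicitly.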
Figure 4.9: Performance of ILP clustering with i-vector speaker models with varying dimensions of the Total Variability subspace. The red plot is for the Mahalanobis similarity; the blue plot for the Cosine similarity.
Table 4.8: Best DER (%) from the 2 speaker models and 2 clustering algorithms

              HAC     ILP
  GMM        19.45   17.27
  i-vector   17.11   16.18
4.7.5 Results on REPERE corpus
Previously reported results for the dev0 subset of REPERE show a 17.19% DER with
GMM speaker models and 15.46% DER with i-vector speaker models. For the dev0 subset,
we achieved a 23.19% DER with the HAC-GMM clustering and a 21.02% DER with the ILP-i-vector
clustering. The poorer performance compared to the previously reported results could be
because of smaller-sized UBM models (2048 components are used by LIUM [25]).
The overall DER for the 60-hour REPERE corpus is best for the ILP-i-vector clustering
combination, i.e., 24.4%.
Summary
In this chapter the proposed system and its components were described. The system has been
equipped with state-of-the-art clustering algorithms and speaker models. It has been
built using MFCCs as the primary feature vectors in every component, although other feature
vectors may be explored. A completely unsupervised speech activity detection algorithm
has been implemented in the system that can be ported to other speech processing tasks.
The speech activity detection uses an existing music-vs-speech discriminator for building the
nonspeech models from the recording.
Chapter 5
Conclusion and Future work
5.1 Conclusion
The aim of this thesis was to study the state-of-the-art techniques in speaker diarization for
specific application to broadcast news audio recordings and to develop a MATLAB-based system
for the same. The proposed system has been evaluated using the diarization error rate metric
(detailed in Chapter 2) and presented with new additions in unsupervised speech activity
detection. The system has three main components, viz. a speech activity detector, a ∆BIC-based
speaker change detector, and a state-of-the-art speaker clustering block. The system has been
evaluated on two news databases - the NDTV dataset and the REPERE dataset.
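The ∆BIC criterion used by the speaker change detector admits a compact sketch. This follows the standard formulation (full-covariance Gaussians over two adjacent analysis windows, with the usual model-complexity penalty); the window handling and λ = 1 default are illustrative, not a transcript of the thesis implementation.

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC between adjacent feature windows X and Y (frames x dims).
    Compares one full-covariance Gaussian over the merged window against
    one Gaussian per window; positive values favour a speaker change."""
    Z = np.vstack([X, Y])
    n, d = Z.shape

    def logdet(A):
        # log-determinant of the ML (biased) covariance estimate
        return np.linalg.slogdet(np.cov(A, rowvar=False, bias=True))[1]

    # penalty for the extra mean + covariance parameters of the 2-model fit
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(Z)
                  - len(X) * logdet(X)
                  - len(Y) * logdet(Y)) - lam * penalty
```

Sliding this statistic along the recording and keeping local maxima above zero yields the candidate speaker change points.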
The general-purpose speech activity detector is capable of removing silences as well as audible
nonspeech such as music from a recording. The speaker clustering block supports state-of-the-art
i-vector speaker models for representing segments, which can facilitate further work in
fast cross-show diarization.
Experiments were performed on two broadcast news corpora – Indian news dataset from
NDTV and the French REPERE corpus. The NDTV corpus is a 4h15m dataset from one news
show. This dataset was manually annotated for the diarization experiments. The REPERE
dataset of 60h04m was obtained from the French ELDA.
The system is capable of performing speech activity detection without dependence on ex-
ternal training data for nonspeech and speech models. Frame energy and zero crossing rate
have been used as bootstrapping features to construct silence and music models from the audio
recording being processed. A competitive speech activity detection has been achieved with a
two-stage SAD system – a silence detection, followed by a music detection. The results are
comparable to a state-of-the-art GMM-HMM based speech activity detector which uses external
training data from a large dataset for creating nonspeech models.
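The bootstrapping step described above can be sketched in a few lines. The frame sizes, the log-energy/ZCR definitions, and the low-energy quantile used to seed the silence model are illustrative assumptions, not the thesis's exact parameters:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time log-energy and zero-crossing rate per frame - the
    bootstrapping features used to seed silence/music models."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.log(np.sum(frame ** 2) + 1e-12))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

def bootstrap_silence_mask(feats, quantile=0.2):
    """Label the lowest-energy frames as candidate silence; these frames
    seed the silence model without any external training data."""
    thresh = np.quantile(feats[:, 0], quantile)
    return feats[:, 0] <= thresh
```

The frames selected this way train the nonspeech model from the recording itself, which is what makes the SAD fully unsupervised.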
The i-vector speaker models, which are now state-of-the-art in speaker verification, provide
a low dimensional representation of the speaker information compared to traditional GMM
speaker models. They also offer a computational advantage since distance computation between
i-vectors is much faster compared to cross-likelihood based similarity computation on GMM
speaker models. Hence for real-time diarization systems, i-vectors seem more appealing.
It has been verified in this thesis, as indicated in [43], that speaker clustering is better
achieved using a global optimization approach to reach the optimum set of speaker clusters rather
than the traditional greedy optimization approach of the hierarchical agglomerative clustering
(HAC) algorithm. HAC is computationally very expensive, and an erroneous merge step during
the clustering significantly affects the later iterations, i.e., the error gets propagated. The integer
linear programming (ILP) clustering formulation, on the other hand, offers a holistic trajectory
to reach the optimum clustering. It is a graphical approach to clustering adapted from the
well-known k-centres problem in combinatorial optimization. To verify the better performance of
ILP compared to HAC, the ILP formulation was implemented for the CLR and NCLR similarity
matrices generated using the GMM speaker models, and it gives an 11% relative improvement in
error compared to the best error from the GMM-HAC clustering algorithm. In the literature,
the ILP had previously been tried only with i-vectors.
5.2 Future Work
Future work on the system development should focus on the following aspects of speaker diariza-
tion:
Refinement of the diarization output by passing it through a Viterbi decoder should be
attempted.
Cross-show diarization is the task of performing speaker clustering across different recordings
to identify segments of the same speakers in different shows. The current momentum of diarization
research is towards solving this problem for large databases. Cross-show diarization should be
attempted using the proposed MATLAB system.
Improvements to the ILP have yielded substantially faster implementations by reducing the
redundancies in the original formulation, although MATLAB does not natively support solving these
optimization problems. Solvers such as Gurobi provide support for solving advanced integer linear
programs.
It was observed during the speaker clustering that segments having background music failed
to show similarity with segments having a clean background when using the MFCC-GMM
speaker models, owing to the low SNR. Even after using background variability compensation
techniques on i-vector speaker models, the problem persists. Speech enhancement and singing
voice separation prior to parameterising the audio recording should be attempted so that music
in the background of a speaker is suppressed.
Bibliography
[1] Inside the secret technology that makes ‘the daily show’ and ‘last week tonight’ work,