Introduction to Speaker Diarization

Institute of Information Science, Academia Sinica, Taiwan


Date: 2007/08/16
Speaker: Shih-Sian Cheng

Outline

• Speaker diarization
    - Problem formulation
    - A prototypical speaker diarization system
• Speaker segmentation
    - Problem formulation
    - Speaker segmentation using a fixed-size analysis window
    - Speaker segmentation using a variable-size analysis window
        - Bottom-up segmentation using BIC
        - Top-down segmentation using BIC
• Speaker clustering
    - Problem formulation
    - Hierarchical agglomerative clustering
    - Optimization-oriented approaches
• Two leading speaker diarization systems
    - LIMSI’s system
    - Cambridge’s system

Speaker diarization (Problem formulation)

Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT03 Spring Eval.)

[Figure: speaker segmentation splits the audio stream at speaker changes; speaker clustering then labels the segments as Speaker 1, Speaker 2, Speaker 3.]

Speaker diarization (Problem formulation)

Performance measure of the speaker diarization task (C. Barras et al., 2006; NIST RT03 Spring Eval.)

Applications

Find the mapping between reference speakers and hypothesis speakers such that their overlap in time is largest. In this case, S1->A and S3->B.
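A minimal sketch of the mapping step, assuming each speaker is given as a list of (start, end) times in seconds. The function names and the toy timings are illustrative, not taken from the NIST scoring tool, and the brute-force search over one-to-one assignments is only practical for a handful of speakers.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the time overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def best_mapping(reference, hypothesis):
    """Return (mapping, total_overlap) for the one-to-one speaker mapping
    with the largest overlapped time.

    reference / hypothesis: dict of speaker label -> list of (start, end).
    """
    ref_ids = sorted(reference)
    hyp_ids = sorted(hypothesis)
    # Overlapped time between every reference and hypothesis speaker.
    table = {(r, h): sum(overlap(a, b)
                         for a in reference[r] for b in hypothesis[h])
             for r in ref_ids for h in hyp_ids}
    # Pad with None so that surplus reference speakers can stay unmapped.
    padded = hyp_ids + [None] * max(0, len(ref_ids) - len(hyp_ids))
    best, best_total = {}, -1.0
    for perm in permutations(padded, len(ref_ids)):
        total = sum(table.get((r, h), 0.0) for r, h in zip(ref_ids, perm))
        if total > best_total:
            best_total = total
            best = {r: h for r, h in zip(ref_ids, perm)
                    if h is not None and table[(r, h)] > 0}
    return best, best_total

# Toy example (made-up timings): three reference and two hypothesis speakers.
reference = {"S1": [(0, 10)], "S2": [(10, 12)], "S3": [(12, 20)]}
hypothesis = {"A": [(0, 11)], "B": [(11, 20)]}
mapping, total = best_mapping(reference, hypothesis)
```

With these toy timings S2 stays unmapped, mirroring the S1->A, S3->B case on the slide.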

Speaker diarization (Problem formulation)

Example: Automatic transcription for a broadcast news show

By speaker recognition

Speaker adaptation + speech recognition

Speaker diarization (A prototypical system)

The prototypical speaker diarization system (S. E. Tranter & D. A. Reynolds, 2006)

[Figure: block diagram of the system. Stage annotations:]
• To filter out non-speech data
• Speaker segmentation (usually over-segmentation)
• Speaker clustering
• Change boundary refinement

Speaker segmentation (Problem formulation)

Problem formulation: detect the speaker change boundaries

Performance measure

[Figure: target changes and hypothesized changes on the same time axis, with one false alarm and one miss detection marked.]

Error types: miss detection and false alarm

Performance metrics: ROC curve; F-score:

$$F = \frac{2PR}{P + R}$$

P: precision rate

R: recall rate
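A hedged sketch of how the F-score above could be computed for change detection. The tolerance of 0.5 s and the greedy one-to-one matching are assumptions made for illustration; the slides do not specify how a hypothesized change is matched to a target change.

```python
def change_detection_scores(reference, hypothesis, tolerance=0.5):
    """Return (precision, recall, f_score) for change points in seconds.

    A hypothesized change is a hit if an unmatched target change lies
    within `tolerance` seconds (assumed matching rule).
    """
    ref = sorted(reference)
    matched = set()
    hits = 0
    for h in sorted(hypothesis):
        # Closest still-unmatched target change within the tolerance, if any.
        candidates = [(abs(h - r), i) for i, r in enumerate(ref)
                      if i not in matched and abs(h - r) <= tolerance]
        if candidates:
            matched.add(min(candidates)[1])
            hits += 1
    # Unmatched hypotheses are false alarms; unmatched targets are misses.
    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(ref) if ref else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

# Toy example: 3 target changes, 4 hypothesized changes, 2 hits.
p, r, f = change_detection_scores([5.0, 12.0, 20.0], [5.2, 9.0, 19.8, 25.0])
```

In the toy example two of the four hypothesized changes are hits, so P = 1/2, R = 2/3 and F = 4/7.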

Speaker segmentation (Fixed-size analysis window approach)

Speaker segmentation using a fixed-size analysis window (Siegler et al., 1997)

[Figure: two adjacent sliding windows move along the data stream; a distance is computed between them at each position, giving a distance curve.]

Distance measure of two segments

• Kullback-Leibler (KL) distance (Siegler et al., 1997)
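The sliding-window idea can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of Siegler et al.: each window is modelled by a single diagonal-covariance Gaussian, the distance is the symmetric KL divergence KL(p||q) + KL(q||p), and the window and step sizes are arbitrary.

```python
import statistics

def gaussian_stats(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    dims = list(zip(*frames))
    means = [statistics.fmean(d) for d in dims]
    # Variance floor avoids division by zero on constant data.
    variances = [max(statistics.pvariance(d), 1e-6) for d in dims]
    return means, variances

def symmetric_kl(x, y):
    """KL(p||q) + KL(q||p) for diagonal Gaussians fitted to x and y."""
    mx, vx = gaussian_stats(x)
    my, vy = gaussian_stats(y)
    return 0.5 * sum(a / b + b / a - 2 + (m - n) ** 2 * (1 / a + 1 / b)
                     for m, a, n, b in zip(mx, vx, my, vy))

def distance_curve(frames, window=50, step=10):
    """Distance between the two adjacent windows at each boundary index."""
    return [(t, symmetric_kl(frames[t - window:t], frames[t:t + window]))
            for t in range(window, len(frames) - window + 1, step)]

# Synthetic 1-D stream whose statistics change at frame 200.
frames = [[0.0], [1.0]] * 100 + [[10.0], [11.0]] * 100
curve = distance_curve(frames)
peak = max(curve, key=lambda point: point[1])[0]
```

Change candidates are the local maxima of the curve; in practice a threshold on the peak height trades off false alarms against misses.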

Speaker segmentation (Fixed-size analysis window approach)

• SVM training error (王駿發 et al., 2005)

[Figure: two panels showing the feature vectors of segments X and Y, with less and with more overlap.]

More overlap between X and Y gives a larger SVM training error; a smaller training error means a larger distance, i.e. less similarity.

Speaker segmentation (Fixed-size analysis window approach)

• ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2000)

Bayesian information criterion (BIC) for model selection:

• Data set: $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in R^d$

• Candidate models: $M = \{M_1, M_2, \ldots, M_k\}$

• Model selection by BIC:

$$BIC(M_i) = \log pr(X \mid \hat{\Theta}_i) - \frac{1}{2} \lambda \, \#(M_i) \log n$$

$pr(X \mid \hat{\Theta}_i)$: maximum likelihood of X for model $M_i$; $\#(M_i)$: the number of parameters of $M_i$; λ = 1 in the BIC theory, but is usually tuned for trade-off between error types.
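A small sketch of BIC-based model selection, assuming the maximum log-likelihood and the parameter count of each candidate are already known. The candidate names and numbers are made up for illustration.

```python
import math

def bic(log_likelihood, num_params, n, lam=1.0):
    """BIC(M_i) = log pr(X | theta_i) - 0.5 * lambda * #(M_i) * log n."""
    return log_likelihood - 0.5 * lam * num_params * math.log(n)

def select_model(candidates, n, lam=1.0):
    """candidates: dict of name -> (max log-likelihood, number of parameters).

    Returns the name of the candidate with the largest BIC value.
    """
    return max(candidates, key=lambda name: bic(*candidates[name], n, lam))

# Hypothetical candidates fitted on n = 1000 samples.
candidates = {"M1": (-5200.0, 4), "M2": (-5100.0, 8), "M3": (-5095.0, 16)}
chosen = select_model(candidates, n=1000)
chosen_without_penalty = select_model(candidates, n=1000, lam=0.0)
```

With λ = 1 the penalty outweighs the small likelihood gain of the largest model, while λ = 0 reduces the criterion to plain maximum likelihood.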

Speaker segmentation (Fixed-size analysis window approach)

Use BIC as an inter-segment distance computation

Given two audio segments represented by feature vectors $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, these two segments can be judged as being under the same or different acoustic conditions via the following hypothesis test:

$$H_0: x_1, \ldots, x_m, y_1, \ldots, y_n \sim N(\mu, \Sigma)$$

$$H_1: x_1, \ldots, x_m \sim N(\mu_X, \Sigma_X); \quad y_1, \ldots, y_n \sim N(\mu_Y, \Sigma_Y)$$

$$\Delta BIC = BIC(H_1) - BIC(H_0) = \frac{m+n}{2} \log|\Sigma| - \frac{m}{2} \log|\Sigma_X| - \frac{n}{2} \log|\Sigma_Y| - \frac{1}{2} \lambda \, \#(H_0) \log(m+n)$$

Here $\#(H_1) - \#(H_0) = \#(H_0)$, because $H_1$ uses two Gaussians and $H_0$ one.

X and Y are judged as from the same acoustic condition if ΔBIC < 0.

[Figure: Seg X and Seg Y modelled by one Gaussian under H_0 and by two Gaussians under H_1.]

Ex:

• X and Y are from different acoustic conditions: ΔBIC ≥ 0

• X and Y are from the same acoustic condition: ΔBIC < 0
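A minimal ΔBIC sketch in pure Python. To keep it dependency-free it assumes diagonal-covariance Gaussians, so log|Σ| is the sum of the per-dimension log variances and each Gaussian has 2d parameters; the slides do not state the covariance type, and a full-covariance version would need a determinant routine.

```python
import math
import statistics

def log_det_diag(frames):
    """log|Sigma| of a diagonal-covariance ML Gaussian fitted to the frames."""
    return sum(math.log(max(statistics.pvariance(d), 1e-6))
               for d in zip(*frames))

def delta_bic(x, y, lam=1.0):
    """Delta BIC = BIC(H1) - BIC(H0) for segments x and y (lists of vectors)."""
    m, n, d = len(x), len(y), len(x[0])
    gain = 0.5 * ((m + n) * log_det_diag(x + y)
                  - m * log_det_diag(x) - n * log_det_diag(y))
    # #(H1) - #(H0) = 2d for diagonal Gaussians (d means + d variances).
    penalty = 0.5 * lam * (2 * d) * math.log(m + n)
    return gain - penalty

# Toy 1-D segments: seg_b matches seg_a, seg_c has a shifted mean.
seg_a = [[0.0], [1.0]] * 50
seg_b = [[0.0], [1.0]] * 50
seg_c = [[10.0], [11.0]] * 50
same = delta_bic(seg_a, seg_b)
different = delta_bic(seg_a, seg_c)
```

For the matching pair the likelihood gain is zero, so only the negative penalty remains; for the shifted pair the gain dominates and ΔBIC is positive.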

Speaker segmentation (Variable-size analysis window approach)

Speaker segmentation using a variable-size analysis window

Bottom-up detection using BIC (S. Chen and P. Gopalakrishnan, 1998; M. Cettolo et al., 2005)

The bottom-up detection process on an audio stream

[Figure: one-change-point detection is applied repeatedly along the audio stream, finding the change points between Seg 1, Seg 2, Seg 3, and Seg 4 one at a time.]

Speaker segmentation (Variable-size analysis window approach)

One-change-point detection using BIC

[Figure: the feature vectors in the analysis window are split into X and Y at each candidate position; ΔBIC is calculated at each feature vector, giving a ΔBIC curve.]
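One-change-point detection can be sketched as a scan over all split positions. Assumptions for illustration: diagonal-covariance Gaussians, a hypothetical `margin` that keeps both sides long enough to estimate a variance, and the rule that a change is declared at the maximum of the ΔBIC curve only if that maximum is not negative.

```python
import math
import statistics

def log_det_diag(frames):
    """log|Sigma| of a diagonal-covariance ML Gaussian fitted to the frames."""
    return sum(math.log(max(statistics.pvariance(d), 1e-6))
               for d in zip(*frames))

def delta_bic(x, y, lam=1.0):
    m, n, d = len(x), len(y), len(x[0])
    gain = 0.5 * ((m + n) * log_det_diag(x + y)
                  - m * log_det_diag(x) - n * log_det_diag(y))
    return gain - 0.5 * lam * (2 * d) * math.log(m + n)

def detect_one_change(frames, lam=1.0, margin=10):
    """Return the split index with the largest Delta BIC, or None if the
    window is judged homogeneous (largest Delta BIC below zero)."""
    scores = [(delta_bic(frames[:t], frames[t:], lam), t)
              for t in range(margin, len(frames) - margin + 1)]
    if not scores:
        return None
    value, t = max(scores)
    return t if value >= 0 else None

# Synthetic windows: one with a change at frame 60, one homogeneous.
changed = [[0.0], [1.0]] * 30 + [[10.0], [11.0]] * 30
homogeneous = [[0.0], [1.0]] * 60
found = detect_one_change(changed)
none_found = detect_one_change(homogeneous)
```

In the bottom-up process the analysis window would grow while no change is found and restart after each detected change point.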

Speaker segmentation (Variable-size analysis window approach)

Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)

The top-down detection process for an audio stream

[Figure: multiple-change-detection is applied to the whole audio stream at once, splitting it into Seg 1, Seg 2, Seg 3, and Seg 4.]

Speaker segmentation (Variable-size analysis window approach)

Multiple-change-detection using BIC

Assumption: different segments arise from different Gaussian processes

[Figure: hypotheses H0, H1, H2, and H3 for the audio stream X.]

Intuitively,

pr(X | H0) < pr(X | H1) < pr(X | H2) < pr(X | H3)

but,

BIC(X | H2) > BIC(X | H3) > BIC(X | H1) > BIC(X | H0)

Multiple-change-detection: search for the H that has the largest BIC value in the solution space

• Exhaustive search

Speaker segmentation (Variable-size analysis window approach)

• Top-down, hierarchical search (C. H. Wu and C. H. Hsieh, 2006): a sub-optimal search

[Figure: Pass 1 splits the stream X; Pass 2 splits the resulting segments again; the search terminates when no further split is accepted.]

• Dynamic programming (M. Cettolo et al., 2005): an optimal search
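The top-down, hierarchical search can be illustrated by recursive splitting. This sketch is written in the spirit of the pass-by-pass figure and is not the MDL-based algorithm of Wu and Hsieh; it assumes diagonal-covariance Gaussians, a hypothetical `margin`, and termination when the best split of a segment has negative ΔBIC.

```python
import math
import statistics

def log_det_diag(frames):
    """log|Sigma| of a diagonal-covariance ML Gaussian fitted to the frames."""
    return sum(math.log(max(statistics.pvariance(d), 1e-6))
               for d in zip(*frames))

def delta_bic(x, y, lam=1.0):
    m, n, d = len(x), len(y), len(x[0])
    gain = 0.5 * ((m + n) * log_det_diag(x + y)
                  - m * log_det_diag(x) - n * log_det_diag(y))
    return gain - 0.5 * lam * (2 * d) * math.log(m + n)

def top_down_changes(frames, start=0, lam=1.0, margin=10):
    """Recursively split at the best change point until none is accepted.

    Returns the sorted change indices relative to the full stream."""
    if len(frames) < 2 * margin:
        return []
    value, t = max((delta_bic(frames[:t], frames[t:], lam), t)
                   for t in range(margin, len(frames) - margin + 1))
    if value < 0:          # segment judged homogeneous: terminate this branch
        return []
    left = top_down_changes(frames[:t], start, lam, margin)
    right = top_down_changes(frames[t:], start + t, lam, margin)
    return left + [start + t] + right

# Synthetic stream with three segments of 60 frames each.
stream = ([[0.0], [1.0]] * 30 + [[10.0], [11.0]] * 30
          + [[20.0], [21.0]] * 30)
changes = top_down_changes(stream)
```

Each pass fixes one boundary and never revisits it, which is why this kind of search is sub-optimal compared with dynamic programming over all segmentations.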

Speaker clustering (Problem formulation)

Problem formulation: given N speech utterances from P unknown speakers, partition these utterances into M clusters, such that M = P and each cluster consists exclusively of utterances from only one speaker

[Figure: the speech utterances are partitioned into clusters, one each for Speaker 1, Speaker 2, Speaker 3, and Speaker 4.]

Cluster purity: the probability that, if we pick any utterance from a cluster twice at random, with replacement, both of the selected utterances are from the same speaker

$$\rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}, \qquad \bar{\rho} = \frac{1}{N} \sum_{m=1}^{M} n_{m*} \, \rho_m$$

Examples (three clusters):

$$\rho = \frac{6^2 + 2^2 + 1^2 + 1^2}{10^2} = 0.42, \qquad \rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25$$

P: total no. of speakers involved,
M: total no. of clusters,
$\rho_m$: purity of the m-th cluster,
$n_{m*}$: no. of utterances in the m-th cluster,
$n_{*p}$: no. of utterances from the p-th speaker,
$n_{mp}$: no. of utterances in the m-th cluster that are from the p-th speaker

The average purity $\bar{\rho}$ increases as the number of clusters increases
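A short sketch of the purity formulas, assuming the counts n_mp are given as one row per cluster. The combined three-cluster table is built from the slide's example clusters purely for illustration.

```python
def cluster_purity(counts):
    """counts[p] = n_mp, the utterances of speaker p in this cluster."""
    total = sum(counts)
    return sum(c * c for c in counts) / (total * total)

def average_purity(table):
    """table[m][p] = n_mp; each cluster purity is weighted by its size."""
    n = sum(sum(row) for row in table)
    return sum(sum(row) * cluster_purity(row) for row in table) / n

# The three example clusters, padded to four speakers each.
table = [[6, 2, 1, 1], [6, 0, 0, 0], [1, 1, 1, 1]]
purities = [cluster_purity(row) for row in table]
overall = average_purity(table)
# One utterance per cluster always gives an average purity of 1.
singletons = average_purity([[1, 0], [1, 0], [0, 1]])
```

The singleton case shows why purity alone cannot choose the number of clusters: it is maximal when every utterance is its own cluster.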


Rand Index

    speaker \ cluster    1      2      …    M      Sum
    1                    n11    n21    …    nM1    n*1
    2                    n12    n22    …    nM2    n*2
    …                    …      …      …    …      …
    P                    n1P    n2P    …    nMP    n*P
    Sum                  n1*    n2*    …    nM*    N

$$R(M) = \left[ \sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2 \right] + \left[ \sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2 \right] = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2 \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2$$

Two error types:
I: The number of utterance pairs (with replacement) in the same cluster but from different speakers
II: The number of utterance pairs (with replacement) from the same speaker but in different clusters

Type I error:

$$\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2$$

First term: the number of utterance pairs in the same cluster. Second term: the number of utterance pairs in the same cluster that are from the same speaker.

Type II error:

$$\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2$$

First term: the number of utterance pairs from the same speaker. Second term: the number of utterance pairs from the same speaker that are in the same cluster.

R(M) reaches its minimum only when M = P
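The Rand index and its two error terms follow directly from the count table. A sketch assuming `table[m][p]` holds n_mp; the three toy tables are illustrative.

```python
def rand_index(table):
    """table[m][p] = n_mp. Returns (type_I, type_II, R)."""
    same_cluster = sum(sum(row) ** 2 for row in table)        # sum_m n_m*^2
    same_speaker = sum(sum(col) ** 2 for col in zip(*table))  # sum_p n_*p^2
    both = sum(c * c for row in table for c in row)           # sum_mp n_mp^2
    type_one = same_cluster - both    # same cluster, different speakers
    type_two = same_speaker - both    # same speaker, different clusters
    return type_one, type_two, type_one + type_two

# Two speakers with 3 and 2 utterances, clustered in three different ways.
perfect = rand_index([[3, 0], [0, 2]])            # M = P, pure clusters
merged = rand_index([[3, 2]])                     # everything in one cluster
split = rand_index([[2, 0], [1, 0], [0, 2]])      # speaker 1 split in two
```

Under-clustering shows up as Type I error and over-clustering as Type II error, so unlike purity the index penalizes both directions.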

Speaker clustering (Hierarchical agglomerative clustering)

Hierarchical agglomerative clustering (S. Chen and P. Gopalakrishnan, 1998; Barras et al., 2006)

[Figure: dendrogram of the agglomerative merging of utterances X1, X2, ..., XN.]

Distance of two clusters: ΔBIC

Stopping criteria: local BIC; global BIC
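A sketch of agglomerative clustering with ΔBIC as the cluster distance. Assumptions for illustration: diagonal-covariance Gaussians, and a local stopping rule that halts as soon as the closest pair has ΔBIC ≥ 0; the global BIC criterion is not shown.

```python
import math
import statistics

def log_det_diag(frames):
    """log|Sigma| of a diagonal-covariance ML Gaussian fitted to the frames."""
    return sum(math.log(max(statistics.pvariance(d), 1e-6))
               for d in zip(*frames))

def delta_bic(x, y, lam=1.0):
    m, n, d = len(x), len(y), len(x[0])
    gain = 0.5 * ((m + n) * log_det_diag(x + y)
                  - m * log_det_diag(x) - n * log_det_diag(y))
    return gain - 0.5 * lam * (2 * d) * math.log(m + n)

def agglomerative_clustering(segments, lam=1.0):
    """Merge the closest pair of clusters until the closest pair is judged
    to come from different speakers. Returns lists of segment indices."""
    clusters = [list(s) for s in segments]
    members = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        value, a, b = min((delta_bic(clusters[i], clusters[j], lam), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
        if value >= 0:      # assumed local stopping rule
            break
        clusters[a] += clusters[b]
        members[a] += members[b]
        del clusters[b], members[b]
    return members

# Four toy segments: the first two and the last two share statistics.
segments = [[[0.0], [1.0]] * 20, [[0.0], [1.0]] * 20,
            [[10.0], [11.0]] * 20, [[10.0], [11.0]] * 20]
result = sorted(sorted(group) for group in agglomerative_clustering(segments))
```

Tuning λ shifts the stopping point: a larger λ merges more aggressively and yields fewer clusters.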

Speaker clustering (Optimization-oriented approaches)

Optimization-oriented approaches

Maximum purity clustering (W. H. Tsai et al., IEEE Trans. ASLP, 2007)

For a given number of clusters M and a set of cluster indices H = [h1, h2, …, hN] for N utterances X1, X2, …, XN, the average cluster purity is

$$\bar{\rho}(H) = \frac{1}{N} \sum_{m=1}^{M} \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}} = \frac{1}{N} \sum_{m=1}^{M} \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i, m) \, \delta(h_j, m) \, \delta(o_i, o_j)}{\sum_{i=1}^{N} \delta(h_i, m)}$$

$o_i$ is the true speaker index of utterance $X_i$ ($1 \le o_i \le P$)

$\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated.

$\delta(o_i, o_j)$ is approximated by

$$\hat{\delta}(o_i, o_j) = \begin{cases} 1, & \text{if } i = j \\ \dfrac{S(X_i, X_j)}{S(X_i, X_{\tilde{i}})}, & \text{if } i \ne j \text{ and } R[S(X_i, X_j)] \le n_{m*} \\ 0, & \text{otherwise} \end{cases}$$

$S(X_i, X_j)$: similarity between utterances $X_i$ and $X_j$

$R[S(X_i, X_j)]$: rank of inter-utterance similarity $S(X_i, X_j)$ among $S(X_i, X_1), S(X_i, X_2), \ldots, S(X_i, X_N)$ in descending order

$X_{\tilde{i}}$: utterance most similar to $X_i$, i.e., $R[S(X_i, X_{\tilde{i}})] = 2$.

[Figure: the m-th cluster with n_m = 4, containing utterances X_i and X_j.]

Speaker clustering (Optimization-oriented approaches)

Let $\hat{\bar{\rho}}(H)$ denote the estimated purity. Use Genetic Algorithm to find H* such that

$$H^* = \arg\max_{H} \hat{\bar{\rho}}(H)$$

Use BIC to determine the cluster number

Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007): performing the grouping of utterances and determining the group number within the optimization process

$$R(M) = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2 \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) + \text{constant} - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) \, \delta(o_i, o_j)$$

The constant is $\sum_{p=1}^{P} n_{*p}^2$, which does not depend on the clustering.

$\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated.

Speaker clustering (Optimization-oriented approaches)

Use Genetic Algorithm to find H* such that

$$\hat{R}(H^{(M)}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) \, \hat{\delta}(o_i, o_j)$$

$$H^* = \arg\min_{H^{(M)},\; 1 \le M \le N} \hat{R}(H^{(M)})$$

$\delta(o_i, o_j)$ is approximated by a normalized inter-utterance similarity:

$$\hat{\delta}(o_i, o_j) = \begin{cases} 1, & \text{if } i = j \\ \dfrac{S(X_i, X_j)}{S_{max}}, & \text{if } i \ne j \end{cases}$$

where

$$S(X_i, X_j) = \frac{\Pr(X_{ij} \mid \Lambda_{ij})}{\Pr(X_i \mid \Lambda_i) \, \Pr(X_j \mid \Lambda_j)}$$

(Generalized Likelihood Ratio)

$S_{max}$ is the maximum among the similarities $S(X_i, X_j)$, $i \ne j$.
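A toy sketch of the estimated Rand index and its minimization. Two deliberate substitutions are made for illustration: the similarity matrix is made up rather than computed as a generalized likelihood ratio, and an exhaustive search over all labelings replaces the Genetic Algorithm, which is only feasible for a few utterances.

```python
from itertools import product

def normalized_similarity(sim):
    """delta_hat[i][j] = 1 if i == j, else S(Xi, Xj) / S_max."""
    n = len(sim)
    s_max = max(sim[i][j] for i in range(n) for j in range(n) if i != j)
    return [[1.0 if i == j else sim[i][j] / s_max for j in range(n)]
            for i in range(n)]

def estimated_rand_index(labels, delta_hat):
    """Sum over same-cluster pairs (i, j) of 1 - 2 * delta_hat[i][j]."""
    n = len(labels)
    return sum(1 - 2 * delta_hat[i][j]
               for i in range(n) for j in range(n) if labels[i] == labels[j])

def best_partition(sim):
    """Exhaustive search over labelings (stand-in for the Genetic Algorithm).

    The number of clusters is found as part of the same search."""
    n = len(sim)
    delta_hat = normalized_similarity(sim)
    labels = min(product(range(n), repeat=n),
                 key=lambda h: estimated_rand_index(h, delta_hat))
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return sorted(groups.values())

# Made-up similarities: utterances 0 and 1 are alike, as are 2 and 3.
sim = [[0, 10, 1, 1],
       [10, 0, 1, 1],
       [1, 1, 0, 9],
       [1, 1, 9, 0]]
partition = best_partition(sim)
```

Pairs whose normalized similarity exceeds 0.5 lower the estimated index when grouped together, which is how the cluster number emerges without a separate BIC step.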

Two leading systems

LIMSI’s system (Barras et al., 2006)

[Figure: block diagram of the system. Stage annotations:]
• To remove only long regions without speech, such as silence, music, and noise, using GMMs
• Fixed-size sliding window segmentation
• Boundary refinement
• Use ΔBIC to measure the inter-cluster similarity
• Use the cross-likelihood ratio to measure the inter-cluster similarity; M_i is a MAP-adapted GMM
• To filter out short-duration silence segments that were not removed in the initial speech detection step
• Boundary refinement: align the change boundaries to silence portions

Two leading systems

Cambridge’s system (Sinha et al., 2005)

[Figure: block diagram of the system. Stages:]
• SD: speech detection
• CPD: change point detection
• IAC: iterative agglomerative clustering
• Speaker identification (SID) clustering: MAP adaptation (mean-only) is applied to each cluster from the appropriate gender/bandwidth UBM, and the cross likelihood ratio (CLR) between any two given clusters is used.

References

C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.

NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/

R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” INTERSPEECH 2005.

S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, 2006.

S. Chen and P. Gopalakrishnan, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.

C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Trans. Audio, Speech, and Language Processing, 2006.

M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based Algorithms for Audio Segmentation,” Computer Speech and Language, 2005.

M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.

P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.

王駿發, 林博川, 王家慶, 宋豪靜, “A Novel Speaker Change Detection Algorithm Based on Support Vector Machines” (in Chinese), in Proc. ROCLING 2005.

Wei-Ho Tsai, Shih-Sian Cheng, and Hsin-Min Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.

Wei-Ho Tsai and Hsin-Min Wang, “Speaker Clustering Based on Minimum Rand Index,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 2007), April 2007.


Thank You
