Institute of Information Science, Academia Sinica, Taiwan
Introduction to Speaker Diarization
Date: 2007/08/16
Speaker: Shih-Sian Cheng
Outline
Speaker diarization
    Problem formulation
    A prototypical speaker diarization system
Speaker segmentation
    Problem formulation
    Speaker segmentation using a fixed-size analysis window
    Speaker segmentation using a variable-size analysis window
        Bottom-up segmentation using BIC
        Top-down segmentation using BIC
Speaker clustering
    Problem formulation
    Hierarchical agglomerative clustering
    Optimization-oriented approaches
Two leading speaker diarization systems
    LIMSI’s system
    Cambridge’s system
ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2001)
Use ΔBIC as an inter-segment distance measure.
Given two audio segments represented by the feature vectors $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, the two segments can be judged to come from the same or from different acoustic conditions via the following hypothesis test:

$$H_0:\; x_1, \ldots, x_m, y_1, \ldots, y_n \sim N(\mu, \Sigma)$$
$$H_1:\; x_1, \ldots, x_m \sim N(\mu_X, \Sigma_X); \quad y_1, \ldots, y_n \sim N(\mu_Y, \Sigma_Y)$$

$$\Delta BIC = BIC(H_1) - BIC(H_0) = \frac{m+n}{2}\log|\Sigma| - \frac{m}{2}\log|\Sigma_X| - \frac{n}{2}\log|\Sigma_Y| - \frac{1}{2}\lambda\,\#(H_0)\log(m+n)$$

Here $\#(H_0)$ is the number of free parameters of the single Gaussian under $H_0$, and $\lambda$ is the penalty weight.
X and Y are judged to come from the same acoustic condition if ΔBIC < 0.
[Figure: Seg X and Seg Y under H_0 and H_1]
Ex:
X and Y are from different acoustic conditions: ΔBIC ≥ 0
X and Y are from the same acoustic condition: ΔBIC < 0
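The ΔBIC test can be sketched in Python as follows. This is a minimal illustration, not the cited authors' implementation: it assumes full-covariance Gaussians fitted by maximum likelihood, takes the parameter count of one Gaussian to be d + d(d+1)/2 for d-dimensional features, and uses a default penalty weight of 1. The names `log_det_cov` and `delta_bic` are hypothetical.

```python
import numpy as np

def log_det_cov(A):
    """Log-determinant of the maximum-likelihood covariance of the rows of A."""
    cov = np.atleast_2d(np.cov(A, rowvar=False, bias=True))
    sign, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC between segments X (m x d) and Y (n x d); negative means 'same condition'."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, d = X.shape
    n = Y.shape[0]
    # Assumed parameter count of one full-covariance Gaussian: mean + symmetric covariance.
    n_params = d + d * (d + 1) / 2
    gain = 0.5 * ((m + n) * log_det_cov(np.vstack([X, Y]))
                  - m * log_det_cov(X) - n * log_det_cov(Y))
    penalty = 0.5 * lam * n_params * np.log(m + n)
    return gain - penalty

# Toy data: two segments from one Gaussian, and two from well-separated Gaussians.
rng = np.random.default_rng(0)
same = delta_bic(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = delta_bic(rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2)))
print(same < 0, diff >= 0)
```

Raising `lam` makes the test more reluctant to declare a change, which is the usual way to trade missed changes against false alarms.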
Speaker segmentation (Variable-size analysis window approach)
Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)
The top-down detection process for an audio stream
[Figure: multiple-change-detection divides the audio stream into Seg 1, Seg 2, Seg 3, and Seg 4]
Multiple-change-detection using BIC
[Figure: audio stream X divided into Seg 1, Seg 2, Seg 3, and Seg 4, with candidate segmentation hypotheses H0, H1, H2, and H3]
Assumption: different segments arise from different Gaussian processes.
Intuitively, Pr(X|H0) < Pr(X|H1) < Pr(X|H2) < Pr(X|H3),
but BIC(X|H2) > BIC(X|H3) > BIC(X|H1) > BIC(X|H0).
Multiple-change-detection: search the solution space for the hypothesis H that has the largest BIC value.
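One way to picture a top-down search is a greedy recursive binary split, sketched below. Note that this is a swapped-in simplification, not the search procedure of the cited papers: instead of scoring every hypothesis in the solution space, it repeatedly places the single change point with the largest positive ΔBIC and recurses on both halves. The function names, `min_len`, and the single-Gaussian parameter count d + d(d+1)/2 are all assumptions made for illustration.

```python
import numpy as np

def log_det_cov(A):
    """Log-determinant of the maximum-likelihood covariance of the rows of A."""
    cov = np.atleast_2d(np.cov(A, rowvar=False, bias=True))
    return np.linalg.slogdet(cov)[1]

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC of 'two Gaussians' versus 'one Gaussian' for segments X and Y."""
    m, d = X.shape
    n = Y.shape[0]
    n_params = d + d * (d + 1) / 2  # assumed full-covariance Gaussian
    gain = 0.5 * ((m + n) * log_det_cov(np.vstack([X, Y]))
                  - m * log_det_cov(X) - n * log_det_cov(Y))
    return gain - 0.5 * lam * n_params * np.log(m + n)

def best_split(Z, min_len, lam):
    """Index of the split with the largest positive Delta-BIC, or None if there is none."""
    best_t, best_score = None, 0.0
    for t in range(min_len, len(Z) - min_len + 1):
        score = delta_bic(Z[:t], Z[t:], lam)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def top_down_changes(Z, offset=0, min_len=20, lam=1.0):
    """Recursively split Z; returns the detected change points in ascending order."""
    t = best_split(Z, min_len, lam)
    if t is None:
        return []
    return (top_down_changes(Z[:t], offset, min_len, lam)
            + [offset + t]
            + top_down_changes(Z[t:], offset + t, min_len, lam))

# Toy stream: three 100-frame segments drawn from different Gaussian processes.
rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(mu, 1.0, (100, 2)) for mu in (0.0, 6.0, -6.0)])
changes = top_down_changes(stream)
print(changes)
```

The recursion stops on its own once no split of a segment yields a positive ΔBIC, so the number of segments does not have to be fixed in advance.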
Speaker clustering, problem formulation: given N speech utterances from P unknown speakers, partition these utterances into M clusters such that M = P and each cluster consists exclusively of utterances from one speaker.
[Figure: speech utterances partitioned into clusters for Speaker 1, Speaker 2, Speaker 3, and Speaker 4]
Cluster purity: the probability that, if we pick an utterance from a cluster twice at random with replacement, both selected utterances are from the same speaker.

$$\rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}$$

P: total no. of speakers involved
M: total no. of clusters
ρ_m: purity of the m-th cluster
n_{m*}: no. of utterances in the m-th cluster
n_{*p}: no. of utterances from the p-th speaker
n_{mp}: no. of utterances in the m-th cluster that are from the p-th speaker

Examples:
$$\rho = \frac{6^2 + 2^2 + 1^2 + 1^2}{10^2} = 0.42, \qquad \rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25$$
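The purity definition translates directly into a few lines of Python. The sketch below is only an illustration; the function name `cluster_purity` and the list-of-counts input are assumptions. It evaluates three clusters whose per-speaker utterance counts are (6, 2, 1, 1), (6), and (1, 1, 1, 1).

```python
def cluster_purity(counts):
    """Purity of one cluster; counts[p] = utterances in the cluster from speaker p."""
    total = sum(counts)
    # Probability that two draws with replacement hit the same speaker.
    return sum(c * c for c in counts) / (total * total)

for counts in ([6, 2, 1, 1], [6], [1, 1, 1, 1]):
    print(counts, cluster_purity(counts))
```

A cluster dominated by one speaker scores close to 1, while a cluster spread evenly over k speakers scores 1/k.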
Two error types:
I: the number of utterance pairs (with replacement) in the same cluster but from different speakers
II: the number of utterance pairs (with replacement) from the same speaker but in different clusters

Type I error:
$$\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$$
First term: the number of utterance pairs from the same cluster. Second term: the number of utterance pairs from the same speaker that are in the same cluster.

Type II error:
$$\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$$
First term: the number of utterance pairs from the same speaker. Second term: the number of utterance pairs from the same speaker that are in the same cluster.
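Both error counts can be read off a cluster-by-speaker contingency table. The following sketch assumes a nested list `n` in which `n[m][p]` is the number of utterances in the m-th cluster that come from the p-th speaker; the function name `pair_errors` and the toy table are hypothetical.

```python
def pair_errors(n):
    """Return (type I, type II) pair counts for a table n[m][p] of utterance counts."""
    same_both = sum(v * v for row in n for v in row)        # same cluster and same speaker
    same_cluster = sum(sum(row) ** 2 for row in n)          # same cluster
    same_speaker = sum(sum(col) ** 2 for col in zip(*n))    # same speaker
    type_1 = same_cluster - same_both   # same cluster, different speakers
    type_2 = same_speaker - same_both   # same speaker, different clusters
    return type_1, type_2

# Two clusters, four speakers; speaker 2 is split across both clusters.
table = [[6, 2, 1, 1],
         [0, 3, 0, 0]]
print(pair_errors(table))
```

Because pairs are counted with replacement and in both orders, the two utterances of speaker 2 in the first cluster and the three in the second contribute 2 x 3 x 2 = 12 type II pairs.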
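Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007): performs the grouping of utterances and determines the number of groups within a single optimization process.

$$R(M) = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2\sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2$$

$$R(M) = \sum_{i=1}^{N}\sum_{j=1}^{N} \delta\big(h_i^{(M)}, h_j^{(M)}\big) - 2\sum_{i=1}^{N}\sum_{j=1}^{N} \delta\big(h_i^{(M)}, h_j^{(M)}\big)\,\delta(o_i, o_j) + \text{constant}$$

Here $h_i^{(M)}$ is the cluster index of the i-th utterance when M clusters are used, $o_i$ is its true speaker index, $\delta(\cdot,\cdot)$ equals 1 when its two arguments are equal and 0 otherwise, and the constant is $\sum_{p=1}^{P} n_{*p}^2$, which does not depend on the clustering.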
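The equivalence between the count form and the pairwise form of the Rand index can be checked numerically. The sketch below is illustrative only: `h` holds hypothetical cluster indices, `o` holds hypothetical true speaker labels, and both function names are assumptions.

```python
from collections import Counter

def rand_index_counts(h, o):
    """Rand index from the counts n_m*, n_*p and n_mp."""
    def sum_sq(counter):
        return sum(v * v for v in counter.values())
    return sum_sq(Counter(h)) + sum_sq(Counter(o)) - 2 * sum_sq(Counter(zip(h, o)))

def rand_index_delta(h, o):
    """Rand index from sums of indicator functions over all ordered utterance pairs."""
    N = len(h)
    pairs = [(i, j) for i in range(N) for j in range(N)]
    same_cluster = sum(h[i] == h[j] for i, j in pairs)
    both = sum(h[i] == h[j] and o[i] == o[j] for i, j in pairs)
    constant = sum(o[i] == o[j] for i, j in pairs)  # independent of the clustering h
    return same_cluster - 2 * both + constant

h = [0, 0, 0, 1, 1, 2]                 # cluster index of each utterance
o = ["a", "a", "b", "b", "b", "c"]     # true speaker of each utterance
print(rand_index_counts(h, o), rand_index_delta(h, o))
```

Only the first two sums change with the clustering, which is why the last term can be dropped during optimization.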
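δ(o_i, o_j) (the ground truth) is unknown and needs to be estimated.
δ(o_i, o_j) is approximated by a normalized inter-utterance similarity:

$$\hat{\delta}(o_i, o_j) = \begin{cases} 1, & \text{if } i = j \\ S(X_i, X_j) / S_{\max}, & \text{if } i \neq j \end{cases}$$

where S is the generalized likelihood ratio

$$S(X_i, X_j) = \frac{\Pr(X_{ij} \mid \lambda_{ij})}{\Pr(X_i \mid \lambda_i)\,\Pr(X_j \mid \lambda_j)}$$

with $X_{ij}$ the concatenation of $X_i$ and $X_j$, and $\lambda_i$, $\lambda_j$, $\lambda_{ij}$ the models trained on $X_i$, $X_j$, and $X_{ij}$.
$S_{\max}$ is the maximum among the similarities $S(X_i, X_j)$, $i \neq j$.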
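A rough sketch of this normalized similarity is given below. It is not the cited paper's implementation: for brevity every model is assumed to be a single full-covariance Gaussian fitted by maximum likelihood, and the ratio is evaluated in the log domain to avoid underflow. The names `gauss_loglik` and `estimated_delta` are hypothetical.

```python
import numpy as np

def gauss_loglik(A):
    """Log-likelihood of the rows of A under their own ML full-covariance Gaussian."""
    n, d = A.shape
    cov = np.atleast_2d(np.cov(A, rowvar=False, bias=True))
    logdet = np.linalg.slogdet(cov)[1]
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def estimated_delta(utterances):
    """Matrix of S(X_i, X_j) / S_max off the diagonal and 1 on the diagonal."""
    N = len(utterances)
    log_s = np.full((N, N), -np.inf)
    for i in range(N):
        for j in range(N):
            if i != j:
                joint = np.vstack([utterances[i], utterances[j]])
                # log generalized likelihood ratio: joint model versus separate models
                log_s[i, j] = (gauss_loglik(joint)
                               - gauss_loglik(utterances[i])
                               - gauss_loglik(utterances[j]))
    delta_hat = np.exp(log_s - log_s.max())  # divide by S_max in the log domain
    np.fill_diagonal(delta_hat, 1.0)
    return delta_hat

# Toy data: utterances 0 and 1 share a source, utterance 2 comes from another one.
rng = np.random.default_rng(0)
utts = [rng.normal(0, 1, (50, 2)), rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))]
D = estimated_delta(utts)
print(D[0, 1] > D[0, 2])
```

With raw likelihood ratios the normalized values for dissimilar pairs fall very close to zero, so a practical system may need additional scaling; that choice is outside the scope of this sketch.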
Two leading systems
LIMSI’s system (Barras et al., 2006)
Fixed-size sliding-window segmentation
Boundary refinement
Use ΔBIC to measure the inter-cluster similarity
To filter out short-duration silence segments that were not removed in the initial speech detection step
To remove only long regions without speech, such as silence, music, and noise, using GMMs
Use the cross-likelihood ratio (CLR) to measure the inter-cluster similarity. M_i is a MAP-adapted GMM.
Boundary refinement: align the change boundaries to silence portions
Two leading systems
Cambridge’s system (Sinha et al., 2005)
SD: speech detection
CPD: change point detection
IAC: iterative agglomerative clustering
Speaker identification (SID) clustering: MAP adaptation (mean-only) is applied to each cluster from the appropriate gender/bandwidth UBM. The cross-likelihood ratio (CLR) is used between any two given clusters.
Reference
C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.
NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/
R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” INTERSPEECH 2005.
S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Rich Transcription, 2006.
S. Chen and P. Gopalakrishnan, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Transactions on Audio, Speech and Language Processing, 2006.
M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based Algorithms for Audio Segmentation,” Computer Speech and Language, 2005.
M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.
P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.
王駿發, 林博川, 王家慶, 宋豪靜, “A Novel Speaker Change Detection Algorithm Based on Support Vector Machines” (in Chinese), in Proc. ROCLING 2005.
Wei-Ho Tsai, Shih-Sian Cheng, and Hsin-Min Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.
Wei-Ho Tsai and Hsin-Min Wang, “Speaker Clustering Based on Minimum Rand Index,” IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 2007), April 2007.