Institute of Information Science, Academia Sinica, Taiwan
Speaker Verification via Kernel Methods
Speaker: Yi-Hsiang Chao
Advisor: Hsin-Min Wang
Dec 27, 2015
OUTLINE
Current Methods for Speaker Verification
Proposed Methods for Speaker Verification
Kernel Methods for Speaker Verification
Experiments
Conclusions
What is speaker verification?
Goal: to determine whether a speaker is who he or she claims to be.
Speaker verification is a hypothesis testing problem.
Given an input utterance U, two hypotheses have to be considered:
H0: U is from the target speaker. (the null hypothesis)
H1: U is not from the target speaker. (the alternative hypothesis)
The Likelihood Ratio (LR) test:

$$L(U) = \log \frac{p(U \mid H_0)}{p(U \mid H_1)} \;\begin{cases} \geq \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \text{ (i.e., reject } H_0\text{)} \end{cases} \tag{1}$$

Mathematically, H0 and H1 can be represented by parametric models, denoted as $\lambda$ and $\bar{\lambda}$, respectively. $\bar{\lambda}$ is often called an anti-model.
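As a minimal sketch of the decision rule in Eq. (1): assuming two hypothetical scorer callables (e.g., GMM log-likelihood functions) are available, the test reduces to thresholding a difference of log-likelihoods.

```python
def lr_test(utterance, log_lik_target, log_lik_anti, theta=0.0):
    """Likelihood Ratio test of Eq. (1).

    log_lik_target and log_lik_anti are assumed callables returning
    log p(U | H0) and log p(U | H1), e.g., GMM log-likelihood scorers.
    Returns True when H0 (the claimed identity) is accepted.
    """
    L = log_lik_target(utterance) - log_lik_anti(utterance)
    return L >= theta  # accept H0 iff L(U) >= threshold theta
```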
Current Methods for Speaker Verification

$\bar{\lambda}$ is usually ill-defined, since H1 does not involve any specific speaker and thus lacks explicit data for modeling.
Many approaches have been proposed in attempts to characterize H1.
One simple approach is to train a single speaker-independent model $\bar{\lambda}$, named the world model or the Universal Background Model (UBM) [D. A. Reynolds, et al., 2000]:
• The training data are collected from a large number of speakers, generally unrelated to the clients.

$$L_1(U) = \log p(U \mid \lambda) - \log p(U \mid \bar{\lambda}).$$
Current Methods for Speaker Verification

Instead of using a single model, an alternative way is to train a set of cohort models {λ_1, λ_2, …, λ_B}. This gives the following possibilities in computing the LR:

Picking the likelihood of the most competitive model [A. Higgins, et al., 1991]:

$$L_2(U) = \log p(U \mid \lambda) - \max_{1 \leq i \leq B} \log p(U \mid \lambda_i).$$

Averaging the likelihoods of the B cohort models arithmetically [D. A. Reynolds, 1995]:

$$L_3(U) = \log p(U \mid \lambda) - \log \left( \frac{1}{B} \sum_{i=1}^{B} p(U \mid \lambda_i) \right).$$

Averaging the likelihoods of the B cohort models geometrically [C. S. Liu, et al., 1996]:

$$L_4(U) = \log p(U \mid \lambda) - \frac{1}{B} \sum_{i=1}^{B} \log p(U \mid \lambda_i).$$
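As a sketch, all three cohort-based measures are simple operations on the per-model log-likelihoods. Here `ll_client` (= log p(U|λ)) and the array `ll_cohort` (= log p(U|λ_i), i = 1..B) are assumed to be precomputed inputs:

```python
import numpy as np
from scipy.special import logsumexp

def cohort_lr_measures(ll_client, ll_cohort):
    """Compute L2, L3, L4 from cohort log-likelihoods (L1 uses the UBM instead).

    ll_client: scalar, log p(U | lambda).
    ll_cohort: array of log p(U | lambda_i), i = 1..B.
    """
    B = len(ll_cohort)
    L2 = ll_client - np.max(ll_cohort)                    # most competitive model
    L3 = ll_client - (logsumexp(ll_cohort) - np.log(B))   # log of the arithmetic mean
    L4 = ll_client - np.mean(ll_cohort)                   # geometric mean, in the log domain
    return L2, L3, L4
```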
Current Methods for Speaker Verification

Selection of the cohort set
Two cohort selection methods [D. A. Reynolds, 1995] are used:
• One selects the B closest speakers to each client (used, for example, by L2, L3, and L4).
• The other selects the B/2 closest speakers to, plus the B/2 farthest speakers from, each client (used, for example, by L3).
The selection is based on the speaker distance measure [D. A. Reynolds, 1995], computed by

$$d(\lambda_i, \lambda_j) = \log \frac{p(X_i \mid \lambda_i)}{p(X_i \mid \lambda_j)} + \log \frac{p(X_j \mid \lambda_j)}{p(X_j \mid \lambda_i)},$$

where $\lambda_i$ and $\lambda_j$ are speaker models trained using the i-th speaker's training utterances $X_i$ and the j-th speaker's training utterances $X_j$, respectively.
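A small sketch of this distance, assuming a hypothetical `log_lik(model, X)` helper that returns the total log-likelihood log p(X|λ) of a speaker's training utterances under a model:

```python
def speaker_distance(log_lik, lam_i, lam_j, X_i, X_j):
    """d(lambda_i, lambda_j) of [D. A. Reynolds, 1995].

    log_lik(model, X) is an assumed callable returning log p(X | model).
    The measure is symmetric and equals 0 when lam_i == lam_j.
    """
    return ((log_lik(lam_i, X_i) - log_lik(lam_j, X_i)) +
            (log_lik(lam_j, X_j) - log_lik(lam_i, X_j)))
```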
Current Methods for Speaker Verification

The Null Hypothesis Characterization
The client model λ is represented by a Gaussian Mixture Model (GMM):
• λ can be trained via the ML criterion by using the Expectation-Maximization (EM) algorithm.
• λ can be derived from the UBM using MAP adaptation (the adapted GMM) [D. A. Reynolds, et al., 2000].
The adapted GMM combined with the L1 measure is what we term the GMM-UBM system.
Currently, GMM-UBM is the state-of-the-art approach.
• This method is appropriate for the Text-Independent (TI) task.
• Advantage: it covers unseen data.
Proposed Methods for Speaker Verification

Motivation: none of the LR measures developed so far has proved to be absolutely superior to the others across tasks and applications.
We propose two perspectives in an attempt to better characterize the ill-defined alternative hypothesis:
Perspective 1: optimal combination of the existing LRs.
Perspective 2: the design of a novel alternative hypothesis characterization.
Perspective 1: The Proposed Combined LR (ICPR2006)

The pros and cons of different LR measures motivate us to combine them into a unified framework, by virtue of the complementary information that each LR can contribute.
Given N different LR measures L_i(U), i = 1, 2, …, N, we define a combined LR measure L_0(U) by

$$L_0(U) = w_1 L_1(U) + w_2 L_2(U) + \cdots + w_N L_N(U) + b = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}) \;\begin{cases} \geq 0 & \text{accept } H_0 \\ < 0 & \text{accept } H_1, \end{cases} \tag{2}$$

where x = [L_1(U), L_2(U), …, L_N(U)]^T is an N × 1 vector in the space R^N, w = [w_1, w_2, …, w_N]^T is an N × 1 weight vector, and b is a bias.
Linear Discriminant Classifier

L_0(U) = w^T x + b = f(x) forms a so-called linear discriminant classifier.
This classifier translates the goal of solving an LR measure into the optimization of w and b, such that the utterances of clients and impostors can be separated.
To realize this classifier, three distinct data sets are needed:
• One for generating each client's model.
• One for generating each client's anti-models.
• One for optimizing w and b.
Linear Discriminant Classifier

The bias b actually plays the same role as the decision threshold θ of the LR defined in Eq. (1); it can be determined through a trade-off between false acceptance and false rejection.

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b \;\begin{cases} \geq 0 & \text{accept } H_0 \\ < 0 & \text{accept } H_1 \end{cases}$$

The main goal here is to find w.
f(x) can be solved via linear discriminant training algorithms, such as:
• Fisher's Linear Discriminant (FLD).
• Linear Support Vector Machine (Linear SVM).
• Perceptron.
Linear Discriminant Classifier

Using Fisher's Linear Discriminant (FLD)
Suppose the i-th class has n_i data samples $X_i = \{\mathbf{x}_1^i, \ldots, \mathbf{x}_{n_i}^i\}$, i = 1, 2.
The goal of FLD is to seek a direction w such that the following Fisher's criterion function J(w) is maximized:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w \mathbf{w}},$$

where S_b and S_w are, respectively, the between-class scatter matrix and the within-class scatter matrix, defined as

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T,$$

$$\mathbf{S}_w = \sum_{i=1,2} \sum_{\mathbf{x} \in X_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T,$$

where $\mathbf{m}_i = \frac{1}{n_i}\sum_{s=1}^{n_i}\mathbf{x}_s^i$ is the mean vector of the i-th class.
Linear Discriminant Classifier

Using Fisher's Linear Discriminant (FLD)
The solution for w, which maximizes Fisher's criterion J(w), is the leading eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$.
Since $\mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$ always lies in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, w can be directly calculated as

$$\mathbf{w} = \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2). \tag{3}$$
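A minimal NumPy sketch of the closed-form solution in Eq. (3), assuming the two classes are given as row-sample matrices; a small ridge term is added since S_w can be singular in practice:

```python
import numpy as np

def fld_direction(X1, X2, ridge=1e-6):
    """w = S_w^{-1} (m1 - m2), the closed-form FLD solution of Eq. (3).

    X1, X2: (n_i x d) arrays of samples from class 1 and class 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = X1.shape[1]
    Sw = np.zeros((d, d))
    for X, m in ((X1, m1), (X2, m2)):
        D = X - m
        Sw += D.T @ D                        # within-class scatter S_w
    Sw += ridge * np.eye(d)                  # regularization: S_w may be singular
    return np.linalg.solve(Sw, m1 - m2)      # direction maximizing J(w)
```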
Analysis of the Alternative Hypothesis

The LR approaches that have been proposed to characterize H1 can be collectively expressed in the following general form:

$$L(U) = \log \frac{p(U \mid \lambda)}{F\bigl(p(U \mid \lambda_1), p(U \mid \lambda_2), \ldots, p(U \mid \lambda_N)\bigr)}, \tag{4}$$

where F(·) is some function of the likelihood values from a set of so-called background models {λ_1, λ_2, …, λ_N}.
For example, F(·) can be the average function for L_3(U), the maximum for L_2(U), or the geometric mean for L_4(U), and the background model set here can be obtained from a cohort. A special case arises when F(·) is an identity function and N = 1; in this instance, a single background model is used for L_1(U).
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

We redesign the function F(·) as

$$F(\mathbf{u}) = \left( p(U \mid \lambda_1)^{\alpha_1} \, p(U \mid \lambda_2)^{\alpha_2} \cdots p(U \mid \lambda_N)^{\alpha_N} \right)^{\frac{1}{\alpha_1 + \alpha_2 + \cdots + \alpha_N}}, \tag{5}$$

where u = [p(U|λ_1), p(U|λ_2), …, p(U|λ_N)]^T is an N × 1 vector and $\alpha_i$ is the weight of the likelihood p(U | λ_i), i = 1, 2, …, N.
This function gives the N background models different weights according to their individual contributions to the alternative hypothesis.
It is clear that Eq. (5) is equivalent to a geometric mean function when $\alpha_i = 1$, i = 1, 2, …, N.
It is also clear that Eq. (5) reduces to a maximum function when $\alpha_{i^*} = 1$ for $i^* = \arg\max_{1 \leq i \leq N} \log p(U \mid \lambda_i)$ and $\alpha_i = 0$ for all $i \neq i^*$.
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

By substituting Eq. (5) into Eq. (4) and letting $w_i = \alpha_i / (\alpha_1 + \alpha_2 + \cdots + \alpha_N)$, so that $\sum_{i=1}^{N} w_i = 1$, we obtain

$$\begin{aligned}
L(U) &= \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_1)^{w_1} \, p(U \mid \lambda_2)^{w_2} \cdots p(U \mid \lambda_N)^{w_N}} \\
&= \log \left[ \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}\right)^{w_1} \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}\right)^{w_2} \cdots \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)}\right)^{w_N} \right] \\
&= w_1 \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_1)} + w_2 \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_2)} + \cdots + w_N \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \\
&= \mathbf{w}^T \mathbf{x} \;\begin{cases} \geq \theta & \text{accept} \\ < \theta & \text{reject,} \end{cases}
\end{aligned} \tag{6}$$

where $\mathbf{w} = [w_1, w_2, \ldots, w_N]^T$ is an N × 1 weight vector and x is an N × 1 vector in the space R^N, expressed by

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \right]^T. \tag{7}$$
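A sketch of building the characteristic vector of Eq. (7) for one utterance, assuming the client and background log-likelihoods have already been computed:

```python
import numpy as np

def characteristic_vector(ll_client, ll_background):
    """x of Eq. (7): elementwise log-ratio against the N background models.

    ll_client: scalar, log p(U | lambda).
    ll_background: array of log p(U | lambda_i), i = 1..N.
    Returns an (N,) vector x with x[i] = log p(U|lambda) - log p(U|lambda_i).
    """
    return ll_client - np.asarray(ll_background)
```

The score of Eq. (6) is then simply `w @ x` for a learned weight vector `w`.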
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

The implicit idea in Eq. (7) is that the speech utterance U can be represented by a characteristic vector x.
If we replace the threshold θ in Eq. (6) with a bias b, the equation can be rewritten as

$$L(U) = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}), \tag{8}$$

analogous to the combined LR method in Eq. (2). f(x) in Eq. (8) again forms a linear discriminant classifier, which can be solved via linear discriminant training algorithms, such as FLD.
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

Relation to Perspective 1: the combined LR measure
If the anti-models $\{\bar{\lambda}_1, \bar{\lambda}_2, \ldots, \bar{\lambda}_N\}$ are used instead of the background models for the characteristic vector x defined in Eq. (7), i.e.,

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_N)} \right]^T,$$

we obtain

$$f(\mathbf{x}) = w_1 \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_1)} + \cdots + w_N \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_N)} + b = w_1 L_1(U) + \cdots + w_N L_N(U) + b.$$

f(x) forms a linear combination of N different LR measures, which is the same as the combined LR measure.
Kernel Methods for Speaker Verification

f(x) = w^T x + b can be solved via linear discriminant training algorithms. However, such methods are based on the assumption that the observed data of different classes are linearly separable, which is not the case for most practical, nonlinearly separable data.
From this point of view, we hope that data from different classes, which are not linearly separable in the original input space R^N, can be separated linearly in some implicit higher dimensional (maybe infinite) feature space F via a nonlinear mapping Φ.
Let Φ(x) denote a vector obtained by mapping x from R^N to F. f(x) can then be re-defined as

$$f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b, \tag{9}$$

which constitutes a linear discriminant classifier in F.
Kernel Methods for Speaker Verification

In practice, it is difficult to determine what kind of mapping Φ would be applicable; therefore, the computation of Φ(x) can be infeasible.
We propose using the kernel method: it characterizes the relationship between the data samples in F, instead of computing Φ(x) directly.
This is achieved by introducing a kernel function

$$k(\mathbf{x}, \mathbf{y}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle, \tag{10}$$

which is the inner product of two vectors Φ(x) and Φ(y) in F.
Kernel Methods for Speaker Verification

The kernel function k(·) must be symmetric, positive definite, and conform to Mercer's condition. For example:
• The dot product kernel: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$.
• The d-th degree polynomial kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^d$.
• The Radial Basis Function (RBF) kernel: $k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$, where σ is a tunable parameter.
Existing kernel-based classification techniques can be applied to implement f(x) = w^T Φ(x) + b, such as:
• Support Vector Machine (SVM).
• Kernel Fisher Discriminant (KFD).
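Minimal NumPy implementations of the three kernels listed above, for single vectors x and y:

```python
import numpy as np

def dot_kernel(x, y):
    """k(x, y) = x^T y."""
    return x @ y

def poly_kernel(x, y, d=2):
    """k(x, y) = (x^T y + 1)^d, the d-th degree polynomial kernel."""
    return (x @ y + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is tunable."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```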
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Techniques based on SVM have been successfully applied to many classification and regression tasks.
Conventional LR:

$$L_1(U) = \log p(U \mid \lambda) - \log p(U \mid \bar{\lambda}) \;\begin{cases} \geq \theta & \text{accept} \\ < \theta & \text{reject} \end{cases}$$

If the probabilities were perfectly estimated (which is usually not the case), the Bayes decision rule would be the optimal decision.
However, a better solution should in theory be to use a discriminant framework [V. N. Vapnik, 1995].
[S. Bengio, et al., 2001] argued that the probability estimates are not perfect, and proposed the better-behaved version

$$a_1 \log p(U \mid \lambda) - a_2 \log p(U \mid \bar{\lambda}) + b,$$

where a_1, a_2, and b are adjustable parameters estimated using an SVM.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
[S. Bengio, et al., 2001] incorporated the two scores obtained from the GMM and the UBM with an SVM:

$$a_1 \log p(U \mid \lambda) - a_2 \log p(U \mid \bar{\lambda}) + b = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}),$$

where $\mathbf{w} = [a_1, a_2]^T$ and $\mathbf{x} = [\log p(U \mid \lambda), \; -\log p(U \mid \bar{\lambda})]^T$.
Compared with our approach:
• [S. Bengio, et al., 2001] used only one simple background model, the UBM, as the alternative hypothesis characterization.
• Our approach integrates multiple background models into the alternative hypothesis characterization in a more effective and robust way:

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \right]^T.$$
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
The goal of SVM is to seek a separating hyperplane in the feature space F that maximizes the margin between classes.

[Figure: two candidate separating hyperplanes in a 2-D space, panels (a) and (b); panel (b) shows the optimal hyperplane with its optimal margin and support vectors, and has a greater separation distance than (a).]
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Following the theory of SVM, w can be expressed as

$$\mathbf{w} = \sum_{j=1}^{l} \alpha_j y_j \Phi(\mathbf{x}_j),$$

which, substituted into the original form $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$, yields

$$f(\mathbf{x}) = \sum_{j=1}^{l} \alpha_j y_j k(\mathbf{x}, \mathbf{x}_j) + b,$$

where each training sample x_j belongs to one of the two classes identified by the label $y_j \in \{+1, -1\}$, j = 1, 2, …, l.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Let $\boldsymbol{\alpha}^T = [\alpha_1, \alpha_2, \ldots, \alpha_l]$. Our goal now changes from finding w to finding α.
We can find the coefficients $\alpha_j$ by maximizing the objective function

$$Q(\boldsymbol{\alpha}) = \sum_{j=1}^{l} \alpha_j - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j),$$

subject to the constraints

$$\sum_{j=1}^{l} y_j \alpha_j = 0 \quad\text{and}\quad 0 \leq \alpha_j \leq C, \;\forall j,$$

where C is a penalty parameter.
The above optimization problem can be solved using quadratic programming techniques.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Note that most $\alpha_j$ are equal to zero; the training samples with non-zero $\alpha_j$ are called support vectors.
A few support vectors act as the key to deciding the optimal margin between classes in the SVM.
An SVM with a dot product kernel function, i.e., $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$, is known as a linear SVM.

[Figure: the optimal hyperplane and its optimal margin in a 2-D space, with the support vectors lying on the margin.]
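As a sketch, the whole dual optimization can be handled by any off-the-shelf SVM package; here scikit-learn (one possible choice, not necessarily what the original work used) learns the α_j, the support vectors, and b from labeled characteristic vectors. The data below are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed training data: X holds characteristic vectors (Eq. (7)) of client
# and impostor utterances; y holds the labels y_j in {+1, -1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 21))            # placeholder: 200 vectors, B+1 = 21 dims
y = np.where(rng.random(200) < 0.5, 1, -1)    # placeholder labels

svm = SVC(kernel="rbf", gamma=0.5, C=1.0)     # RBF kernel; C is the penalty parameter
svm.fit(X, y)

score = svm.decision_function(X[:1])          # f(x) = sum_j alpha_j y_j k(x, x_j) + b
accept_H0 = score >= 0                        # decision as in Eq. (2)/(8)
print(len(svm.support_), "support vectors")   # training samples with non-zero alpha_j
```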
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)
Alternatively, $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$ can be solved with KFD. In fact, the purpose of KFD is to apply FLD in the feature space F; we again need to maximize Fisher's criterion

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b^{\Phi} \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w^{\Phi} \mathbf{w}},$$

where $\mathbf{S}_b^{\Phi}$ and $\mathbf{S}_w^{\Phi}$ are, respectively, the between-class and the within-class scatter matrices in F, i.e.,

$$\mathbf{S}_b^{\Phi} = (\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})(\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})^T,$$

$$\mathbf{S}_w^{\Phi} = \sum_{i=1,2}\sum_{\mathbf{x}\in X_i} (\Phi(\mathbf{x}) - \mathbf{m}_i^{\Phi})(\Phi(\mathbf{x}) - \mathbf{m}_i^{\Phi})^T,$$

where $\mathbf{m}_i^{\Phi} = \frac{1}{n_i}\sum_{s=1}^{n_i}\Phi(\mathbf{x}_s^i)$ is the mean vector of the i-th class in F.
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)
Let $X_1 \cup X_2 = \{\mathbf{x}_1^1, \ldots, \mathbf{x}_{n_1}^1\} \cup \{\mathbf{x}_1^2, \ldots, \mathbf{x}_{n_2}^2\} = \{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ and $l = n_1 + n_2$.
According to the theory of reproducing kernels, the solution of w must lie in the span of all training data samples mapped in F, so w can be expressed as

$$\mathbf{w} = \sum_{j=1}^{l} \alpha_j \Phi(\mathbf{x}_j).$$

Accordingly, $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$ can be re-written as

$$f(\mathbf{x}) = \sum_{j=1}^{l} \alpha_j k(\mathbf{x}, \mathbf{x}_j) + b.$$

Let $\boldsymbol{\alpha}^T = [\alpha_1, \alpha_2, \ldots, \alpha_l]$. Our goal therefore changes from finding w to finding α, which maximizes

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{M}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{N}\boldsymbol{\alpha}}.$$
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)

$$\mathbf{M} = (\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2)(\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2)^T,$$

$$\mathbf{N} = \sum_{i=1,2} \mathbf{K}_i (\mathbf{I}_{n_i} - \mathbf{1}_{n_i}) \mathbf{K}_i^T,$$

where $\boldsymbol{\eta}_i$ is an l × 1 vector with $(\boldsymbol{\eta}_i)_j = \frac{1}{n_i}\sum_{s=1}^{n_i} k(\mathbf{x}_j, \mathbf{x}_s^i)$, $\mathbf{K}_i$ is an l × n_i matrix with $(\mathbf{K}_i)_{js} = k(\mathbf{x}_j, \mathbf{x}_s^i)$, $\mathbf{I}_{n_i}$ is an n_i × n_i identity matrix, and $\mathbf{1}_{n_i}$ is an n_i × n_i matrix with all entries 1/n_i.
The solution for α is analogous to the FLD solution $\mathbf{w} = \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ in Eq. (3):

$$\boldsymbol{\alpha} = \mathbf{N}^{-1}(\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2),$$

which is also the leading eigenvector of $\mathbf{N}^{-1}\mathbf{M}$.
Experiments: Formation of the Characteristic Vector

In our methods, we use B+1 background models, consisting of
• B cohort set models,
• one world model $\bar{\lambda}$,
to form the characteristic vector x. Two cohort selection methods are used in the experiments:
• B closest speakers;
• B/2 closest speakers + B/2 farthest speakers,
yielding the following two (B+1) × 1 characteristic vectors:

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,B})} \right]^T,$$

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,B/2})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{fst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{fst}\,B/2})} \right]^T,$$

where $\lambda_{\mathrm{cst}\,i}$ and $\lambda_{\mathrm{fst}\,i}$ are, respectively, the i-th closest model and the i-th farthest model of the client model λ.
Experiments

Detection Cost Function (DCF)
The NIST Detection Cost Function (DCF) reflects the performance at a single operating point on the DET curve. The DCF is defined as

$$C_{DET} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm} \cdot (1 - P_{Target}),$$

where
• $P_{Miss}$ and $P_{FalseAlarm}$ are the miss probability and the false-alarm probability, respectively;
• $C_{Miss}$ and $C_{FalseAlarm}$ are the respective relative costs of detection errors;
• $P_{Target}$ is the a priori probability of the specific target speaker.
A special case of the DCF is known as the Half Total Error Rate (HTER), where $C_{Miss}$ and $C_{FalseAlarm}$ are both equal to 1 and $P_{Target} = 0.5$, i.e.,

$$\mathrm{HTER} = \frac{1}{2}(P_{Miss} + P_{FalseAlarm}).$$
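Both operating-point costs are one-liners; the defaults below use the ISCSLP2006-SRE settings quoted later in these slides (C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.05):

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.05):
    """NIST Detection Cost Function at a single operating point."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def hter(p_miss, p_fa):
    """Half Total Error Rate: the DCF with C_Miss = C_FA = 1 and P_Target = 0.5."""
    return 0.5 * (p_miss + p_fa)
```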
Experiments: XM2VTSDB

Table 1. Configuration of the XM2VTSDB speech database.

  Session   Shot   199 clients   25 impostors   69 impostors
  1         1, 2   Training      Evaluation     Test
  2         1, 2   Training      Evaluation     Test
  3         1, 2   Evaluation    Evaluation     Test
  4         1, 2   Test          Evaluation     Test

“Training” subset: used to build each individual client's model and anti-models.
“Evaluation” subset: used to estimate α, w, and b.
“Test” subset: used for the performance evaluation.

The three utterance texts are:
1. “0 1 2 3 4 5 6 7 8 9”.
2. “5 0 6 9 2 8 1 3 7 4”.
3. “Joe took father's green shoe bench out”.
Experimental results (ICPR2006)

XM2VTSDB, for perspective 1: the proposed combined LR.

[Figure 1. Baselines vs. the combined LRs: DET curves for the “Test” subset.]

Further analysis of the results via the equal error rate (EER) showed that KFD (EER = 4.6%) achieved a 13.2% relative improvement over the 5.3% EER of L3(U).
Experimental results (submitted to ISCSLP2006)

XM2VTSDB, for perspective 2: the novel alternative hypothesis characterization.

Table 2. HTERs for the “Evaluation” and “Test” subsets (the XM2VTSDB task).

  System           min HTER for “Evaluation”   HTER for “Test”
  L1               0.0633                       0.0519
  L2_20c           0.0776                       0.0635
  L3_20c           0.0676                       0.0535
  L3_10c_10f       0.0589                       0.0515
  L4_20c           0.0734                       0.0583
  KFD_w_20c        0.0247                       0.0357
  SVM_w_20c        0.0320                       0.0414
  KFD_w_10c_10f    0.0232                       0.0389
  SVM_w_10c_10f    0.0310                       0.0417

A 30.68% relative improvement was achieved by KFD_w_20c, compared to L3_10c_10f, the best baseline system.
Experimental results (submitted to ISCSLP2006)

XM2VTSDB, for perspective 2: the proposed novel alternative hypothesis characterization.

[Figure 2. Best baselines vs. our proposed LRs: DET curves for the “Test” subset.]
Evaluation on the ISCSLP2006-SRE database

For perspective 2: the proposed novel alternative hypothesis characterization, in the text-independent speaker verification task.

Table 3. DCFs for the “Evaluation” and “Test” subsets (the ISCSLP2006-SRE task), with

$$C_{DET} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm} \cdot (1 - P_{Target}),$$

where $C_{Miss} = 10$, $C_{FalseAlarm} = 1$, and $P_{Target} = 0.05$.

  System             min DCF for “Evaluation”   DCF for “Test”
  GMM-UBM (1024m)    0.0129                     0.0179
  KFD_w_50c_50f      0.0067                     0.0118

We observe that KFD_w_50c_50f achieved a 34.08% relative improvement over GMM-UBM.
Evaluation on the ISCSLP2006-SRE database

We participated in the text-independent speaker verification task of the ISCSLP2006 Speaker Recognition Evaluation (SRE) plan.
The evaluation results are as follows:

  I2R-SDPG_sg     Actual DCF = 0.90
  SINICA-IIS_tw   Actual DCF = 1.18
  CUHK-EE_hk      Actual DCF = 2.77
  THU-EE_cn       Actual DCF = 2.85
  EPITA_fr_1      Actual DCF = 3.92
  EPITA_fr_2      Actual DCF = 4.46
Conclusions

We have introduced current LR systems for speaker verification, and presented two proposed LR systems:
• The combined LR system.
• The new LR system with the novel alternative hypothesis characterization.
Both proposed LR systems can be formulated as a linear or non-linear discriminant classifier. Non-linear classifiers can be implemented using kernel methods:
• Kernel Fisher Discriminant (KFD).
• Support Vector Machine (SVM).
Experiments were conducted on two speaker verification tasks:
• The XM2VTSDB task.
• The ISCSLP2006-SRE task.
The results demonstrate the superiority of our methods over conventional approaches.