Institute of Information Science, Academia Sinica, Taiwan
Speaker Verification via Kernel Methods
Speaker: Yi-Hsiang Chao
Advisor: Hsin-Min Wang
Dec 27, 2015
OUTLINE
Current Methods for Speaker Verification
Proposed Methods for Speaker Verification
Kernel Methods for Speaker Verification
Experiments
Conclusions
What is speaker verification?
Goal: to determine whether a speaker is who he or she claims to be.
Speaker verification is a hypothesis testing problem.
Given an input utterance U, two hypotheses have to be considered:
H0: U is from the target speaker. (the null hypothesis)
H1: U is not from the target speaker. (the alternative hypothesis)
The Likelihood Ratio (LR) test:

$$L(U) = \log \frac{p(U \mid H_0)}{p(U \mid H_1)} \;\begin{cases} \geq \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \text{ (i.e., reject } H_0\text{)} \end{cases} \tag{1}$$

Mathematically, H0 and H1 can be represented by parametric models, denoted as $\lambda$ and $\bar{\lambda}$, respectively. $\bar{\lambda}$ is often called an anti-model.
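As a minimal sketch of the decision rule in Eq. (1): assuming two hypothetical scorer callables (e.g., GMM log-likelihood functions) are available, the test reduces to thresholding a difference of log-likelihoods.

```python
def lr_test(utterance, log_lik_target, log_lik_anti, theta=0.0):
    """Likelihood Ratio test of Eq. (1).

    log_lik_target and log_lik_anti are assumed callables returning
    log p(U | H0) and log p(U | H1), e.g., GMM log-likelihood scorers.
    Returns True when H0 (the claimed identity) is accepted.
    """
    L = log_lik_target(utterance) - log_lik_anti(utterance)
    return L >= theta  # accept H0 iff L(U) >= threshold theta
```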
Current Methods for Speaker Verification

$\bar{\lambda}$ is usually ill-defined, since H1 does not involve any specific speaker and thus lacks explicit data for modeling.
Many approaches have been proposed in attempts to characterize H1.
One simple approach is to train a single speaker-independent model $\bar{\lambda}$, named the world model or the Universal Background Model (UBM) [D. A. Reynolds, et al., 2000]:
• The training data are collected from a large number of speakers, generally unrelated to the clients.

$$L_1(U) = \log p(U \mid \lambda) - \log p(U \mid \bar{\lambda}).$$
Current Methods for Speaker Verification

Instead of using a single model, an alternative way is to train a set of cohort models {λ_1, λ_2, …, λ_B}. This gives the following possibilities in computing the LR:

Picking the likelihood of the most competitive model [A. Higgins, et al., 1991]:

$$L_2(U) = \log p(U \mid \lambda) - \max_{1 \leq i \leq B} \log p(U \mid \lambda_i).$$

Averaging the likelihoods of the B cohort models arithmetically [D. A. Reynolds, 1995]:

$$L_3(U) = \log p(U \mid \lambda) - \log \left( \frac{1}{B} \sum_{i=1}^{B} p(U \mid \lambda_i) \right).$$

Averaging the likelihoods of the B cohort models geometrically [C. S. Liu, et al., 1996]:

$$L_4(U) = \log p(U \mid \lambda) - \frac{1}{B} \sum_{i=1}^{B} \log p(U \mid \lambda_i).$$
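As a sketch, all three cohort-based measures are simple operations on the per-model log-likelihoods. Here `ll_client` (= log p(U|λ)) and the array `ll_cohort` (= log p(U|λ_i), i = 1..B) are assumed to be precomputed inputs:

```python
import numpy as np
from scipy.special import logsumexp

def cohort_lr_measures(ll_client, ll_cohort):
    """Compute L2, L3, L4 from cohort log-likelihoods (L1 uses the UBM instead).

    ll_client: scalar, log p(U | lambda).
    ll_cohort: array of log p(U | lambda_i), i = 1..B.
    """
    B = len(ll_cohort)
    L2 = ll_client - np.max(ll_cohort)                    # most competitive model
    L3 = ll_client - (logsumexp(ll_cohort) - np.log(B))   # log of the arithmetic mean
    L4 = ll_client - np.mean(ll_cohort)                   # geometric mean, in the log domain
    return L2, L3, L4
```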
Current Methods for Speaker Verification

Selection of the cohort set
Two cohort selection methods [D. A. Reynolds, 1995] are used:
• One selects the B closest speakers to each client (used, for example, by L2, L3, and L4).
• The other selects the B/2 closest speakers to, plus the B/2 farthest speakers from, each client (used, for example, by L3).
The selection is based on the speaker distance measure [D. A. Reynolds, 1995], computed by

$$d(\lambda_i, \lambda_j) = \log \frac{p(X_i \mid \lambda_i)}{p(X_i \mid \lambda_j)} + \log \frac{p(X_j \mid \lambda_j)}{p(X_j \mid \lambda_i)},$$

where $\lambda_i$ and $\lambda_j$ are speaker models trained using the i-th speaker's training utterances $X_i$ and the j-th speaker's training utterances $X_j$, respectively.
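A small sketch of this distance, assuming a hypothetical `log_lik(model, X)` helper that returns the total log-likelihood log p(X|λ) of a speaker's training utterances under a model:

```python
def speaker_distance(log_lik, lam_i, lam_j, X_i, X_j):
    """d(lambda_i, lambda_j) of [D. A. Reynolds, 1995].

    log_lik(model, X) is an assumed callable returning log p(X | model).
    The measure is symmetric and equals 0 when lam_i == lam_j.
    """
    return ((log_lik(lam_i, X_i) - log_lik(lam_j, X_i)) +
            (log_lik(lam_j, X_j) - log_lik(lam_i, X_j)))
```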
Current Methods for Speaker Verification

The Null Hypothesis Characterization
The client model λ is represented by a Gaussian Mixture Model (GMM):
• λ can be trained via the ML criterion by using the Expectation-Maximization (EM) algorithm.
• λ can be derived from the UBM using MAP adaptation (the adapted GMM) [D. A. Reynolds, et al., 2000].
The adapted GMM combined with the L1 measure is what we term the GMM-UBM system.
Currently, GMM-UBM is the state-of-the-art approach.
• This method is appropriate for the Text-Independent (TI) task.
• Advantage: it covers unseen data.
Proposed Methods for Speaker Verification

Motivation: none of the LR measures developed so far has proved to be absolutely superior to the others across tasks and applications.
We propose two perspectives in an attempt to better characterize the ill-defined alternative hypothesis:
Perspective 1: optimal combination of the existing LRs.
Perspective 2: the design of a novel alternative hypothesis characterization.
Perspective 1: The Proposed Combined LR (ICPR2006)

The pros and cons of different LR measures motivate us to combine them into a unified framework, by virtue of the complementary information that each LR can contribute.
Given N different LR measures L_i(U), i = 1, 2, …, N, we define a combined LR measure L_0(U) by

$$L_0(U) = w_1 L_1(U) + w_2 L_2(U) + \cdots + w_N L_N(U) + b = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}) \;\begin{cases} \geq 0 & \text{accept } H_0 \\ < 0 & \text{accept } H_1, \end{cases} \tag{2}$$

where x = [L_1(U), L_2(U), …, L_N(U)]^T is an N × 1 vector in the space R^N, w = [w_1, w_2, …, w_N]^T is an N × 1 weight vector, and b is a bias.
Linear Discriminant Classifier

L_0(U) = w^T x + b = f(x) forms a so-called linear discriminant classifier.
This classifier translates the goal of solving an LR measure into the optimization of w and b, such that the utterances of clients and impostors can be separated.
To realize this classifier, three distinct data sets are needed:
• One for generating each client's model.
• One for generating each client's anti-models.
• One for optimizing w and b.
Linear Discriminant Classifier

The bias b actually plays the same role as the decision threshold θ of the LR defined in Eq. (1); it can be determined through a trade-off between false acceptance and false rejection.

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b \;\begin{cases} \geq 0 & \text{accept } H_0 \\ < 0 & \text{accept } H_1 \end{cases}$$

The main goal here is to find w.
f(x) can be solved via linear discriminant training algorithms, such as:
• Fisher's Linear Discriminant (FLD).
• Linear Support Vector Machine (Linear SVM).
• Perceptron.
Linear Discriminant Classifier

Using Fisher's Linear Discriminant (FLD)
Suppose the i-th class has n_i data samples $X_i = \{\mathbf{x}_1^i, \ldots, \mathbf{x}_{n_i}^i\}$, i = 1, 2.
The goal of FLD is to seek a direction w such that the following Fisher's criterion function J(w) is maximized:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w \mathbf{w}},$$

where S_b and S_w are, respectively, the between-class scatter matrix and the within-class scatter matrix, defined as

$$\mathbf{S}_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T,$$

$$\mathbf{S}_w = \sum_{i=1,2} \sum_{\mathbf{x} \in X_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T,$$

where $\mathbf{m}_i = \frac{1}{n_i}\sum_{s=1}^{n_i}\mathbf{x}_s^i$ is the mean vector of the i-th class.
Linear Discriminant Classifier

Using Fisher's Linear Discriminant (FLD)
The solution for w, which maximizes Fisher's criterion J(w), is the leading eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$.
Since $\mathbf{S}_b\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$ always lies in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, w can be directly calculated as

$$\mathbf{w} = \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2). \tag{3}$$
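A minimal NumPy sketch of the closed-form solution in Eq. (3), assuming the two classes are given as row-sample matrices; a small ridge term is added since S_w can be singular in practice:

```python
import numpy as np

def fld_direction(X1, X2, ridge=1e-6):
    """w = S_w^{-1} (m1 - m2), the closed-form FLD solution of Eq. (3).

    X1, X2: (n_i x d) arrays of samples from class 1 and class 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = X1.shape[1]
    Sw = np.zeros((d, d))
    for X, m in ((X1, m1), (X2, m2)):
        D = X - m
        Sw += D.T @ D                        # within-class scatter S_w
    Sw += ridge * np.eye(d)                  # regularization: S_w may be singular
    return np.linalg.solve(Sw, m1 - m2)      # direction maximizing J(w)
```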
Analysis of the Alternative Hypothesis

The LR approaches that have been proposed to characterize H1 can be collectively expressed in the following general form:

$$L(U) = \log \frac{p(U \mid \lambda)}{F\bigl(p(U \mid \lambda_1), p(U \mid \lambda_2), \ldots, p(U \mid \lambda_N)\bigr)}, \tag{4}$$

where F(·) is some function of the likelihood values from a set of so-called background models {λ_1, λ_2, …, λ_N}.
For example, F(·) can be the average function for L_3(U), the maximum for L_2(U), or the geometric mean for L_4(U), and the background model set here can be obtained from a cohort. A special case arises when F(·) is an identity function and N = 1; in this instance, a single background model is used for L_1(U).
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

We redesign the function F(·) as

$$F(\mathbf{u}) = \left( p(U \mid \lambda_1)^{\alpha_1} \, p(U \mid \lambda_2)^{\alpha_2} \cdots p(U \mid \lambda_N)^{\alpha_N} \right)^{\frac{1}{\alpha_1 + \alpha_2 + \cdots + \alpha_N}}, \tag{5}$$

where u = [p(U|λ_1), p(U|λ_2), …, p(U|λ_N)]^T is an N × 1 vector and $\alpha_i$ is the weight of the likelihood p(U | λ_i), i = 1, 2, …, N.
This function gives the N background models different weights according to their individual contributions to the alternative hypothesis.
It is clear that Eq. (5) is equivalent to a geometric mean function when $\alpha_i = 1$, i = 1, 2, …, N.
It is also clear that Eq. (5) reduces to a maximum function when $\alpha_{i^*} = 1$ for $i^* = \arg\max_{1 \leq i \leq N} \log p(U \mid \lambda_i)$ and $\alpha_i = 0$ for all $i \neq i^*$.
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

By substituting Eq. (5) into Eq. (4) and letting $w_i = \alpha_i / (\alpha_1 + \alpha_2 + \cdots + \alpha_N)$, so that $\sum_{i=1}^{N} w_i = 1$, we obtain

$$\begin{aligned}
L(U) &= \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_1)^{w_1} \, p(U \mid \lambda_2)^{w_2} \cdots p(U \mid \lambda_N)^{w_N}} \\
&= \log \left[ \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}\right)^{w_1} \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}\right)^{w_2} \cdots \left(\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)}\right)^{w_N} \right] \\
&= w_1 \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_1)} + w_2 \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_2)} + \cdots + w_N \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \\
&= \mathbf{w}^T \mathbf{x} \;\begin{cases} \geq \theta & \text{accept} \\ < \theta & \text{reject,} \end{cases}
\end{aligned} \tag{6}$$

where $\mathbf{w} = [w_1, w_2, \ldots, w_N]^T$ is an N × 1 weight vector and x is an N × 1 vector in the space R^N, expressed by

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \right]^T. \tag{7}$$
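A sketch of building the characteristic vector of Eq. (7) for one utterance, assuming the client and background log-likelihoods have already been computed:

```python
import numpy as np

def characteristic_vector(ll_client, ll_background):
    """x of Eq. (7): elementwise log-ratio against the N background models.

    ll_client: scalar, log p(U | lambda).
    ll_background: array of log p(U | lambda_i), i = 1..N.
    Returns an (N,) vector x with x[i] = log p(U|lambda) - log p(U|lambda_i).
    """
    return ll_client - np.asarray(ll_background)
```

The score of Eq. (6) is then simply `w @ x` for a learned weight vector `w`.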
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

The implicit idea in Eq. (7) is that the speech utterance U can be represented by a characteristic vector x.
If we replace the threshold θ in Eq. (6) with a bias b, the equation can be rewritten as

$$L(U) = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}), \tag{8}$$

analogous to the combined LR method in Eq. (2). f(x) in Eq. (8) again forms a linear discriminant classifier, which can be solved via linear discriminant training algorithms, such as FLD.
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)

Relation to Perspective 1: the combined LR measure
If the anti-models $\{\bar{\lambda}_1, \bar{\lambda}_2, \ldots, \bar{\lambda}_N\}$ are used instead of the background models for the characteristic vector x defined in Eq. (7), i.e.,

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_N)} \right]^T,$$

we obtain

$$f(\mathbf{x}) = w_1 \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_1)} + \cdots + w_N \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_N)} + b = w_1 L_1(U) + \cdots + w_N L_N(U) + b.$$

f(x) forms a linear combination of N different LR measures, which is the same as the combined LR measure.
Kernel Methods for Speaker Verification

f(x) = w^T x + b can be solved via linear discriminant training algorithms. However, such methods are based on the assumption that the observed data of different classes are linearly separable, which is not the case for most practical, nonlinearly separable data.
From this point of view, we hope that data from different classes, which are not linearly separable in the original input space R^N, can be separated linearly in some implicit higher dimensional (maybe infinite) feature space F via a nonlinear mapping Φ.
Let Φ(x) denote a vector obtained by mapping x from R^N to F. f(x) can then be re-defined as

$$f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b, \tag{9}$$

which constitutes a linear discriminant classifier in F.
Kernel Methods for Speaker Verification

In practice, it is difficult to determine what kind of mapping Φ would be applicable; therefore, the computation of Φ(x) can be infeasible.
We propose using the kernel method: it characterizes the relationship between the data samples in F, instead of computing Φ(x) directly.
This is achieved by introducing a kernel function

$$k(\mathbf{x}, \mathbf{y}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle, \tag{10}$$

which is the inner product of two vectors Φ(x) and Φ(y) in F.
Kernel Methods for Speaker Verification

The kernel function k(·) must be symmetric, positive definite, and conform to Mercer's condition. For example:
• The dot product kernel: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$.
• The d-th degree polynomial kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^d$.
• The Radial Basis Function (RBF) kernel: $k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$, where σ is a tunable parameter.
Existing kernel-based classification techniques can be applied to implement f(x) = w^T Φ(x) + b, such as:
• Support Vector Machine (SVM).
• Kernel Fisher Discriminant (KFD).
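Minimal NumPy implementations of the three kernels listed above, for single vectors x and y:

```python
import numpy as np

def dot_kernel(x, y):
    """k(x, y) = x^T y."""
    return x @ y

def poly_kernel(x, y, d=2):
    """k(x, y) = (x^T y + 1)^d, the d-th degree polynomial kernel."""
    return (x @ y + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is tunable."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```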
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Techniques based on SVM have been successfully applied to many classification and regression tasks.
Conventional LR:

$$L_1(U) = \log p(U \mid \lambda) - \log p(U \mid \bar{\lambda}) \;\begin{cases} \geq \theta & \text{accept} \\ < \theta & \text{reject} \end{cases}$$

If the probabilities were perfectly estimated (which is usually not the case), the Bayes decision rule would be the optimal decision.
However, a better solution should in theory be to use a discriminant framework [V. N. Vapnik, 1995].
[S. Bengio, et al., 2001] argued that the probability estimates are not perfect, and proposed the better-behaved version

$$a_1 \log p(U \mid \lambda) - a_2 \log p(U \mid \bar{\lambda}) + b,$$

where a_1, a_2, and b are adjustable parameters estimated using an SVM.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
[S. Bengio, et al., 2001] incorporated the two scores obtained from the GMM and the UBM with an SVM:

$$a_1 \log p(U \mid \lambda) - a_2 \log p(U \mid \bar{\lambda}) + b = \mathbf{w}^T\mathbf{x} + b = f(\mathbf{x}),$$

where $\mathbf{w} = [a_1, a_2]^T$ and $\mathbf{x} = [\log p(U \mid \lambda), \; -\log p(U \mid \bar{\lambda})]^T$.
Compared with our approach:
• [S. Bengio, et al., 2001] used only one simple background model, the UBM, as the alternative hypothesis characterization.
• Our approach integrates multiple background models into the alternative hypothesis characterization in a more effective and robust way:

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_N)} \right]^T.$$
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
The goal of SVM is to seek a separating hyperplane in the feature space F that maximizes the margin between classes.

[Figure: two candidate separating hyperplanes in a 2-D space, panels (a) and (b); panel (b) shows the optimal hyperplane with its optimal margin and support vectors, and has a greater separation distance than (a).]
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Following the theory of SVM, w can be expressed as

$$\mathbf{w} = \sum_{j=1}^{l} \alpha_j y_j \Phi(\mathbf{x}_j),$$

which, substituted into the original form $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$, yields

$$f(\mathbf{x}) = \sum_{j=1}^{l} \alpha_j y_j k(\mathbf{x}, \mathbf{x}_j) + b,$$

where each training sample x_j belongs to one of the two classes identified by the label $y_j \in \{+1, -1\}$, j = 1, 2, …, l.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Let $\boldsymbol{\alpha}^T = [\alpha_1, \alpha_2, \ldots, \alpha_l]$. Our goal now changes from finding w to finding α.
We can find the coefficients $\alpha_j$ by maximizing the objective function

$$Q(\boldsymbol{\alpha}) = \sum_{j=1}^{l} \alpha_j - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j),$$

subject to the constraints

$$\sum_{j=1}^{l} y_j \alpha_j = 0 \quad\text{and}\quad 0 \leq \alpha_j \leq C, \;\forall j,$$

where C is a penalty parameter.
The above optimization problem can be solved using quadratic programming techniques.
Kernel Methods for Speaker Verification

Support Vector Machine (SVM)
Note that most $\alpha_j$ are equal to zero; the training samples with non-zero $\alpha_j$ are called support vectors.
A few support vectors act as the key to deciding the optimal margin between classes in the SVM.
An SVM with a dot product kernel function, i.e., $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$, is known as a linear SVM.

[Figure: the optimal hyperplane and its optimal margin in a 2-D space, with the support vectors lying on the margin.]
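As a sketch, the whole dual optimization can be handled by any off-the-shelf SVM package; here scikit-learn (one possible choice, not necessarily what the original work used) learns the α_j, the support vectors, and b from labeled characteristic vectors. The data below are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed training data: X holds characteristic vectors (Eq. (7)) of client
# and impostor utterances; y holds the labels y_j in {+1, -1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 21))            # placeholder: 200 vectors, B+1 = 21 dims
y = np.where(rng.random(200) < 0.5, 1, -1)    # placeholder labels

svm = SVC(kernel="rbf", gamma=0.5, C=1.0)     # RBF kernel; C is the penalty parameter
svm.fit(X, y)

score = svm.decision_function(X[:1])          # f(x) = sum_j alpha_j y_j k(x, x_j) + b
accept_H0 = score >= 0                        # decision as in Eq. (2)/(8)
print(len(svm.support_), "support vectors")   # training samples with non-zero alpha_j
```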
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)
Alternatively, $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$ can be solved with KFD. In fact, the purpose of KFD is to apply FLD in the feature space F; we again need to maximize Fisher's criterion

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b^{\Phi} \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w^{\Phi} \mathbf{w}},$$

where $\mathbf{S}_b^{\Phi}$ and $\mathbf{S}_w^{\Phi}$ are, respectively, the between-class and the within-class scatter matrices in F, i.e.,

$$\mathbf{S}_b^{\Phi} = (\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})(\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})^T,$$

$$\mathbf{S}_w^{\Phi} = \sum_{i=1,2}\sum_{\mathbf{x}\in X_i} (\Phi(\mathbf{x}) - \mathbf{m}_i^{\Phi})(\Phi(\mathbf{x}) - \mathbf{m}_i^{\Phi})^T,$$

where $\mathbf{m}_i^{\Phi} = \frac{1}{n_i}\sum_{s=1}^{n_i}\Phi(\mathbf{x}_s^i)$ is the mean vector of the i-th class in F.
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)
Let $X_1 \cup X_2 = \{\mathbf{x}_1^1, \ldots, \mathbf{x}_{n_1}^1\} \cup \{\mathbf{x}_1^2, \ldots, \mathbf{x}_{n_2}^2\} = \{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ and $l = n_1 + n_2$.
According to the theory of reproducing kernels, the solution of w must lie in the span of all training data samples mapped in F, so w can be expressed as

$$\mathbf{w} = \sum_{j=1}^{l} \alpha_j \Phi(\mathbf{x}_j).$$

Accordingly, $f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b$ can be re-written as

$$f(\mathbf{x}) = \sum_{j=1}^{l} \alpha_j k(\mathbf{x}, \mathbf{x}_j) + b.$$

Let $\boldsymbol{\alpha}^T = [\alpha_1, \alpha_2, \ldots, \alpha_l]$. Our goal therefore changes from finding w to finding α, which maximizes

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{M}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{N}\boldsymbol{\alpha}}.$$
Kernel Methods for Speaker Verification

Kernel Fisher Discriminant (KFD)

$$\mathbf{M} = (\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2)(\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2)^T,$$

$$\mathbf{N} = \sum_{i=1,2} \mathbf{K}_i (\mathbf{I}_{n_i} - \mathbf{1}_{n_i}) \mathbf{K}_i^T,$$

where $\boldsymbol{\eta}_i$ is an l × 1 vector with $(\boldsymbol{\eta}_i)_j = \frac{1}{n_i}\sum_{s=1}^{n_i} k(\mathbf{x}_j, \mathbf{x}_s^i)$, $\mathbf{K}_i$ is an l × n_i matrix with $(\mathbf{K}_i)_{js} = k(\mathbf{x}_j, \mathbf{x}_s^i)$, $\mathbf{I}_{n_i}$ is an n_i × n_i identity matrix, and $\mathbf{1}_{n_i}$ is an n_i × n_i matrix with all entries 1/n_i.
The solution for α is analogous to the FLD solution $\mathbf{w} = \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ in Eq. (3):

$$\boldsymbol{\alpha} = \mathbf{N}^{-1}(\boldsymbol{\eta}_1 - \boldsymbol{\eta}_2),$$

which is also the leading eigenvector of $\mathbf{N}^{-1}\mathbf{M}$.
Experiments: Formation of the Characteristic Vector

In our methods, we use B+1 background models, consisting of
• B cohort set models,
• one world model $\bar{\lambda}$,
to form the characteristic vector x. Two cohort selection methods are used in the experiments:
• B closest speakers;
• B/2 closest speakers + B/2 farthest speakers,
yielding the following two (B+1) × 1 characteristic vectors:

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,B})} \right]^T,$$

$$\mathbf{x} = \left[ \log\frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{cst}\,B/2})}, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{fst}\,1})}, \; \ldots, \; \log\frac{p(U \mid \lambda)}{p(U \mid \lambda_{\mathrm{fst}\,B/2})} \right]^T,$$

where $\lambda_{\mathrm{cst}\,i}$ and $\lambda_{\mathrm{fst}\,i}$ are, respectively, the i-th closest model and the i-th farthest model of the client model λ.
Experiments

Detection Cost Function (DCF)
The NIST Detection Cost Function (DCF) reflects the performance at a single operating point on the DET curve. The DCF is defined as

$$C_{DET} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm} \cdot (1 - P_{Target}),$$

where
• $P_{Miss}$ and $P_{FalseAlarm}$ are the miss probability and the false-alarm probability, respectively;
• $C_{Miss}$ and $C_{FalseAlarm}$ are the respective relative costs of detection errors;
• $P_{Target}$ is the a priori probability of the specific target speaker.
A special case of the DCF is known as the Half Total Error Rate (HTER), where $C_{Miss}$ and $C_{FalseAlarm}$ are both equal to 1 and $P_{Target} = 0.5$, i.e.,

$$\mathrm{HTER} = \frac{1}{2}(P_{Miss} + P_{FalseAlarm}).$$
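Both operating-point costs are one-liners; the defaults below use the ISCSLP2006-SRE settings quoted later in these slides (C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.05):

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.05):
    """NIST Detection Cost Function at a single operating point."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def hter(p_miss, p_fa):
    """Half Total Error Rate: the DCF with C_Miss = C_FA = 1 and P_Target = 0.5."""
    return 0.5 * (p_miss + p_fa)
```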
Experiments: XM2VTSDB

Table 1. Configuration of the XM2VTSDB speech database.

  Session   Shot   199 clients   25 impostors   69 impostors
  1         1, 2   Training      Evaluation     Test
  2         1, 2   Training      Evaluation     Test
  3         1, 2   Evaluation    Evaluation     Test
  4         1, 2   Test          Evaluation     Test

“Training” subset: used to build each individual client's model and anti-models.
“Evaluation” subset: used to estimate α, w, and b.
“Test” subset: used for the performance evaluation.

The three utterance texts are:
1. “0 1 2 3 4 5 6 7 8 9”.
2. “5 0 6 9 2 8 1 3 7 4”.
3. “Joe took father's green shoe bench out”.
Experimental results (ICPR2006)

XM2VTSDB, for perspective 1: the proposed combined LR.

[Figure 1. Baselines vs. the combined LRs: DET curves for the “Test” subset.]

Further analysis of the results via the equal error rate (EER) showed that KFD (EER = 4.6%) achieved a 13.2% relative improvement over the 5.3% EER of L3(U).
Experimental results (submitted to ISCSLP2006)

XM2VTSDB, for perspective 2: the novel alternative hypothesis characterization.

Table 2. HTERs for the “Evaluation” and “Test” subsets (the XM2VTSDB task).

  System           min HTER for “Evaluation”   HTER for “Test”
  L1               0.0633                       0.0519
  L2_20c           0.0776                       0.0635
  L3_20c           0.0676                       0.0535
  L3_10c_10f       0.0589                       0.0515
  L4_20c           0.0734                       0.0583
  KFD_w_20c        0.0247                       0.0357
  SVM_w_20c        0.0320                       0.0414
  KFD_w_10c_10f    0.0232                       0.0389
  SVM_w_10c_10f    0.0310                       0.0417

A 30.68% relative improvement was achieved by KFD_w_20c, compared to L3_10c_10f, the best baseline system.
Experimental results (submitted to ISCSLP2006)

XM2VTSDB, for perspective 2: the proposed novel alternative hypothesis characterization.

[Figure 2. Best baselines vs. our proposed LRs: DET curves for the “Test” subset.]
Evaluation on the ISCSLP2006-SRE database

For perspective 2: the proposed novel alternative hypothesis characterization, in the text-independent speaker verification task.

Table 3. DCFs for the “Evaluation” and “Test” subsets (the ISCSLP2006-SRE task), with

$$C_{DET} = C_{Miss} \cdot P_{Miss} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm} \cdot (1 - P_{Target}),$$

where $C_{Miss} = 10$, $C_{FalseAlarm} = 1$, and $P_{Target} = 0.05$.

  System             min DCF for “Evaluation”   DCF for “Test”
  GMM-UBM (1024m)    0.0129                     0.0179
  KFD_w_50c_50f      0.0067                     0.0118

We observe that KFD_w_50c_50f achieved a 34.08% relative improvement over GMM-UBM.
Evaluation on the ISCSLP2006-SRE database

We participated in the text-independent speaker verification task of the ISCSLP2006 Speaker Recognition Evaluation (SRE) plan.
The evaluation results are as follows:

  I2R-SDPG_sg     Actual DCF = 0.90
  SINICA-IIS_tw   Actual DCF = 1.18
  CUHK-EE_hk      Actual DCF = 2.77
  THU-EE_cn       Actual DCF = 2.85
  EPITA_fr_1      Actual DCF = 3.92
  EPITA_fr_2      Actual DCF = 4.46
Conclusions

We have introduced current LR systems for speaker verification, and presented two proposed LR systems:
• The combined LR system.
• The new LR system with the novel alternative hypothesis characterization.
Both proposed LR systems can be formulated as a linear or non-linear discriminant classifier. Non-linear classifiers can be implemented using kernel methods:
• Kernel Fisher Discriminant (KFD).
• Support Vector Machine (SVM).
Experiments were conducted on two speaker verification tasks:
• The XM2VTSDB task.
• The ISCSLP2006-SRE task.
The results demonstrate the superiority of our methods over conventional approaches.