Speaker ID Smorgasbord, or How I Spent My Summer at ICSI

9/20/2004 Speech Group Lunch Talk

Kofi A. Boakye

International Computer Science Institute


Outline

• Keyword System

• Enhancements

• Monophone System

• Hybrid HMM/SVM

• Score Combinations

• Possible Directions


Keyword System: A Review

Motivation

I. Text-dependent systems have high performance, but limited flexibility when compared to text-independent systems

Capitalize on advantages of text-dependent systems in this text-independent domain by limiting words of interest to a select group:

Backchannels (yeah, uhhuh), filled pauses (um, uh), discourse markers (like, well, now…)

=> high frequency and high speaker-characteristic quality

II. GMMs assume frames are independent and fail to take advantage of sequential information

=> Use HMMs instead to model the evolution of speech in time


Keyword System: A Review

Approach

Model each speaker using a collection of keyword HMMs

Speaker models generated via adaptation of background models trained from a development data set

Use standard likelihood ratio approach:

Compute log likelihood ratio scores using accumulated log probabilities from keyword HMMs

Use a speech recognizer to:

1) Locate words in the speech stream

2) Align speech frames to the HMM

3) Generate acoustic likelihood scores
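The likelihood ratio scoring described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name, the dictionary layout, and in particular the normalization by total frame count are assumptions made for the example.

```python
from typing import Dict, List


def keyword_llr(target_logp: Dict[str, List[float]],
                background_logp: Dict[str, List[float]],
                frame_counts: Dict[str, List[int]]) -> float:
    """Accumulate keyword-token log probabilities and return a log likelihood ratio.

    Each dictionary maps a keyword to one entry per token found in the test
    utterance: the acoustic log probability under the target (speaker) HMM,
    under the background HMM, and the token's length in frames.
    Dividing by the total frame count is an assumption of this sketch.
    """
    total_target = 0.0
    total_background = 0.0
    total_frames = 0
    for word, token_scores in target_logp.items():
        total_target += sum(token_scores)
        total_background += sum(background_logp[word])
        total_frames += sum(frame_counts[word])
    if total_frames == 0:
        raise ValueError("no keyword tokens found in the utterance")
    return (total_target - total_background) / total_frames


# Hypothetical scores for two keywords.
score = keyword_llr(
    {"yeah": [-100.0, -90.0], "um": [-50.0]},
    {"yeah": [-110.0, -95.0], "um": [-52.0]},
    {"yeah": [20, 18], "um": [12]},
)
```

A positive score means the target speaker's keyword HMMs explain the tokens better than the background models do.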

[Figure: block diagram. The signal passes through a word extractor to keyword models HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N, whose outputs feed a combination stage.]


Keyword System: A Review

Keywords

Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}

Filled pauses: {um, uh}

Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}

Keyword Models

Simple left-to-right (whole word) HMMs with self-loops and no skips

4 Gaussian components per state

Number of states related to number of phones and median number of frames for word

HMMs trained and scored using HTK

Acoustic features: 19 mel-cepstra, zeroth cepstrum, and their first differences


System Performance

Switchboard 1 Dev Set

Data partitioned into 6 splits

Tests use jack-knifing procedure:

Test on splits 1-3 using a background model trained on splits 4-6 (and vice versa)

For development, tested primarily on split 1 with 8-side training

Result: EER = 0.83%
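For reference, an equal error rate can be computed from target and impostor trial scores roughly as below. This is a generic sketch of the metric, not the scoring tool used for these results; the function name and the threshold sweep are choices made for the example.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> float:
    """Return the EER as a fraction in [0, 1].

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where the miss rate and false-alarm rate are closest,
    and reported as their average.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("need at least one target and one impostor score")
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects every trial
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer


# Toy trial scores, not data from the talk.
toy_eer = equal_error_rate([1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.5, 2.5])
```

Multiply the returned fraction by 100 to express it in percent, as on the slides.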


System Performance

[Figure: "EER by Word" (EER in %) and "Word Frequencies" (token counts) for each keyword]

Observations:

Well-performing bigrams have comparable EERs

Poorly-performing bigrams suffer from a paucity of data

• Suggests possibility of frequency threshold for performance

Single word ‘yeah’ yields EER of 4.62%


Enhancements: Words

Examine the performance of other words

Sturim et al. propose word sets for text-constrained GMM system

1) Full set: 50 words that occur in > 70% of conversation sides

{and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that’s, on, its, about, do, for, was, don’t, one, get, all, with, oh, a, we, be, there, of, this, I’m, what, out, or, if, are, at}

2) Min set: 11 words that yield the lowest word-specific EERs

{and, I, that, yeah, you, just, like, uh, to, think, the}


Enhancements: Words

Performance

Full set:

EER = 1.16%

My set ∩ Full set = {yeah, like, uh, well, I, think, you}


Enhancements: Words

[Figure: "EER by Word" (EER in %) and "Word Count" for each word in the full set]

Observations:

Some poorly performing words occur quite frequently

• Such words may simply not be highly discriminative in nature

Single word ‘and’ yields EER of 2.48%!!


Enhancements: Words

Performance

Min set:

EER = 0.99%

My set ∩ Min set = {yeah, like, uh, I, you, think}


Enhancements: Words

[Figure: "EER by Word" (EER in %) and "Word Count" for each word in the min set]

Observations:

Except for ‘and’, min set words have comparable performance

Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction


Enhancements: HNorm

Target model scores have different distributions for utterances based on handset type

Perform mean and variance normalization of scores based on estimated impostor score distribution

For split 1, use impostor utterances from splits 2 and 3

• 75 females

• 86 males
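The normalization step can be sketched as below: for each target model, impostor scores are grouped by handset type, and a test score is shifted and scaled by the statistics of the matching group. The function names are hypothetical, the handset labels "elec" and "carb" are taken from the figure labels, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def estimate_hnorm_params(
        impostor_scores: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """For one target model, map each handset type to the (mean, std) of the
    scores that impostor utterances of that handset type obtain against it."""
    return {handset: (mean(scores), stdev(scores))
            for handset, scores in impostor_scores.items()}


def hnorm(raw_score: float, handset: str,
          params: Dict[str, Tuple[float, float]]) -> float:
    """Mean and variance normalize a raw LR score using the impostor
    statistics for the test utterance's handset type."""
    mu, sigma = params[handset]
    return (raw_score - mu) / sigma


# Toy impostor scores against one target model, not data from the talk.
params = estimate_hnorm_params({
    "elec": [-1.0, 0.0, 1.0],
    "carb": [1.0, 2.0, 3.0],
})
normalized_elec = hnorm(2.0, "elec", params)
normalized_carb = hnorm(4.0, "carb", params)
```

In this toy case a raw score of 2.0 on one handset type and 4.0 on the other map to the same normalized score, which is the effect the normalization is meant to have.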

[Figure: score distributions for target models tgt1 and tgt2 on "elec" and "carb" handsets, shown as LR scores and as HNorm scores]


Enhancements: HNorm

Performance

EER = 1.65%

Performance worsened!

Possible issue in HNorm implementation?


Enhancements: HNorm

Examine effect of HNorm on particular speaker scores

Speakers of interest: Those generating the most errors

3 Speakers each generating 4 errors


Enhancements: HNorm

Conclusion: HNorm works…but doesn’t

One possibility: Look at computed devs…

Speaker   Class      Old     New
1316      Impostor   0.588   1.333
1316      Target     1.039   2.122
1437      Impostor   0.342   0.786
1437      Target     0.381   0.766
1530      Impostor   0.554   1.375
1530      Target     0.42    0.992

Distributions are widening in some cases


Enhancements: Deltas

Problem: System performance differs significantly by gender

Hypothesis: Higher deltas for females may be noisier

Solution: Use longer window for delta computation to smooth
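To make the role of the window concrete, here is a sketch of the standard regression formula for delta features, d_t = Σ_k k·(c_{t+k} − c_{t−k}) / (2·Σ_k k²) for k = 1..window. Whether the window size on this slide corresponds exactly to the `window` parameter below is an assumption, as is the edge handling (repeating the first and last frames).

```python
from typing import List


def delta_features(frames: List[List[float]], window: int = 2) -> List[List[float]]:
    """Compute regression-based first differences over +/- `window` frames.

    A larger window averages over more neighbouring frames, which smooths
    the delta trajectory. Frames past either end of the utterance are
    replaced by the first or last frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# A one-dimensional ramp: the interior slope is 1 for either window size.
ramp = [[float(i)] for i in range(7)]
d2 = delta_features(ramp, window=2)
d3 = delta_features(ramp, window=3)
```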



Extended window size from 2->3

Result: EER = 0.83%

Performance nearly indistinguishable

Result: Male and female disparity remains

Result: EER = 1.32%

Performance worsens!

Result: Male and female disparity widens

Further investigation necessary


Monophone System

Motivation

Keyword system, with its use of HMMs, appears to have good performance

However, we are only using a small amount (~10%) of the total data available

=> Get full coverage by using phone HMMs rather than word HMMs

System represents a trade-off between token coverage and “sharpness” of modeling


Monophone System

Implementation

System implemented similarly to keyword system, with phones replacing words

Background models differ in that:

1) All models have 3 states, with 128 Gaussians per state

2) Models trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian
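The "successive splitting" schedule can be illustrated with a small sketch. It assumes every component is split in two each round, that means are perturbed by ±0.2 standard deviations, and that weights are halved; these details are assumptions for illustration and are not stated on the slide. The Baum-Welch re-estimation between rounds is only marked by a comment.

```python
from typing import List, Tuple

# One diagonal-covariance Gaussian: (weight, means, variances).
Gaussian = Tuple[float, List[float], List[float]]


def split_mixture(components: List[Gaussian], perturb: float = 0.2) -> List[Gaussian]:
    """Split every component into two, moving the means apart by
    +/- perturb standard deviations and halving the weight."""
    out: List[Gaussian] = []
    for weight, means, variances in components:
        offsets = [perturb * v ** 0.5 for v in variances]
        out.append((weight / 2.0,
                    [m + o for m, o in zip(means, offsets)],
                    list(variances)))
        out.append((weight / 2.0,
                    [m - o for m, o in zip(means, offsets)],
                    list(variances)))
    return out


def grow_mixture(seed: Gaussian, target_size: int) -> Tuple[List[Gaussian], int]:
    """Grow a single Gaussian to at least `target_size` components by
    repeated doubling; return the mixture and the number of rounds."""
    components = [seed]
    rounds = 0
    while len(components) < target_size:
        components = split_mixture(components)
        # Baum-Welch re-estimation of the state's parameters would run here.
        rounds += 1
    return components, rounds


mixture, rounds = grow_mixture((1.0, [0.0], [1.0]), 128)
```

Under the doubling assumption, going from a single Gaussian to 128 per state takes seven split-and-re-estimate rounds.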


Monophone System

Performance

EER = 1.16%

Similar performance to keyword system

Uses a lot more data!


Hybrid HMM/SVM System

Motivation

SVMs have been shown to yield good performance in speaker recognition systems

Features used:

• Frames

• Phone and word n-gram counts/frequencies

• Phone lattices


Hybrid HMM/SVM System

Motivation

Keyword system looks at “distance” between target and background models as measured by log-probabilities

Look at distance between models more explicitly

=> Use model parameters as features


Hybrid HMM/SVM System

Approach

Use concatenated mixture means as features for SVM

Positive examples obtained by adapting background HMM to each of 8 training conversations

Negative examples obtained by adapting background HMM to each conversation in the background set

Keyword-level SVM outputs combined to give final score

• Presently a simple linear combination with equal weighting is used (though clearly suboptimal)
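The feature construction and the equal-weight combination described above can be sketched as follows. The nested-list layout for the adapted HMM means and both function names are assumptions of this example; the SVM training itself is omitted.

```python
from typing import Dict, List


def mean_supervector(hmm_means: List[List[List[float]]]) -> List[float]:
    """Concatenate the mixture means of one adapted keyword HMM.

    hmm_means[state][mixture] is the mean vector of one Gaussian component;
    the result is a single flat feature vector for the SVM.
    """
    return [value
            for state in hmm_means
            for mixture in state
            for value in mixture]


def combine_keyword_scores(svm_outputs: Dict[str, float]) -> float:
    """Equal-weight linear combination of keyword-level SVM outputs."""
    if not svm_outputs:
        raise ValueError("no keyword scores to combine")
    return sum(svm_outputs.values()) / len(svm_outputs)


# Toy HMM with 2 states, 2 mixtures per state, 2-dimensional means.
features = mean_supervector([[[1.0, 2.0], [3.0, 4.0]],
                             [[5.0, 6.0], [7.0, 8.0]]])
final_score = combine_keyword_scores({"yeah": 1.0, "um": -0.5, "like": 0.5})
```

Each adapted conversation yields one such vector per keyword, so the 8 training conversations give 8 positive examples per keyword SVM.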


Hybrid HMM/SVM System

Performance

EER = 1.82%

Promising start


Score Combination

We have three independent systems, so let’s see how they combine…

Perform post facto (read: cheating) linear combination

System      EER (%)
Baseline    0.83
Monophone   1.16
SVM         1.82

System                 EER (%)   Weights
Baseline + Monophone   0.66      0.5, 0.5
Baseline + SVM         0.66      0.6, 0.4
Monophone + SVM        0.66      0.6, 0.4

Each best combination yields same EER

=> Possibly approaching EER limit for data set
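A post facto weight search of this kind can be sketched as below. The grid of weights, the function names, and the simple pairwise error count used in place of EER are all assumptions for illustration. Because the weight is picked on the same trials that are then scored, the resulting error is optimistic, which is the "cheating" noted above.

```python
from typing import Callable, List, Sequence, Tuple


def fuse(scores_a: Sequence[float], scores_b: Sequence[float],
         weight_a: float) -> List[float]:
    """Linear combination of two systems' scores for the same trials."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def best_weight(scores_a: Sequence[float], scores_b: Sequence[float],
                error_fn: Callable[[List[float]], float],
                steps: int = 10) -> Tuple[float, float]:
    """Try weight_a = 0, 1/steps, ..., 1 and return (weight, error) for
    the first weight that gives the lowest error."""
    best = (0.0, float("inf"))
    for i in range(steps + 1):
        weight = i / steps
        error = error_fn(fuse(scores_a, scores_b, weight))
        if error < best[1]:
            best = (weight, error)
    return best


def misordered_pairs(scores: List[float]) -> float:
    """Stand-in error measure: count (target, impostor) pairs where the
    impostor scores at least as high. Trials 0-1 are targets, 2-3 impostors."""
    targets, impostors = scores[:2], scores[2:]
    return float(sum(i >= t for t in targets for i in impostors))


# Toy scores from two systems that make different mistakes.
system_a = [1.0, 0.0, 0.5, -1.0]
system_b = [0.0, 1.0, -1.0, 0.5]
weight, error = best_weight(system_a, system_b, misordered_pairs)
```

In the toy data each system alone misorders one pair, while an intermediate weight removes the error, mirroring how the combined systems beat each single system.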


Possible Directions

• Develop on SWB2

• Create word “master list” for keyword system

• TNorm

• Modify features to address gender-specific performance disparity

• Score combination for hybrid system

• Modified hybrid system

• Tuning

• Plowing


Fin
