9/20/2004 Speech Group Lunch Talk Speaker ID Smorgasbord or How I spent My Summer at ICSI Kofi A. Boakye International Computer Science Institute
Page 1: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

9/20/2004 Speech Group Lunch Talk

Speaker ID Smorgasbord

or

How I spent My Summer at ICSI

Kofi A. Boakye

International Computer Science Institute

Page 2: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Outline

• Keyword System

• Enhancements

• Monophone System

• Hybrid HMM/SVM

• Score Combinations

• Possible Directions

Page 3: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Keyword System: A Review

Motivation

I. Text-dependent systems have high performance, but limited flexibility when compared to text-independent systems

Capitalize on advantages of text-dependent systems in this text-independent domain by limiting words of interest to a select group:

Backchannels (yeah, uhhuh), filled pauses (um, uh), discourse markers (like, well, now…)

=> high frequency and high speaker-characteristic quality

II. GMMs assume frames are independent and fail to take advantage of sequential information

=> Use HMMs instead to model the evolution of speech in time

Page 4: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Keyword System: A Review

Approach

Model each speaker using a collection of keyword HMMs

Speaker models generated via adaptation of background models trained from a development data set

Use standard likelihood ratio approach:

Compute log likelihood ratio scores using accumulated log probabilities from keyword HMMs

Use a speech recognizer to:

1) Locate words in the speech stream

2) Align speech frames to the HMM

3) Generate acoustic likelihood scores

[Figure: system diagram; labels: signal, Word Extractor, HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N, Combination]
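The likelihood ratio scoring above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function name, the token tuple layout, and the normalization by total frame count are all assumptions.

```python
def keyword_llr(tokens):
    """Log likelihood ratio score accumulated over keyword tokens.

    tokens: list of (target_logprob, background_logprob, n_frames), one entry
    per keyword occurrence located by the recognizer (hypothetical layout).
    """
    total_target = sum(t for t, _, _ in tokens)
    total_background = sum(b for _, b, _ in tokens)
    total_frames = sum(n for _, _, n in tokens)
    if total_frames == 0:
        raise ValueError("no keyword tokens found in the test utterance")
    # Assumption: normalize by frame count so scores are comparable across
    # utterances containing different amounts of keyword speech.
    return (total_target - total_background) / total_frames


# Made-up numbers for three keyword tokens in one test conversation side.
example_tokens = [(-410.0, -425.0, 30), (-198.5, -201.5, 15), (-655.0, -650.0, 45)]
example_score = keyword_llr(example_tokens)
```

A positive score means the speaker-adapted keyword HMMs explain the located words better than the background HMMs do.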

Page 5: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Keyword System: A Review

Keywords

Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}

Filled pauses: {um, uh}

Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}

Keyword Models

Simple left-to-right (whole word) HMMs with self-loops and no skips

4 Gaussian components per state

Number of states related to number of phones and median number of frames for word

HMMs trained and scored using HTK

Acoustic features: 19 mel-cepstra, zeroth cepstrum, and their first differences
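To make the topology concrete, here is a sketch of the transition structure of a left-to-right HMM with self-loops and no skips. The self-loop probability of 0.5 is an assumed initial value, and the non-emitting entry and exit states that HTK uses are left out for brevity.

```python
def left_to_right_transitions(n_states, self_loop=0.5):
    """Transition matrix for a left-to-right HMM with self-loops and no skips.

    Each state may only stay where it is or move to the next state.
    self_loop is an assumed initial value; training would re-estimate it.
    """
    if n_states < 1:
        raise ValueError("need at least one state")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            # Last emitting state: the exit transition is omitted in this sketch.
            matrix[i][i] = 1.0
        else:
            matrix[i][i] = self_loop
            matrix[i][i + 1] = 1.0 - self_loop
    return matrix


# Feature dimension implied by the slide, assuming the differences cover all
# static terms: (19 mel-cepstra + zeroth cepstrum) plus first differences.
FEATURE_DIM = (19 + 1) * 2
```

For example, `left_to_right_transitions(3)` has non-zero entries only on the diagonal and the first superdiagonal.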

Page 6: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

System Performance

Switchboard 1 Dev Set

Data partitioned into 6 splits

Tests use jack-knifing procedure:

Test on splits 1–3 using background model trained on splits 4–6 (and vice versa)

For development, tested primarily on split 1 with 8-side training

Result: EER = 0.83%
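For reference, the equal error rate (EER) is the operating point where the miss rate on target trials equals the false-alarm rate on impostor trials. The sketch below is one simple way to estimate it from raw scores; the function name and the threshold sweep are assumptions, not the scoring tool used in the talk.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns the
    average of miss and false-alarm rates at the threshold where the two
    rates are closest.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("need both target and impostor scores")
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for thr in thresholds:
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


# Made-up scores: one target trial scores below one impostor trial.
demo_eer = equal_error_rate([2.0, 1.5, 0.2, 1.8], [-1.0, 0.5, -0.3, 0.1])
```

With the made-up scores above, one of four target trials and one of four impostor trials fall on the wrong side of the best threshold.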

Page 7: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

System Performance

[Figure: "EER by Word" plot; axes: EER (%) and Word Frequencies]

Observations:

Well-performing bigrams have comparable EERs

Poorly-performing bigrams suffer from a paucity of data

• Suggests possibility of frequency threshold for performance

Single word ‘yeah’ yields EER of 4.62%

Page 9: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

Examine the performance of other words

Sturim et al. propose word sets for text-constrained GMM system

1) Full set: 50 words that occur in > 70% of conversation sides

{and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that’s, on, its, about, do, for, was, don’t, one, get, all, with, oh, a, we, be, there, of, this, I’m, what, out, or, if, are, at}

Page 11: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

Examine the performance of other words

Sturim et al. propose word sets for text-constrained GMM system

1) Full set: 50 words that occur in > 70% of conversation sides

2) Min set: 11 words that yield the lowest word-specific EERs

{and, I, that, yeah, you, just, like, uh, to, think, the}

Page 12: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

Performance

Full set:

EER = 1.16%

My set ∩ Full set = {yeah, like, uh, well, I, think, you}

Page 13: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

[Figure: "EER by Word" plot; axes: EER (%) and Word Count]

Observations:

Some poorly performing words occur quite frequently

• Such words may simply not be highly discriminative in nature

Single word ‘and’ yields EER of 2.48%!

Page 14: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

Performance

Min set:

EER = 0.99%

My set ∩ Min set = {yeah, like, uh, I, you, think}

Page 15: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Words

[Figure: "EER by Word" plot; axes: EER (%) and Word Count]

Observations:

Except for ‘and’, min set words have comparable performance

Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction

Page 16: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Target model scores have different distributions for utterances based on handset type

Perform mean and variance normalization of scores based on estimated impostor score distribution

For split 1, use impostor utterances from splits 2 and 3

• 75 females

• 86 males

[Figure: panels "LR scores" and "HNorm Scores"; labels: tgt1, tgt2, elec, carb]
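The normalization described above can be sketched as follows. It assumes that a handset label (elec or carb) is already available for every utterance and that impostor statistics are estimated separately for each target model and handset type; the function names and the example scores are illustrative only.

```python
from statistics import mean, stdev


def estimate_hnorm_params(impostor_scores_by_handset):
    """Per-handset mean and standard deviation of impostor scores.

    impostor_scores_by_handset: dict mapping a handset label to the scores
    that impostor utterances of that handset type obtain against ONE target
    model (so the parameters are specific to that model).
    """
    params = {}
    for handset, scores in impostor_scores_by_handset.items():
        sigma = stdev(scores)  # sample standard deviation; needs >= 2 scores
        if sigma == 0.0:
            raise ValueError(f"zero score variance for handset {handset!r}")
        params[handset] = (mean(scores), sigma)
    return params


def hnorm(raw_score, handset, params):
    """Mean and variance normalize a score using the test utterance's handset type."""
    mu, sigma = params[handset]
    return (raw_score - mu) / sigma


# Made-up impostor scores against a single target model.
demo_params = estimate_hnorm_params({
    "elec": [-1.0, 0.0, 1.0],
    "carb": [-4.0, -2.0, 0.0],
})
```

The same raw score of 0.5 then maps to different normalized scores depending on the handset type of the test utterance.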

Page 17: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Performance

EER = 1.65%

Performance worsened!

Possible issue in HNorm implementation?

Page 18: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Examine effect of HNorm on particular speaker scores

Speakers of interest: Those generating the most errors

3 speakers, each generating 4 errors

Page 19: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Page 20: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Page 21: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Page 22: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: HNorm

Conclusion: HNorm works… but doesn’t

One possibility: Look at computed standard deviations…

Speaker   Scores     Old     New
1316      Impostor   0.588   1.333
1316      Target     1.039   2.122
1437      Impostor   0.342   0.786
1437      Target     0.381   0.766
1530      Impostor   0.554   1.375
1530      Target     0.42    0.992

Distributions are widening in some cases

Page 23: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Deltas

Problem: System performance differs significantly by gender

Hypothesis: Higher deltas for females may be noisier

Solution: Use longer window for delta computation to smooth
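To show what a longer delta window does, here is a sketch of delta computation using the standard regression formula found in HTK, d_t = Σ_k k (c_{t+k} − c_{t−k}) / (2 Σ_k k²) for k = 1..window. Whether the window sizes quoted in this talk map exactly onto this parameter is an assumption, as is the edge handling by frame replication.

```python
def delta_features(frames, window=2):
    """First-difference (delta) features via a regression over +/- window frames.

    frames: list of feature vectors (lists of floats), one per time step.
    A larger window averages over more neighbouring frames, which smooths
    the delta trajectory. Edges are handled by repeating the first/last frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# A linear ramp has slope 1 per frame, so interior deltas are 1.0 for any window.
ramp = [[float(t)] for t in range(7)]
deltas_w2 = delta_features(ramp, window=2)
deltas_w3 = delta_features(ramp, window=3)
```

On noisy input, the wider window trades time resolution for a smoother estimate, which is the effect the hypothesis above relies on.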

Page 24: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Deltas

Extended window size from 2 to 3

Result:

EER = 0.83%

Performance nearly indistinguishable

Page 25: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Deltas

Extended window size from 2 to 3

Result:

Male and female disparity remains

Page 26: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Deltas

Extended window size from 3 to 5

Result:

EER = 1.32%

Performance worsens!

Page 27: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Enhancements: Deltas

Extended window size from 3 to 5

Result:

Male and female disparity widens

Further investigation necessary

Page 28: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Monophone System

Motivation

Keyword system, with its use of HMMs, appears to have good performance

However, we are only using a small amount (~10%) of the total data available

=> Get full coverage by using phone HMMs rather than word HMMs

System represents a trade-off between token coverage and “sharpness” of modeling

Page 29: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Monophone System

Implementation

System implemented similarly to keyword system, with phones replacing words

Background models differ in that:

1) All models have 3 states, with 128 Gaussians per state

2) Models trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian
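The successive splitting step can be sketched as below. This is a simplified illustration: it doubles every component at each step and perturbs the means by 0.2 standard deviations, both of which are assumptions, and the Baum-Welch re-estimation between splits is only marked by a comment.

```python
def split_mixture(components, perturb=0.2):
    """Double a diagonal-covariance Gaussian mixture by splitting each component.

    components: list of (weight, means, variances). Each component becomes two
    copies with half the weight and means shifted by +/- perturb standard
    deviations (perturb = 0.2 is an assumed value).
    """
    doubled = []
    for weight, means, variances in components:
        offsets = [perturb * v ** 0.5 for v in variances]
        up = [m + o for m, o in zip(means, offsets)]
        down = [m - o for m, o in zip(means, offsets)]
        doubled.append((weight / 2, up, list(variances)))
        doubled.append((weight / 2, down, list(variances)))
    return doubled


# Start from a single Gaussian and double until 128 components (7 splits).
single = [(1.0, [0.0, 1.0], [1.0, 4.0])]
mixture = single
n_splits = 0
while len(mixture) < 128:
    mixture = split_mixture(mixture)
    n_splits += 1
    # A real trainer would run Baum-Welch re-estimation here before the next
    # split; without it the perturbed means are not re-fit to the data.
```

Since 128 = 2^7, seven doublings are needed to go from one Gaussian to 128 per state under this scheme.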

Page 30: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Monophone System

Performance

EER = 1.16%

Similar performance to keyword system

Uses a lot more data!

Page 31: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Hybrid HMM/SVM System

Motivation

SVMs have been shown to yield good performance in speaker recognition systems

Features used:

• Frames

• Phone and word n-gram counts/frequencies

• Phone lattices

Page 32: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Hybrid HMM/SVM System

Motivation

Keyword system looks at “distance” between target and background models as measured by log-probabilities

Look at distance between models more explicitly

=> Use model parameters as features

Page 33: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Hybrid HMM/SVM System

Approach

Use concatenated mixture means as features for SVM

Positive examples obtained by adapting background HMM to each of 8 training conversations

Negative examples obtained by adapting background HMM to each conversation in the background set

Keyword-level SVM outputs combined to give final score

- Presently, a simple linear combination with equal weighting is used (though clearly suboptimal)
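The feature construction and the equal-weight combination can be sketched as follows. The SVM training itself is omitted; the nested-list layout of the HMM means, the function names, and the reading of "equal weighting" as a plain average are all assumptions for illustration.

```python
def supervector(hmm_means):
    """Concatenate all mixture means of one adapted keyword HMM into one vector.

    hmm_means[state][mixture] is a list of mean values (hypothetical layout).
    The fixed state/mixture order keeps dimensions aligned across speakers,
    so vectors from different adaptations can be compared by an SVM.
    """
    return [value for state in hmm_means for mixture in state for value in mixture]


def combine_keyword_outputs(svm_outputs):
    """Equal-weight linear combination of keyword-level SVM outputs.

    svm_outputs: dict mapping keyword -> SVM output for the test conversation.
    Assumption: equal weighting is taken here to mean a plain average.
    """
    if not svm_outputs:
        raise ValueError("no keyword-level outputs to combine")
    return sum(svm_outputs.values()) / len(svm_outputs)


# Toy HMM with 2 states, 2 mixtures per state, 2-dimensional means.
toy_means = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
toy_vector = supervector(toy_means)
toy_score = combine_keyword_outputs({"yeah": 0.8, "um": -0.2, "like": 0.3})
```

Each of the 8 training conversations would yield one such vector as a positive example, and each background conversation one negative example.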

Page 34: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Hybrid HMM/SVM System

Performance

EER = 1.82%

Promising first start

Page 35: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Score Combination

We have three independent systems, so let’s see how they combine…

Perform post facto (read: cheating) linear combination

System      EER (%)
Baseline    0.83
Monophone   1.16
SVM         1.82

System                 EER (%)   Weights
Baseline + Monophone   0.66      0.5, 0.5
Baseline + SVM         0.66      0.6, 0.4
Monophone + SVM        0.66      0.6, 0.4

Each best combination yields same EER

=> Possibly approaching EER limit for data set
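A post facto linear combination of two systems can be sketched as a weight sweep that picks whichever weight gives the lowest EER on the test trials themselves, which is why the slide calls it cheating. The function names, the weight grid, and the toy scores are assumptions.

```python
def equal_error_rate(target_scores, impostor_scores):
    """EER estimate: average of miss and false-alarm rates where they are closest."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for thr in thresholds:
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


def eer_for_weight(scores_a, scores_b, is_target, weight_a):
    """EER of the fused score weight_a * a + (1 - weight_a) * b."""
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(scores_a, scores_b)]
    targets = [s for s, t in zip(fused, is_target) if t]
    impostors = [s for s, t in zip(fused, is_target) if not t]
    return equal_error_rate(targets, impostors)


def best_weight(scores_a, scores_b, is_target, steps=10):
    """Sweep weight_a over 0, 1/steps, ..., 1 and keep the lowest-EER weight."""
    best = None
    for i in range(steps + 1):
        weight_a = i / steps
        eer = eer_for_weight(scores_a, scores_b, is_target, weight_a)
        if best is None or eer < best[1]:
            best = (weight_a, eer)
    return best


# Toy trials: each system alone confuses one target with one impostor,
# but the errors fall on different trials, so fusion separates them.
labels = [True, True, False, False]
system_a = [2.0, 0.0, 1.0, -1.0]
system_b = [0.0, 2.0, -1.0, 1.0]
fused_weight, fused_eer = best_weight(system_a, system_b, labels)
```

An honest version would tune the weights on held-out data and report the EER on separate trials.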

Page 36: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Possible Directions

• Develop on SWB2

• Create word “master list” for keyword system

• TNorm

• Modify features to address gender-specific performance disparity

• Score combination for hybrid system

• Modified hybrid system

• Tuning

• Plowing

Page 37: Speaker ID Smorgasbord or  How I spent My Summer at ICSI

Fin