CS 224S / LINGUIST 285: Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014
Lecture 15: Speaker Recognition
Lots of slides thanks to Douglas Reynolds
Why speaker recognition?
Access Control: physical facilities; websites, computer networks
Transaction Authentication: telephone banking, remote credit card purchases
Law Enforcement: forensics, surveillance
Speech Data Mining: meeting summarization, lecture transcription
slide text from Douglas Reynolds
Three Speaker Recognition Tasks
slide from Douglas Reynolds
Two kinds of speaker verification
Text-dependent: users have to say something specific; easier for the system
Text-independent: users can say whatever they want; more flexible but harder
Two phases to speaker detection
slide from Douglas Reynolds
Detection: Likelihood Ratio
Two-class hypothesis test:
H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker
Choose the most likely hypothesis
Likelihood ratio test:
slide from Douglas Reynolds
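Concretely (a standard statement of the test; θ is the decision threshold): compute Λ(X) = p(X|H1) / p(X|H0) and accept H1 if Λ(X) ≥ θ, otherwise accept H0.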
Speaker ID: Log-Likelihood Ratio Score
LLR = Λ = log p(X|H1) − log p(X|H0)
Need two models:
Hypothesized speaker model for H1
Alternative (background) model for H0
slide from Douglas Reynolds
How do we get H0?
Pool speech from several speakers and train a single model: a universal background model (UBM)
Can train one UBM and use it as H0 for all speakers
Should be trained using speech representing the expected impostor speech
Same type of speech as speaker enrollment (modality, language, channel)
Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds
How do we compute p(X|H)?
Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition
Support Vector Machines (SVM): more recent use of a discriminative model
Form of GMM/HMM depends on application
slide from Douglas Reynolds
GMMs for speaker recognition
A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
Each Gaussian component i has a mean μi, a covariance Σi, and a weight wi
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
[Figure: a GMM density p(x|λ) over two feature dimensions, showing the component means μi]
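In symbols (a standard statement of the mixture density, matching the notation above): p(x|λ) = Σi wi N(x; μi, Σi), where the mixture weights wi sum to one and λ = {wi, μi, Σi} collects the model parameters.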
Recognition Systems: Gaussian Mixture Models
[Figure: the model p(x) over two feature dimensions; each mixture component contributes parameters μi, Σi, wi]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
GMM training
During training, the system learns about the data it uses to make decisions: a set of features is collected from a speaker (or language or dialect)
[Figure: training features x1, x2, … in two feature dimensions are used to fit the model p(x)]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
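A minimal sketch of this training step, assuming scikit-learn is available; the random features stand in for real MFCC vectors and the component count is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for an MFCC feature matrix of shape (n_frames, n_dims),
# e.g. 19 MFCCs + deltas extracted from one speaker's enrollment speech
X = np.random.randn(5000, 38)

# Fit a diagonal-covariance GMM to the speaker's features with EM
speaker_gmm = GaussianMixture(n_components=64, covariance_type="diag",
                              max_iter=100, random_state=0)
speaker_gmm.fit(X)

# Average per-frame log-likelihood of data under the trained model
print(speaker_gmm.score(X))
```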
Recognition Systems for Language, Dialect, Speaker ID
In LID, DID, and SID, we train a set of target models p(x|λc), one for each of the C languages, dialects, or speakers
[Figure: models λ1, λ2, λ3, …, λC, each a GMM over two feature dimensions]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Universal Background Model
We also train a universal background model representing all speech
[Figure: the universal background model alongside the C target models, over two feature dimensions]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
Given a set of test observations Xtest = {x1, x2, …, xK}, we perform a hypothesis test to determine whether a certain class produced them:
H0: Xtest is from the hypothesized class
H1: Xtest is not from the hypothesized class
[Figure: the test features are scored against a hypothesized speaker model p(x|λ1) (“Dan?”) and the universal background model (“UBM (not Dan)?”)]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
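A hedged sketch of this test using two trained models (scikit-learn GaussianMixture objects are assumed; `speaker_gmm` continues the earlier sketch and `ubm` is a background model trained the same way on pooled speech):

```python
import numpy as np

def llr_score(X_test, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio; higher scores favor the
    hypothesis that X_test was produced by the target speaker."""
    # score_samples gives the log-likelihood of each frame under a model
    return np.mean(speaker_gmm.score_samples(X_test)
                   - ubm.score_samples(X_test))

# Accept the claim if the score exceeds a tuned decision threshold theta
# accept = llr_score(X_test, speaker_gmm, ubm) > theta
```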
More details on GMMs
Instead of training the speaker model on only that speaker's data, adapt the UBM to the speaker
This takes advantage of all the data
MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker data
Weight the speaker more if we have more data:
μi = α Ei(x) + (1 − α) μi, with α = n/(n + 16), where n is the count of frames softly assigned to Gaussian i
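A minimal sketch of mean-only MAP adaptation, assuming the UBM is a trained scikit-learn GaussianMixture; the relevance factor 16 matches the formula above, everything else is illustrative:

```python
import numpy as np

def map_adapt_means(ubm, X, r=16.0):
    """Adapt the UBM means toward speaker data X (mean-only MAP)."""
    post = ubm.predict_proba(X)              # responsibilities, (T, M)
    n = post.sum(axis=0)                     # soft frame count per Gaussian
    # E_i(x): posterior-weighted mean of the speaker data per Gaussian
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + r))[:, None]           # more data -> trust speaker more
    return alpha * Ex + (1.0 - alpha) * ubm.means_
```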
Gaussian mixture models
Features are normal MFCCs; can use more dimensions (20 + deltas)
UBM background model: 512–2048 mixtures
Speaker's GMM: 64–256 mixtures
Often combined with other classifiers in a mixture-of-experts
SVM
Train a one-versus-all discriminative classifier
Various kernels
Combine with GMM
Other features
Prosody
Phone sequences
Language model features
Doddington (2001)
Word bigrams can be very informative about speaker identity
Evaluation Metric
Trial: are two audio samples spoken by the same person?
Two types of errors:
False reject (miss): incorrectly reject a true trial (Type I error)
False accept: incorrectly accept a false trial (Type II error)
Performance is a trade-off between these two errors
Controlled by adjustment of the decision threshold
slide from Douglas Reynolds
ROC and DET curves
slide from Douglas Reynolds
P(false reject) vs. P(false accept) shows system performance
DET curve
slide from Douglas Reynolds
Application operating point depends on relative costs of the two errors
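A small sketch of how such a curve is traced from system scores (numpy only; the threshold sweep and midpoint EER estimate are simplifications):

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm rates over all decision thresholds, plus a
    rough equal error rate (EER) where the two error curves cross."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return p_miss, p_fa, (p_miss[i] + p_fa[i]) / 2
```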
Evaluation tasks
slide from Douglas Reynolds
Performance numbers depend on evaluation conditions
Rough historical trends in performance
slide from Douglas Reynolds
Milestones in the NIST SRE Program
1992 – DARPA: limited speaker ID evaluation
1996 – First SRE in current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI “forensic” database
2005 – Multiple languages with bilingual speakers
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function: lower FA rate region
2010 – High and low vocal effort, aging
2011 – Broad range of conditions, including noise and reverb
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Metrics
Equal Error Rate: easy to understand, but not the operating point of interest
FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms
Decision Cost Function
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Decision Cost Function CDet
Weighted sum of miss and false alarm error probabilities:
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget:
Years      CMiss   CFalseAlarm   PTarget
’96–’08    10      1             0.01
2010       1       1             0.001
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
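A tiny sketch computing this cost under the two parameterizations above (the miss and false-alarm rates passed in are illustrative):

```python
def c_det(p_miss, p_fa, c_miss, c_fa, p_target):
    """NIST detection cost: weighted sum of miss and false-alarm rates."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

# The same system errors scored with the '96-'08 and the 2010 parameters
print(c_det(0.10, 0.02, c_miss=10, c_fa=1, p_target=0.01))    # 0.0298
print(c_det(0.10, 0.02, c_miss=1, c_fa=1, p_target=0.001))    # ~0.0201
```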
Accuracies
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
How good are humans?
Survey of 2000 voice IDs made by trained FBI employees:
select similarly pronounced words
use spectrograms (comparing formants, pitch, timing)
listen back and forth
Evaluated based on "interviews and other evidence in the investigation" and legal conclusions
No decision: 65.2% (1304)
Non-match: 18.8% (378); FR = 0.53% (2)
Match: 15.9% (318); FA = 0.31% (1)
Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6).
Speaker diarization
Conversational telephone speech: 2 speakers
Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments)
Meeting recordings: many speakers, lots of overlap and disfluencies
Tranter and Reynolds 2006
Speaker diarization
Tranter and Reynolds 2006
Step 1: Speech Activity Detection
Meetings or broadcast: use supervised GMMs
two models: speech/non-speech
or could have extra models for music, etc.
Then do Viterbi segmentation, possibly with minimum-length constraints or smoothing rules
Telephone: simple energy/spectrum speech activity detection
State of the art:
Broadcast: 1% miss, 1–2% false alarm
Meeting: 2% miss, 2–3% false alarm
Tranter and Reynolds 2006
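A hedged sketch of the simple energy-based detection mentioned for telephone speech (numpy only; frame size and threshold are illustrative):

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-35.0):
    """Mark each frame as speech if its log energy exceeds a fixed
    threshold relative to the loudest frame (a toy decision rule)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_energy = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return log_energy > log_energy.max() + threshold_db
```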
Step 2: Change Detection
1. Look at adjacent windows of data
2. Calculate the distance between them
3. Decide whether the windows come from the same source
Two common methods (a concrete criterion is sketched below):
To look for change points within a window, use a likelihood ratio test to see if the data is better modeled by one distribution or two. If two, insert a change point and start a new window there; if one, expand the window and check again.
Or: represent each window by a Gaussian, compare neighboring windows with the KL distance, find peaks in the distance function, and threshold.
Tranter and Reynolds 2006
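As one concrete instance of the likelihood-ratio method, a BIC-style criterion over two adjacent windows is standard in the literature (not taken from these slides; lam is a tunable penalty weight):

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Positive values suggest a speaker change between windows X and Y:
    modeling them as two Gaussians beats one Gaussian despite the penalty."""
    Z = np.vstack([X, Y])
    d = Z.shape[1]
    # log-determinant of the full covariance of a window of frames
    logdet = lambda W: np.linalg.slogdet(np.cov(W, rowvar=False))[1]
    glr = 0.5 * (len(Z) * logdet(Z) - len(X) * logdet(X) - len(Y) * logdet(Y))
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(len(Z))
    return glr - lam * penalty
```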
Step 3: Gender Classification
Supervised GMMs
If doing broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech)
Tranter and Reynolds 2006
Step 4: Clustering
Hierarchical agglomerative clustering (a toy sketch follows below):
1. initialize leaf clusters of tree with speech segments;
2. compute pair-wise distances between each cluster;
3. merge closest clusters;
4. update distances of remaining clusters to new cluster;
5. iterate steps 2–4 until stopping criterion is met
Tranter and Reynolds 2006
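A toy version of steps 1–5, assuming each segment is a feature matrix; real systems use BIC- or GLR-based distances rather than the Euclidean distance between cluster means used here:

```python
import numpy as np

def agglomerative_cluster(segments, stop_distance):
    """Merge speech segments bottom-up until the closest pair of
    clusters is farther apart than stop_distance."""
    clusters = [np.asarray(s) for s in segments]      # 1. leaf clusters
    while len(clusters) > 1:
        means = [c.mean(axis=0) for c in clusters]
        # 2. pair-wise distances between cluster means
        best, (i, j) = min(
            (np.linalg.norm(means[i] - means[j]), (i, j))
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        if best > stop_distance:                      # 5. stopping criterion
            break
        # 3./4. merge the closest pair; distances recomputed next iteration
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```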
Step 5: Resegmentation
Use the final clusters and non-speech models to resegment the data via Viterbi decoding
Goal:
refine the original segmentation
fix short segments that may have been removed
Tranter and Reynolds 2006
TDOA features
For meetings with multiple microphones: Time-Delay-of-Arrival (TDOA) features
correlate signals from mikes and figure out the time shift
used to sync up multiple microphones, and as a feature for speaker localization
assume the speaker doesn't move, so they stay near the same microphone
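A minimal sketch of the correlation step for one microphone pair (numpy only; real systems typically use GCC-PHAT rather than raw cross-correlation):

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, sample_rate):
    """Lag (in seconds) of sig_a relative to sig_b at the peak of their
    cross-correlation; usable to align channels or localize the speaker."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)   # lag in samples
    return lag / sample_rate
```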
Evaluation
Systems give start and stop times of speech segments with speaker labels
Non-scoring “collar” of 250 ms on either side of each boundary
DER (Diarization Error Rate) sums three components:
missed speech (% of speech in the ground truth but not in the hypothesis)
false alarm speech (% of speech in the hypothesis but not in the ground truth)
speaker error (% of speech assigned to the wrong speaker)
Recent mean DER for Multiple Distant Mikes (MDM): 8–10%
Recent mean DER for Single Distant Mike (SDM): 12–18%
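The metric itself is simple to state in code (durations in seconds; collar handling and the optimal speaker mapping are omitted):

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER: missed, false-alarm, and wrongly attributed speech time,
    summed and divided by the total ground-truth speech time."""
    return (missed + false_alarm + speaker_error) / total_speech

# e.g. 30 s missed + 20 s false alarm + 50 s speaker error over 1000 s
print(diarization_error_rate(30.0, 20.0, 50.0, 1000.0))  # 0.10
```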
Summary: Speaker Recognition Tasks
slide from Douglas Reynolds