Page 1:

CS 224S / LINGUIST 285: Spoken Language Processing

Dan Jurafsky, Stanford University

Spring 2014

Lecture 15: Speaker Recognition

Lots of slides thanks to Douglas Reynolds

Page 2:

Why speaker recognition?

Access Control: physical facilities; websites, computer networks

Transaction Authentication: telephone banking, remote credit card purchases

Law Enforcement: forensics, surveillance

Speech Data Mining: meeting summarization, lecture transcription

slide text from Douglas Reynolds

Page 3:

Three Speaker Recognition Tasks

slide from Douglas Reynolds

Page 4:

Two kinds of speaker verification

Text-dependent: users have to say something specific; easier for the system

Text-independent: users can say whatever they want; more flexible but harder

Page 5:

Two phases to speaker detection

slide from Douglas Reynolds

Page 6:

Detection: Likelihood Ratio

Two-class hypothesis test:

H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker

Choose the most likely hypothesis

Likelihood ratio test: accept H1 if p(X|H1) / p(X|H0) exceeds a threshold θ, otherwise accept H0

slide from Douglas Reynolds

Page 7:

Speaker ID: Log-Likelihood Ratio Score

LLR = Λ = log p(X|H1) − log p(X|H0)

Need two models: a hypothesized speaker model for H1 and an alternative (background) model for H0

slide from Douglas Reynolds
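Not on the slide, but as a concrete illustration of the scoring step: a minimal sketch assuming two already-trained scikit-learn GaussianMixture models, one for the hypothesized speaker (H1) and one for the background (H0). The function names and the threshold value are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr_score(X, spk_gmm: GaussianMixture, ubm: GaussianMixture) -> float:
    """Average per-frame log-likelihood ratio for features X (n_frames x n_dims).

    spk_gmm models H1 (hypothesized speaker); ubm models H0 (background).
    """
    # score_samples returns the log-likelihood of each frame under the model
    frame_llr = spk_gmm.score_samples(X) - ubm.score_samples(X)
    return float(np.mean(frame_llr))

def accept_claim(X, spk_gmm, ubm, threshold=0.0) -> bool:
    # Accept the claimed identity (H1) if the average LLR exceeds the threshold
    return llr_score(X, spk_gmm, ubm) > threshold
```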

Page 8:

How do we get the alternative model H0?

Pool speech from several speakers and train a single model: a universal background model (UBM)

Can train one UBM and use it as the H0 (background) model for all speakers

Should be trained on speech representing the expected impostor speech

Same type of speech as speaker enrollment (modality, language, channel)

Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds

Page 9:

How to compute P(H|X)?

Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition

Support Vector Machines (SVM): more recent use of a discriminative model

Page 10:

Form of GMM/HMM depends on application

slide from Douglas Reynolds

Page 11:

GMMs for speaker recognition

A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions

Each Gaussian state i has a mean, a covariance, and a weight

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

[Figure: GMM in a two-dimensional feature space (Dim 1, Dim 2)]

p(x | λ) = Σ_i w_i N(x; μ_i, Σ_i)
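To make the formula above concrete, here is a small sketch (not from the slides) that evaluates the GMM log-density directly from a set of weights, means, and covariances; the example parameters are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """log p(x | lambda) = log sum_i w_i N(x; mu_i, Sigma_i), via log-sum-exp."""
    comp = np.array([np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs)])
    m_max = comp.max()
    return m_max + np.log(np.exp(comp - m_max).sum())

# A made-up 2-component GMM over two feature dimensions (Dim 1, Dim 2)
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), 2.0 * np.eye(2)])
print(gmm_logpdf(np.array([1.0, 1.0]), weights, means, covs))
```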

Page 12:

Recognition Systems: Gaussian Mixture Models

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

Parameters: mean μ_i, covariance Σ_i, and weight w_i for each Gaussian

[Figure: GMM density p(x) over the two-dimensional feature space (Dim 1, Dim 2)]

Page 13:

Recognition Systems: Gaussian Mixture Models

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

Model components and parameters

[Figure: individual Gaussian components and the overall density p(x) in the (Dim 1, Dim 2) feature space]

Page 14:

GMM training

During training, the system learns about the data it uses to make decisions

A set of features is collected from a speaker (or language or dialect)

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

[Figure: training features x_1, x_2, … in the (Dim 1, Dim 2) feature space, and the resulting model p(x)]
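A hedged sketch of this training step using scikit-learn's EM implementation; the random features below are placeholders standing in for MFCC frames collected from one speaker.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder "training features": 5000 frames of 20-dimensional MFCCs
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 20))

# Fit a diagonal-covariance GMM to the speaker's features with EM
speaker_gmm = GaussianMixture(n_components=64, covariance_type="diag",
                              max_iter=100, random_state=0).fit(X_train)

# The learned parameters: weights w_i, means mu_i, and (diagonal) covariances
print(speaker_gmm.weights_.shape, speaker_gmm.means_.shape,
      speaker_gmm.covariances_.shape)
```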

Page 15:

Recognition Systems for Language, Dialect, Speaker ID

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

In LID, DID, and SID, we train a set of target models, one for each language, dialect, or speaker C

Each model has its own components and parameters, giving a class-conditional likelihood p(x | λ_C)

[Figure: Model 1, Model 2, Model 3, … in the (Dim 1, Dim 2) feature space]

Page 16:

Recognition Systems: Universal Background Model

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

We also train a universal background model representing all speech, with its own components, parameters, and likelihood p(x | λ_UBM)

[Figure: UBM in the (Dim 1, Dim 2) feature space]

Page 17:

Recognition Systems: Hypothesis Test

Given a set of test observations, we perform a hypothesis test to determine whether a certain class produced it

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

H0: X_test is from the hypothesized class
H1: X_test is not from the hypothesized class

X_test = {x_1, x_2, …, x_K}

[Figure: test observations in the (Dim 1, Dim 2) feature space]

Page 18:

Recognition Systems: Hypothesis Test

Given a set of test observations, we perform a hypothesis test to determine whether a certain class produced it

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

H0: X_test is from the hypothesized class
H1: X_test is not from the hypothesized class

X_test = {x_1, x_2, …, x_K}

Which hypothesis? Compare the test data against the trained models: p(X_test | λ_1), …, p(X_test | λ_C)

[Figure: test observations in the (Dim 1, Dim 2) feature space scored against each target model]

Page 19:

Recognition Systems: Hypothesis Test

Given a set of test observations, we perform a hypothesis test to determine whether a certain class produced it

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

X_test = {x_1, x_2, …, x_K}

Is the test data better explained by the speaker model (Dan?) or by the UBM (not Dan)?

Compare the likelihood of X_test under Dan's model against its likelihood under the UBM

[Figure: test observations scored against Dan's model and against the UBM in the (Dim 1, Dim 2) feature space]

Page 20:

More details on GMMs

Instead of training the speaker model on only the speaker's data, adapt the UBM to that speaker

This takes advantage of all the data

MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker's data

Weigh the speaker more if we have more data:

μ_i = α E_i(x) + (1 − α) μ_i,   with α = n / (n + 16)
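A minimal sketch of mean-only MAP adaptation as described above, assuming a diagonal-covariance UBM trained with scikit-learn; the relevance factor of 16 follows the slide, and the function name is made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X, relevance=16.0):
    """Speaker-adapted means: mu_i = alpha_i * E_i(x) + (1 - alpha_i) * mu_i_UBM."""
    post = ubm.predict_proba(X)                 # (n_frames, n_components) posteriors
    n_i = post.sum(axis=0)                      # soft frame count per component
    # E_i(x): posterior-weighted mean of the speaker's data for each component
    E_i = (post.T @ X) / np.maximum(n_i, 1e-10)[:, None]
    alpha = n_i / (n_i + relevance)             # alpha = n / (n + 16)
    return alpha[:, None] * E_i + (1.0 - alpha)[:, None] * ubm.means_
```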

Page 21:

Gaussian mixture models

Features are normal MFCCs; can use more dimensions (20 + deltas)

UBM background model: 512–2048 mixtures

Speaker's GMM: 64–256 mixtures

Often combined with other classifiers in a mixture-of-experts

Page 22:

SVM

Train a one-versus-all discriminative classifier

Various kernels

Combine with GMM
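A rough sketch of the discriminative route: one-versus-rest linear SVMs over fixed-length per-utterance vectors (for example, stacked GMM means, a "supervector"). The feature extraction is assumed rather than shown, and the data here is random placeholder material.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: one fixed-length vector per utterance, with speaker labels
rng = np.random.default_rng(0)
utt_vectors = rng.normal(size=(200, 1024))        # e.g., GMM supervectors
speaker_ids = rng.integers(0, 10, size=200)

# LinearSVC trains one-versus-rest classifiers, one per speaker
clf = LinearSVC(C=1.0).fit(utt_vectors, speaker_ids)
per_speaker_scores = clf.decision_function(utt_vectors[:1])
```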

Page 23:

Other features

Prosody

Phone sequences

Language model features

Page 24:

Doddington (2001)

Word bigrams can be very informative about speaker identity

Page 25:

Evaluation Metric

Trial: are two audio samples spoken by the same person?

Two types of errors:

False reject (miss): incorrectly reject a true trial (Type-I error)

False accept: incorrectly accept a false trial (Type-II error)

Performance is a trade-off between these two errors, controlled by adjusting the decision threshold

slide from Douglas Reynolds

Page 26:

ROC and DET curves

slide from Douglas Reynolds

P(false reject) vs. P(false accept) shows system performance
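A small sketch (not from the slides) of how the points on such a curve, and the equal error rate, might be computed from trial scores; target_scores and nontarget_scores are hypothetical arrays of LLRs for true and impostor trials.

```python
import numpy as np

def miss_fa_rates(target_scores, nontarget_scores, thresholds):
    """P(miss) and P(false alarm) at each decision threshold."""
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return miss, fa

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep thresholds over all observed scores; the EER is where miss ~= FA
    thr = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss, fa = miss_fa_rates(target_scores, nontarget_scores, thr)
    idx = int(np.argmin(np.abs(miss - fa)))
    return (miss[idx] + fa[idx]) / 2.0
```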

Page 27:

DET curve

slide from Douglas Reynolds

Application operating point depends on relative costs of the two errors

Page 28:

Evaluation tasks

slide from Douglas Reynolds

Performance numbers depend on evaluation conditions

Page 29:

Rough historical trends in performance

slide from Douglas Reynolds

Page 30:

Milestones in the NIST SRE Program

1992 – DARPA: limited speaker ID evaluation
1996 – First SRE in the current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI "forensic" database
2005 – Multiple languages with bilingual speakers
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function: lower FA rate region
2010 – High and low vocal effort, aging
2011 – Broad range of conditions, including noise and reverb

From Alvin Martin's 2012 talk on the NIST SR Evaluations

Page 31:

Metrics

Equal Error Rate: easy to understand, but not the operating point of interest

FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms

Decision Cost Function

From Alvin Martin’s 2012 talk on the NIST SR Evaluations

Page 32:

Decision Cost Function CDet

Weighted sum of miss and false alarm error probabilities:

CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)

Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, Ptarget:

          CMiss   CFalseAlarm   PTarget
'96-'08    10         1          0.01
2010        1         1          0.001

From Alvin Martin’s 2012 talk on the NIST SR Evaluations
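The cost function transcribes directly into code; the example below uses the two parameter settings from the table, at a made-up operating point.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """CDet = CMiss*PMiss|Target*PTarget + CFA*PFA|NonTarget*(1 - PTarget)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# '96-'08 parameters (CMiss=10, CFA=1, PTarget=0.01) at an example operating point
print(detection_cost(p_miss=0.05, p_fa=0.02))
# 2010 parameters (CMiss=1, CFA=1, PTarget=0.001)
print(detection_cost(p_miss=0.05, p_fa=0.02, c_miss=1.0, p_target=0.001))
```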

Page 33:

Accuracies

From Alvin Martin’s 2012 talk on the NIST SR Evaluations

Page 34:

How good are humans?

Survey of 2000 voice IDs made by trained FBI employees:

select similarly pronounced words

use spectrograms (comparing formants, pitch, timing)

listen back and forth

Evaluated based on "interviews and other evidence in the investigation" and legal conclusions

No decision: 65.2% (1304)
Non-match: 18.8% (378), false rejections (FR) = 0.53% (2)
Match: 15.9% (318), false acceptances (FA) = 0.31% (1)

Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am, 79(6)

Page 35:

Speaker diarization

Conversational telephone speech: 2 speakers

Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments)

Meeting recordings: many speakers, lots of overlap and disfluencies

Tranter and Reynolds 2006

Page 36:

Speaker diarization

Tranter and Reynolds 2006

Page 37:

Step 1: Speech Activity Detection

Meetings or broadcast: use supervised GMMs

Two models (speech / non-speech), or could have extra models for music, etc.

Then do Viterbi segmentation, possibly with minimum length constraints or smoothing rules

Telephone: simple energy/spectrum speech activity detection

State of the art: Broadcast: 1% miss, 1–2% false alarm; Meeting: 2% miss, 2–3% false alarm

Tranter and Reynolds 2006
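A toy sketch of the "simple energy" option mentioned above for telephone speech; the frame length, hop, and threshold rule are illustrative choices, not what any particular system uses.

```python
import numpy as np

def energy_sad(signal, frame_len=400, hop=160, margin_db=10.0):
    """Label each frame as speech (True) or non-speech by log-energy thresholding."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    log_e = 10.0 * np.log10(energy + 1e-12)
    # Threshold a fixed margin above the quietest frames (a rough noise floor)
    threshold = np.percentile(log_e, 10) + margin_db
    return log_e > threshold
```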

Page 38:

Step 2: Change Detection

1. Look at adjacent windows of data
2. Calculate the distance between them
3. Decide whether the windows come from the same source

Two common methods:

(1) To look for change points within a window, use a likelihood ratio test to see whether the data is better modeled by one distribution or two. If two, insert a change point and start a new window there; if one, expand the window and check again.

(2) Represent each window by a Gaussian, compare neighboring windows with a KL distance, find peaks in the distance function, and threshold (a sketch of this follows below).
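A sketch of the second method, assuming one diagonal Gaussian per window and a symmetric KL-style distance; the window size, hop, and peak-picking are placeholder choices.

```python
import numpy as np

def symmetric_kl(X1, X2):
    """Symmetric KL divergence between diagonal Gaussians fit to X1 and X2."""
    m1, v1 = X1.mean(0), X1.var(0) + 1e-8
    m2, v2 = X2.mean(0), X2.var(0) + 1e-8
    kl12 = 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (m1 - m2) ** 2 / v1 - 1.0 + np.log(v1 / v2))
    return kl12 + kl21

def change_distances(features, win=100, hop=10):
    """Distance between adjacent windows at each candidate change point.

    Peaks in this distance function above a threshold are hypothesized change points.
    """
    return [(t, symmetric_kl(features[t - win:t], features[t:t + win]))
            for t in range(win, len(features) - win, hop)]
```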

Tranter and Reynolds 2006

Page 39:

Step 3: Gender Classification

Supervised GMMs

If doing broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech)

Tranter and Reynolds 2006

Page 40:

Step 4: Clustering

Hierarchical agglomerative clustering:
1. initialize the leaf clusters of the tree with the speech segments;
2. compute pair-wise distances between each pair of clusters;
3. merge the closest clusters;
4. update the distances of the remaining clusters to the new cluster;
5. iterate steps 2-4 until the stopping criterion is met
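A bare-bones sketch of steps 1-5; the distance function is passed in (it could be the symmetric KL from the change-detection sketch above), and the stopping criterion here (a fixed distance threshold) is just one illustrative choice.

```python
import numpy as np

def agglomerative_cluster(segments, distance, stop_threshold):
    """segments: list of per-segment feature arrays; distance: callable on two arrays."""
    clusters = list(segments)                                  # step 1: leaf clusters
    while len(clusters) > 1:
        # step 2: pair-wise distances between all current clusters
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dists = [distance(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmin(dists))
        if dists[best] > stop_threshold:                       # step 5: stopping criterion
            break
        i, j = pairs[best]                                     # step 3: merge the closest pair
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # step 4: distances to the new cluster are recomputed on the next pass
    return clusters
```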

Tranter and Reynolds 2006

Page 41:

Step 5: Resegmentation

Use the final clusters and non-speech models to resegment the data via Viterbi decoding

Goal: refine the original segmentation and fix short segments that may have been removed

Tranter and Reynolds 2006

Page 42:

TDOA features

For meetings with multiple microphones: Time-Delay-of-Arrival (TDOA) features

Correlate the signals from the mikes and figure out the time shift

Used to sync up multiple microphones, and as a feature for speaker localization

Assume the speaker doesn't move, so they stay near the same microphone
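A minimal cross-correlation sketch of estimating the time shift between two microphone channels; real meeting systems typically use GCC-PHAT, which is not shown here, and the function name is made up.

```python
import numpy as np

def estimate_delay(sig_a, sig_b, sample_rate):
    """Delay (in seconds) of sig_b relative to sig_a, from the cross-correlation peak."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    delay_samples = (len(sig_b) - 1) - int(np.argmax(corr))
    return delay_samples / sample_rate
```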

Page 43:

Evaluation

Systems give start and stop times of speech segments with speaker labels

Non-scoring "collar" of 250 ms on either side of each reference boundary

DER (Diarization Error Rate) sums:
missed speech (% of speech in the ground truth but not in the hypothesis)
false alarm speech (% of speech in the hypothesis but not in the ground truth)
speaker error (% of speech assigned to the wrong speaker)

Recent mean DER for Multiple Distant Mikes (MDM): 8–10%
Recent mean DER for a Single Distant Mike (SDM): 12–18%
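A toy frame-level sketch of the three DER components, ignoring the 250 ms collar and the optimal mapping between hypothesis and reference speaker labels that real scoring tools compute; the label conventions here are invented.

```python
import numpy as np

def frame_der(ref, hyp, non_speech=-1):
    """ref, hyp: per-frame speaker labels; non_speech marks frames with no speaker."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_speech = ref != non_speech
    hyp_speech = hyp != non_speech
    missed = np.sum(ref_speech & ~hyp_speech)            # speech in ref, not in hyp
    false_alarm = np.sum(~ref_speech & hyp_speech)       # speech in hyp, not in ref
    spk_error = np.sum(ref_speech & hyp_speech & (ref != hyp))  # wrong speaker
    return (missed + false_alarm + spk_error) / max(int(np.sum(ref_speech)), 1)
```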

Page 44:

Summary: Speaker Recognition Tasks

slide from Douglas Reynolds