CS 224S / LINGUIST 285: Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014
Lecture 15: Speaker Recognition
Lots of slides thanks to Douglas Reynolds
Why speaker recognition?
Access Control: physical facilities; websites, computer networks
Transaction Authentication: telephone banking, remote credit card purchases
Law Enforcement: forensics, surveillance
Speech Data Mining: meeting summarization, lecture transcription
slide text from Douglas Reynolds
Three Speaker Recognition Tasks
slide from Douglas Reynolds
Two kinds of speaker verification
Text-dependent: users have to say something specific; easier for the system
Text-independent: users can say whatever they want; more flexible but harder
Two phases to speaker detection
slide from Douglas Reynolds
Detection: Likelihood Ratio
Two-class hypothesis test:
H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker
Choose the most likely hypothesis
Likelihood ratio test:
slide from Douglas Reynolds
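Concretely (a standard statement of the test; θ is the decision threshold): compute Λ(X) = p(X|H1) / p(X|H0) and accept H1 if Λ(X) ≥ θ, otherwise accept H0.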
Speaker ID: Log-Likelihood Ratio Score
LLR = Λ = log p(X|H1) − log p(X|H0)
Need two models:
Hypothesized speaker model for H1
Alternative (background) model for H0
slide from Douglas Reynolds
How do we get H0?
Pool speech from several speakers and train a single model: a universal background model (UBM)
Can train one UBM and use it as H0 for all speakers
Should be trained using speech representing the expected impostor speech
Same type of speech as speaker enrollment (modality, language, channel)
Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds
How do we compute p(X|H)?
Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition
Support Vector Machines (SVM): more recent use of a discriminative model
Form of GMM/HMM depends on application
slide from Douglas Reynolds
GMMs for speaker recognition
A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
Each Gaussian component i has a mean μi, a covariance Σi, and a weight wi
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
[Figure: a GMM density p(x|λ) over two feature dimensions, showing the component means μi]
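In symbols (a standard statement of the mixture density, matching the notation above): p(x|λ) = Σi wi N(x; μi, Σi), where the mixture weights wi sum to one and λ = {wi, μi, Σi} collects the model parameters.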
Recognition Systems: Gaussian Mixture Models
[Figure: the model p(x) over two feature dimensions; each mixture component contributes parameters μi, Σi, wi]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
GMM training
During training, the system learns about the data it uses to make decisions: a set of features is collected from a speaker (or language or dialect)
[Figure: training features x1, x2, … in two feature dimensions are used to fit the model p(x)]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
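A minimal sketch of this training step, assuming scikit-learn is available; the random features stand in for real MFCC vectors and the component count is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for an MFCC feature matrix of shape (n_frames, n_dims),
# e.g. 19 MFCCs + deltas extracted from one speaker's enrollment speech
X = np.random.randn(5000, 38)

# Fit a diagonal-covariance GMM to the speaker's features with EM
speaker_gmm = GaussianMixture(n_components=64, covariance_type="diag",
                              max_iter=100, random_state=0)
speaker_gmm.fit(X)

# Average per-frame log-likelihood of data under the trained model
print(speaker_gmm.score(X))
```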
Recognition Systems for Language, Dialect, Speaker ID
In LID, DID, and SID, we train a set of target models p(x|λc), one for each of the C languages, dialects, or speakers
[Figure: models λ1, λ2, λ3, …, λC, each a GMM over two feature dimensions]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Universal Background Model
We also train a universal background model representing all speech
[Figure: the universal background model alongside the C target models, over two feature dimensions]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
Given a set of test observations Xtest = {x1, x2, …, xK}, we perform a hypothesis test to determine whether a certain class produced them:
H0: Xtest is from the hypothesized class
H1: Xtest is not from the hypothesized class
[Figure: the test features are scored against a hypothesized speaker model p(x|λ1) (“Dan?”) and the universal background model (“UBM (not Dan)?”)]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
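A hedged sketch of this test using two trained models (scikit-learn GaussianMixture objects are assumed; `speaker_gmm` continues the earlier sketch and `ubm` is a background model trained the same way on pooled speech):

```python
import numpy as np

def llr_score(X_test, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio; higher scores favor the
    hypothesis that X_test was produced by the target speaker."""
    # score_samples gives the log-likelihood of each frame under a model
    return np.mean(speaker_gmm.score_samples(X_test)
                   - ubm.score_samples(X_test))

# Accept the claim if the score exceeds a tuned decision threshold theta
# accept = llr_score(X_test, speaker_gmm, ubm) > theta
```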
More details on GMMs
Instead of training the speaker model on only that speaker's data, adapt the UBM to the speaker
This takes advantage of all the data
MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker data
Weight the speaker more if we have more data:
μi = α Ei(x) + (1 − α) μi, with α = n/(n + 16), where n is the count of frames softly assigned to Gaussian i
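A minimal sketch of mean-only MAP adaptation, assuming the UBM is a trained scikit-learn GaussianMixture; the relevance factor 16 matches the formula above, everything else is illustrative:

```python
import numpy as np

def map_adapt_means(ubm, X, r=16.0):
    """Adapt the UBM means toward speaker data X (mean-only MAP)."""
    post = ubm.predict_proba(X)              # responsibilities, (T, M)
    n = post.sum(axis=0)                     # soft frame count per Gaussian
    # E_i(x): posterior-weighted mean of the speaker data per Gaussian
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + r))[:, None]           # more data -> trust speaker more
    return alpha * Ex + (1.0 - alpha) * ubm.means_
```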
Gaussian mixture models
Features are normal MFCCs; can use more dimensions (20 + deltas)
UBM background model: 512–2048 mixtures
Speaker's GMM: 64–256 mixtures
Often combined with other classifiers in a mixture-of-experts
SVM
Train a one-versus-all discriminative classifier
Various kernels
Combine with GMM
Other features
Prosody
Phone sequences
Language model features
Doddington (2001)
Word bigrams can be very informative about speaker identity
Evaluation Metric
Trial: are two audio samples spoken by the same person?
Two types of errors:
False reject (miss): incorrectly reject a true trial (Type I error)
False accept: incorrectly accept a false trial (Type II error)
Performance is a trade-off between these two errors
Controlled by adjustment of the decision threshold
slide from Douglas Reynolds
ROC and DET curves
slide from Douglas Reynolds
P(false reject) vs. P(false accept) shows system performance
DET curve
slide from Douglas Reynolds
Application operating point depends on relative costs of the two errors
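A small sketch of how such a curve is traced from system scores (numpy only; the threshold sweep and midpoint EER estimate are simplifications):

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm rates over all decision thresholds, plus a
    rough equal error rate (EER) where the two error curves cross."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return p_miss, p_fa, (p_miss[i] + p_fa[i]) / 2
```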
Evaluation tasks
slide from Douglas Reynolds
Performance numbers depend on evaluation conditions
Rough historical trends in performance
slide from Douglas Reynolds
Milestones in the NIST SRE Program
1992 – DARPA: limited speaker ID evaluation
1996 – First SRE in current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI “forensic” database
2005 – Multiple languages with bilingual speakers
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function: lower FA rate region
2010 – High and low vocal effort, aging
2011 – Broad range of conditions, including noise and reverb
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Metrics
Equal Error Rate: easy to understand, but not the operating point of interest
FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms
Decision Cost Function
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Decision Cost Function CDet
Weighted sum of miss and false alarm error probabilities:
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget:
Years      CMiss   CFalseAlarm   PTarget
’96–’08    10      1             0.01
2010       1       1             0.001
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
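A tiny sketch computing this cost under the two parameterizations above (the miss and false-alarm rates passed in are illustrative):

```python
def c_det(p_miss, p_fa, c_miss, c_fa, p_target):
    """NIST detection cost: weighted sum of miss and false-alarm rates."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

# The same system errors scored with the '96-'08 and the 2010 parameters
print(c_det(0.10, 0.02, c_miss=10, c_fa=1, p_target=0.01))    # 0.0298
print(c_det(0.10, 0.02, c_miss=1, c_fa=1, p_target=0.001))    # ~0.0201
```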
Accuracies
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
How good are humans?
Survey of 2000 voice IDs made by trained FBI employees:
select similarly pronounced words
use spectrograms (comparing formants, pitch, timing)
listen back and forth
Evaluated based on "interviews and other evidence in the investigation" and legal conclusions
No decision: 65.2% (1304)
Non-match: 18.8% (378); FR = 0.53% (2)
Match: 15.9% (318); FA = 0.31% (1)
Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6).
Speaker diarization
Conversational telephone speech: 2 speakers
Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments)
Meeting recordings: many speakers, lots of overlap and disfluencies
Tranter and Reynolds 2006
Speaker diarization
Tranter and Reynolds 2006
Step 1: Speech Activity Detection
Meetings or broadcast: use supervised GMMs
two models: speech/non-speech
or could have extra models for music, etc.
Then do Viterbi segmentation, possibly with minimum-length constraints or smoothing rules
Telephone: simple energy/spectrum speech activity detection
State of the art:
Broadcast: 1% miss, 1–2% false alarm
Meeting: 2% miss, 2–3% false alarm
Tranter and Reynolds 2006
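A hedged sketch of the simple energy-based detection mentioned for telephone speech (numpy only; frame size and threshold are illustrative):

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-35.0):
    """Mark each frame as speech if its log energy exceeds a fixed
    threshold relative to the loudest frame (a toy decision rule)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_energy = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return log_energy > log_energy.max() + threshold_db
```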
Step 2: Change Detection
1. Look at adjacent windows of data
2. Calculate the distance between them
3. Decide whether the windows come from the same source
Two common methods (a concrete criterion is sketched below):
To look for change points within a window, use a likelihood ratio test to see if the data is better modeled by one distribution or two. If two, insert a change point and start a new window there; if one, expand the window and check again.
Or: represent each window by a Gaussian, compare neighboring windows with the KL distance, find peaks in the distance function, and threshold.
Tranter and Reynolds 2006
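As one concrete instance of the likelihood-ratio method, a BIC-style criterion over two adjacent windows is standard in the literature (not taken from these slides; lam is a tunable penalty weight):

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Positive values suggest a speaker change between windows X and Y:
    modeling them as two Gaussians beats one Gaussian despite the penalty."""
    Z = np.vstack([X, Y])
    d = Z.shape[1]
    # log-determinant of the full covariance of a window of frames
    logdet = lambda W: np.linalg.slogdet(np.cov(W, rowvar=False))[1]
    glr = 0.5 * (len(Z) * logdet(Z) - len(X) * logdet(X) - len(Y) * logdet(Y))
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(len(Z))
    return glr - lam * penalty
```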
Step 3: Gender Classification
Supervised GMMs
If doing broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech)
Tranter and Reynolds 2006
Step 4: Clustering
Hierarchical agglomerative clustering (a toy sketch follows below):
1. initialize leaf clusters of tree with speech segments;
2. compute pair-wise distances between each cluster;
3. merge closest clusters;
4. update distances of remaining clusters to new cluster;
5. iterate steps 2–4 until stopping criterion is met
Tranter and Reynolds 2006
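A toy version of steps 1–5, assuming each segment is a feature matrix; real systems use BIC- or GLR-based distances rather than the Euclidean distance between cluster means used here:

```python
import numpy as np

def agglomerative_cluster(segments, stop_distance):
    """Merge speech segments bottom-up until the closest pair of
    clusters is farther apart than stop_distance."""
    clusters = [np.asarray(s) for s in segments]      # 1. leaf clusters
    while len(clusters) > 1:
        means = [c.mean(axis=0) for c in clusters]
        # 2. pair-wise distances between cluster means
        best, (i, j) = min(
            (np.linalg.norm(means[i] - means[j]), (i, j))
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        if best > stop_distance:                      # 5. stopping criterion
            break
        # 3./4. merge the closest pair; distances recomputed next iteration
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```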
Step 5: Resegmentation
Use the final clusters and non-speech models to resegment the data via Viterbi decoding
Goal:
refine the original segmentation
fix short segments that may have been removed
Tranter and Reynolds 2006
TDOA features
For meetings with multiple microphones: Time-Delay-of-Arrival (TDOA) features
correlate signals from mikes and figure out the time shift
used to sync up multiple microphones, and as a feature for speaker localization
assume the speaker doesn't move, so they stay near the same microphone
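A minimal sketch of the correlation step for one microphone pair (numpy only; real systems typically use GCC-PHAT rather than raw cross-correlation):

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, sample_rate):
    """Lag (in seconds) of sig_a relative to sig_b at the peak of their
    cross-correlation; usable to align channels or localize the speaker."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)   # lag in samples
    return lag / sample_rate
```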
Evaluation
Systems give start and stop times of speech segments with speaker labels
Non-scoring “collar” of 250 ms on either side of each boundary
DER (Diarization Error Rate) sums three components:
missed speech (% of speech in the ground truth but not in the hypothesis)
false alarm speech (% of speech in the hypothesis but not in the ground truth)
speaker error (% of speech assigned to the wrong speaker)
Recent mean DER for Multiple Distant Mikes (MDM): 8–10%
Recent mean DER for Single Distant Mike (SDM): 12–18%
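The metric itself is simple to state in code (durations in seconds; collar handling and the optimal speaker mapping are omitted):

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER: missed, false-alarm, and wrongly attributed speech time,
    summed and divided by the total ground-truth speech time."""
    return (missed + false_alarm + speaker_error) / total_speech

# e.g. 30 s missed + 20 s false alarm + 50 s speaker error over 1000 s
print(diarization_error_rate(30.0, 20.0, 50.0, 1000.0))  # 0.10
```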
Summary: Speaker Recognition Tasks
slide from Douglas Reynolds