An Intro to Speaker Recognition
Nikki Mirghafori
Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.
Jan 04, 2016
Nikki Mirghafori, 4/23/12, EECS 225D -- Verification
Today’s class
•Interactive
•Measures of success for today:
•You talk at least as much as I do
•You learn and remember the basics
•You feel you can do this stuff
•We all have fun with the material!
A 10-minute “Project Design”
• You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.
• The VC funding is yours if you come up with a coherent plan/list of issues:
• What is your proposed application?
• What will be the sources of error and variability, i.e., technology challenges?
• What types of features will you use?
• What sorts of statistical modeling tools/techniques?
• What will be your data needs?
• Any other issues you can think of along your path?
Extracting Information from Speech
Goal: automatically extract information transmitted in the speech signal:
• Speech Recognition → Words (e.g., “How are you?”)
• Language Recognition → Language Name (e.g., English)
• Speaker Recognition → Speaker Name (e.g., James Wilson)
• What’s noise? what’s signal?
• Orthogonal in many ways
• Use many of the same models and tools
Speaker Recognition Applications
• Access control
  – Physical facilities
  – Data and data networks
• Transaction authentication
  – Telephone credit card purchases
  – Bank wire transfers
  – Fraud detection
• Monitoring
  – Remote time and attendance logging
  – Home parole verification
• Information retrieval
  – Customer information for call centers
  – Audio indexing (speech skimming device)
  – Personalization
• Forensics
  – Voice sample matching
Tasks
•Identification vs. verification
•Closed set vs. open set identification
•Also, segmentation, clustering, tracking...
Closed-set Speaker Identification
• Test speech is scored against a speaker model database
• Output: whose voice is it? (one of the enrolled speakers)
Open-set Speaker Identification
• Test speech is scored against a speaker model database
• Output: whose voice is it? (possibly “none of the above”)
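The closed- vs. open-set distinction boils down to whether a “none of the above” outcome is allowed. A minimal sketch (the speaker names and score values are hypothetical; any scoring backend is assumed):

```python
def identify(scores, open_set=False, threshold=0.0):
    """Pick the best-scoring enrolled speaker. In open-set mode,
    return None ('none of the above') when even the best score
    falls below the acceptance threshold."""
    best = max(scores, key=scores.get)
    if open_set and scores[best] < threshold:
        return None
    return best

scores = {"alice": 1.7, "bob": -0.3, "carol": 0.4}
print(identify(scores))                                       # alice
print(identify({"alice": -2.0, "bob": -1.5}, open_set=True))  # None
```

The same function covers both tasks; only the rejection option differs.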
Verification/Authentication/Detection
• Claimant: “It’s me!” (verification requires a claimant ID)
• Test speech is scored against the claimed speaker’s model
• Output: does the voice match? Yes/No
Speech Modalities
• Text-dependent recognition
– Recognition system knows text spoken by person
– Examples: fixed phrase, prompted phrase
– Used for applications with strong control over user input
– Knowledge of spoken text can improve system performance
• Text-independent recognition
– Recognition system does not know text spoken by person
– Examples: User selected phrase, conversational speech
– Used for applications with less control over user input
– More flexible system but also more difficult problem
– Speech recognition can provide knowledge of spoken text
– Text-Constrained recognition. Exercise for the reader.
Text-constrained Recognition
•Basic idea: build speaker models for words rich in speaker information
•Example:
•“What time did you say? um... okay, I_think that’s a good plan.”
•Text-dependent strategy in a text-independent context
Voice as a biometric
• Biometric: a human-generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
  – natural signal to produce
  – does not require a specialized input device
  – ubiquitous: telephones and microphone-equipped PCs
• Voice biometric combines with other forms of security (strongest security uses all three):
  – Something you have, e.g., a badge
  – Something you know, e.g., a password
  – Something you are, e.g., voice
How to build a system?
• Feature choices: low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...)
• Types of models:
• HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets
• Making decisions: Log Likelihood Thresholds, threshold setting for desired operating point
• Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.
Verification Performance
• There are many factors to consider in the design of an evaluation of a speaker verification system:
  – Speech quality: channel and microphone characteristics; noise level and type; variability between enrollment and verification speech
  – Speech modality: fixed/prompted/user-selected phrases; free text
  – Speech duration: duration and number of sessions of enrollment and verification speech
  – Speaker population: size and composition
• Most importantly: the evaluation data and design should match the target application domain of interest
Verification Performance
[DET plot: Probability of False Reject (%) vs. Probability of False Accept (%). Error rates drop with increasing constraints:]
• Text-independent (Read sentences): military radio data, multiple radios & microphones, moderate amount of training data
• Text-independent (Conversational): telephone data, multiple microphones, moderate amount of training data
• Text-dependent (Digit strings): telephone data, multiple microphones, small amount of training data
• Text-dependent (Combinations): clean data, single microphone, large amount of train/test speech
Verification Performance: Example Performance Curve
[DET plot: Probability of False Reject (%) vs. Probability of False Accept (%), with Equal Error Rate (EER) = 1%]
• The application operating point depends on the relative costs of the two error types:
  – High security (e.g., wire transfer): false acceptance is very costly; users may tolerate rejections for security
  – High convenience (e.g., customization): false rejections alienate customers; any customization is beneficial
  – Balance: an operating point between the two extremes
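The EER quoted on the curve is the operating point where the false-reject and false-accept rates are equal. A small sketch of how it can be estimated from lists of target and impostor scores (the Gaussian toy scores are an assumption, not real system output):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal Error Rate: sweep a threshold over all observed scores and
    return the error rate where false reject and false accept are closest."""
    t_scores = np.asarray(target_scores)
    i_scores = np.asarray(impostor_scores)
    best_gap, best_eer = 1.0, 0.5
    for t in np.sort(np.concatenate([t_scores, i_scores])):
        fr = np.mean(t_scores < t)    # targets rejected at threshold t
        fa = np.mean(i_scores >= t)   # impostors accepted at threshold t
        if abs(fr - fa) < best_gap:
            best_gap, best_eer = abs(fr - fa), (fr + fa) / 2
    return best_eer

# Toy scores: targets score higher than impostors on average
rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)
impostors = rng.normal(0.0, 1.0, 1000)
print(f"EER = {eer(targets, impostors):.3f}")
```

For unit-variance Gaussians separated by two standard deviations, the EER lands around 16%; a 1% EER system as on the slide needs far better-separated score distributions.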
Human vs. Machine
• Motivation for comparing human to machine: evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal used NIST evaluation data (DSP Journal, January 2000):
  – Same amount of training data
  – Used 3-sec conversational utterances from telephone speech
  – Matched handset-type tests: human error rates ~15% worse than machines
  – Mismatched handset-type tests: human error rates ~44% better than machines
Features
• Desirable attributes of features for an automatic system (Wolf ’72):
  – Practical: occur naturally and frequently in speech; easily measurable
  – Robust: do not change over time or with the speaker’s health; not affected by reasonable background noise nor dependent on specific transmission characteristics
  – Secure: not subject to mimicry
• No feature has all these attributes
Training & Test Phases
• Enrollment phase: training speech for each speaker → feature extraction → model training → one model per speaker
• Recognition phase (e.g., verification): “It’s me!” → feature extraction → verification decision → accepted or rejected
Decision making
Verification decision approaches have roots in signal detection theory.
• 2-class hypothesis test:
  H0: the speaker is an impostor
  H1: the speaker is indeed the claimed speaker
• Statistic computed on test utterance S as a log likelihood ratio:
  log [ Likelihood(S came from the speaker model) / Likelihood(S did not come from the speaker model) ]
• Pipeline: feature extraction → score against speaker model (+) and impostor model (-) → decision: accept or reject
Decision making
• Identification: pick model (of N) with best score
• Verification: usual approach is via likelihood ratio tests (hypothesis testing), i.e.:
• By Bayes’ rule:
  P(target|x) / P(nontarget|x) = [ P(x|target) P(target) ] / [ P(x|nontarget) P(nontarget) ]
• Accept if the ratio exceeds a threshold; reject otherwise
• Can’t sum over all non-target talkers (the “world”) for speaker verification, so either:
  • Use “cohorts” (a collection of impostors), or
  • Train a “universal”/“world”/“background” model (speaker-independent, trained on many speakers)
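The accept/reject rule above can be sketched directly; the threshold absorbs the priors and error costs from the Bayes ratio. The likelihood values here are made up for illustration:

```python
def verify(log_p_x_target, log_p_x_nontarget, threshold=0.0):
    """Accept iff the log likelihood ratio exceeds the threshold.
    The nontarget likelihood comes from a cohort or background model;
    the threshold absorbs the priors P(target)/P(nontarget) and costs."""
    llr = log_p_x_target - log_p_x_nontarget
    return ("accept" if llr > threshold else "reject"), llr

decision, llr = verify(log_p_x_target=-120.4, log_p_x_nontarget=-125.1)
print(decision, round(llr, 1))  # accept 4.7
```

Raising the threshold trades false accepts for false rejects, which is exactly the operating-point choice on the DET curve.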
Spectral Based Approach
• Traditional speaker recognition systems use:
  • Cepstral features
  • Gaussian Mixture Models (GMMs)
• Feature extraction pipeline: sliding window → Fourier transform → magnitude → log → cosine transform
• Scoring: a speaker model is adapted from a background model, and the score is the log likelihood ratio between the two
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, “Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1–3), January/April/July 2000
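A minimal numpy sketch of the adapted-GMM idea from the Reynolds et al. reference: MAP-adapt the background model (UBM) means toward a speaker’s enrollment frames, then score a test utterance by the average frame-level log likelihood ratio. Diagonal covariances, means-only adaptation with a relevance factor, and the toy two-mixture data are all simplifying assumptions:

```python
import numpy as np

def gmm_logpdf(X, w, mu, var):
    """Frame-level log density under a diagonal-covariance GMM.
    X: (T, D) frames; w: (M,) weights; mu, var: (M, D)."""
    d = X[:, None, :] - mu[None, :, :]                        # (T, M, D)
    log_comp = (-0.5 * (d * d / var).sum(-1)
                - 0.5 * np.log(2 * np.pi * var).sum(-1)
                + np.log(w))
    m = log_comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))  # logsumexp

def map_adapt_means(X, w, mu, var, r=16.0):
    """Adapt UBM means toward enrollment data X with relevance factor r."""
    d = X[:, None, :] - mu[None, :, :]
    log_comp = (-0.5 * (d * d / var).sum(-1)
                - 0.5 * np.log(2 * np.pi * var).sum(-1)
                + np.log(w))
    post = np.exp(log_comp - log_comp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)                        # responsibilities
    n = post.sum(0)                                           # soft counts
    ex = post.T @ X / np.maximum(n, 1e-10)[:, None]           # data mean per mixture
    alpha = (n / (n + r))[:, None]
    return alpha * ex + (1 - alpha) * mu                      # interpolate

# Toy UBM and enrollment data near the first mixture
rng = np.random.default_rng(1)
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.ones((2, 2))
enroll = rng.normal([0.5, 0.5], 1.0, (200, 2))
mu_spk = map_adapt_means(enroll, w, mu, var)

test = rng.normal([0.5, 0.5], 1.0, (50, 2))
llr = (gmm_logpdf(test, w, mu_spk, var).mean()
       - gmm_logpdf(test, w, mu, var).mean())
print(f"avg frame LLR = {llr:.3f}")
```

Mixtures with more enrollment mass move further from the UBM (larger alpha), which is what makes the adapted model a speaker model rather than a from-scratch GMM.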
Features: Levels of Information
Hierarchy of perceptual cues, from high-level learned behaviors down to low-level physical characteristics:
• Semantic / Dialogic / Idiolectal: semantics, idiolects, pronunciations, idiosyncrasies (shaped by socio-economic status, education, place of birth)
• Prosodic / Phonetic: prosody, rhythm, speed, intonation, volume modulation (shaped by personality type, parental influence)
• Spectral: acoustic aspects of speech such as nasal, deep, breathy, or rough voice quality (shaped by the anatomical structure of the vocal apparatus)
Low level features
•Speech production model: source-filter interaction
•Anatomical structure (vocal tract/glottis) conveyed in speech spectrum
• Diagram: glottal pulses (source) → vocal tract (filter) → speech signal
Word N-gram Features
Idea (Doddington 2001):
•Word usage can be idiosyncratic to a speaker
•Model speakers by relative frequencies of word N-grams
•Reflects vocabulary AND grammar
•Cf. similar approaches for authorship and plagiarism detection on text documents.
•First (unpublished) use in speaker recognition: Heck et al. (1998)
Implementation:
• Get 1-best word recognition output
• Extract N-gram frequencies
• Model likelihood ratio, OR
• Model frequency vectors by SVM

Example N-gram frequencies:
  I_shall  0.002
  I_think  0.025
  I_would  0.012
  ...      ...
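Extracting a relative-frequency table like the one above from a 1-best transcript is a few lines (the underscore-joined `I_think`-style tokens follow the slide’s convention; the sample sentence is made up):

```python
from collections import Counter

def ngram_freqs(words, n=2):
    """Relative frequencies of word N-grams, joining words with '_'
    to form tokens like 'i_think'."""
    grams = ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

words = "i think that is a good plan and i think so".split()
freqs = ngram_freqs(words)
print(freqs["i_think"])  # 2 of the 10 bigrams -> 0.2
```

These per-speaker frequency vectors feed either a likelihood-ratio model or an SVM, as listed in the implementation bullets.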
Phone N-gram features
Model the pattern of phone usage or “short-term pronunciation” for a speaker:
• Open-loop phone recognition produces a phone lattice
• Extract relative frequencies of phone N-grams, e.g.:
  jh     0.0254
  zh eh  0.0068
  k      0.0198
• Feed the frequency vectors (target +, impostor -) to a Support Vector Machine (SVM) to produce a score
MLLR transform vectors as features
• Adapt speaker-independent phone-class models (e.g., phone class A, phone class B) to the speaker via Maximum Likelihood Linear Regression (MLLR)
• The speaker-dependent MLLR transform coefficients themselves serve as features
Models
• HMMs:
  • text-dependent (could use whole word/phone models)
  • prompted (phone models)
  • text-independent (use LVCSR) -- or GMMs!
• templates: DTW (if text-dependent system)
• nearest neighbor: frame level, training data as “model”, non-parametric
• neural nets: train explicitly discriminating models
• SVMs
Speaker Models -- HMM
• Speaker models (voiceprints) represent voice biometric in compact and generalizable form
[Diagram: left-to-right HMM states for “h-a-d”]
• Modern speaker verification systems use Hidden Markov Models (HMMs)
– HMMs are statistical models of how a speaker produces sounds
– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.
– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.
Speaker Models – HMM/GMM
The form of the HMM depends on the application:
• Fixed phrase (“Open sesame”): word/phrase models
• Prompted phrases/passwords (/s/ /i/ /x/): phoneme models
• General speech, text-independent: single-state HMM (i.e., a GMM)
Word N-gram Modeling: Likelihood Ratios

  Score = (1/N) Σ_j log [ P_Speaker(token_j) / P_Background(token_j) ]

• Model N-gram token log likelihood ratio
• Numerator: speaker language model estimated from enrollment data
• Denominator: background language model estimated from large speaker population
• Normalize by token count
• Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
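The scoring rule can be sketched as follows; flooring the probability of tokens unseen by either model stands in for proper language-model smoothing (an assumption for illustration, not the published recipe), and the counts and probabilities are toy values:

```python
import math

def ngram_llr_score(token_counts, p_speaker, p_background, floor=1e-6):
    """Token-count-normalized log likelihood ratio over N-gram tokens.
    Numerator LM estimated from enrollment data, denominator from a
    large background population."""
    total = sum(token_counts.values())
    score = 0.0
    for tok, c in token_counts.items():
        ps = p_speaker.get(tok, floor)
        pb = p_background.get(tok, floor)
        score += c * math.log(ps / pb)
    return score / total  # normalize by token count

counts = {"i_think": 3, "you_know": 1}
p_spk = {"i_think": 0.025, "you_know": 0.010}
p_bkg = {"i_think": 0.012, "you_know": 0.011}
print(round(ngram_llr_score(counts, p_spk, p_bkg), 3))
```

A positive score means the test speech uses N-grams more like the enrolled speaker than like the background population.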
Speaker Recognition with SVMs
• Each speech sample (training or test) generates a point in a derived feature space
• The SVM is trained to separate the target sample from the impostor (= UBM) samples
• Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point
• SVM training is biased against misclassifying positive examples (typically very few, often just 1)
[Diagram: background samples, target sample, and test sample in the derived feature space]
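A toy rendering of this setup: one target sample against several background (UBM) samples, a linear SVM trained with a class weight that biases against misclassifying the positive example, and scores taken as signed distance to the hyperplane. The subgradient trainer is a simple stand-in for a real SVM package, and all data points are made up:

```python
import numpy as np

def train_linear_svm(X, y, pos_weight=10.0, lam=0.01, lr=0.1, epochs=200):
    """Tiny linear SVM via subgradient descent on a weighted hinge loss.
    pos_weight > 1 biases training against misclassifying the (few)
    positive target examples, as described on the slide."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            weight = pos_weight if yi > 0 else 1.0
            if yi * (xi @ w + b) < 1:            # margin violation
                w += lr * (weight * yi * xi - lam * w)
                b += lr * weight * yi
            else:
                w -= lr * lam * w
    return w, b

def svm_score(w, b, x):
    """Signed distance from the decision hyperplane to the test point."""
    return (x @ w + b) / np.linalg.norm(w)

# One target sample (+1) vs. several background/UBM samples (-1)
X = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, -0.5], [-1.0, 0.2]])
y = np.array([1, -1, -1, -1])
w, b = train_linear_svm(X, y)
print("score near target:", round(svm_score(w, b, np.array([1.8, 1.9])), 2))
```

Test points near the lone target sample land on the positive side of the hyperplane and receive positive scores; points among the background samples score negative.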
Feature Transforms for SVMs
• SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features
• However, we need a “sequence kernel”
• Dominant approach: transform variable-length feature stream into fixed, finite-dimensional feature space
• Then use linear kernel
• All the action is in the feature transform!
Combination of Systems
• Systems work best in combination, especially ones using “higher level” features
• Need to estimate the optimal combination weights, e.g., using a neural network
• Combination weights are trained on a held-out development dataset
[Diagram: GMM, MLLR, word-HMM, and phone-N-gram subsystem scores feed a neural-network combiner]
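A single-neuron (logistic) combiner makes the idea concrete: weights are fit on a held-out development set of subsystem scores. The two-subsystem toy data is an assumption, and a real system might use a larger neural network as the slide suggests:

```python
import numpy as np

def train_combiner(scores, labels, lr=0.5, epochs=500):
    """Logistic-regression score combiner (a single-neuron 'network').
    scores: (N, K) subsystem scores; labels: (N,) 1=target, 0=impostor."""
    w = np.zeros(scores.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))   # sigmoid
        w -= lr * (scores.T @ (p - labels)) / len(labels)
        b -= lr * np.mean(p - labels)
    return w, b

# Toy dev set: two subsystems, the first more discriminative
rng = np.random.default_rng(2)
tgt = np.column_stack([rng.normal(2, 1, 300), rng.normal(1, 1, 300)])
imp = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1, 300)])
X = np.vstack([tgt, imp])
y = np.r_[np.ones(300), np.zeros(300)]
w, b = train_combiner(X, y)
print("combination weights:", np.round(w, 2))
```

The learned weights reflect each subsystem’s discriminative power: the more separable first subsystem earns the larger weight.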
Variability: The Achilles Heel...
• Variability (extrinsic & intrinsic) in the spectrum can cause error
• The data focus has mainly been extrinsic
• “Channel” mismatch:
  • Microphone: carbon-button, hands-free, ...
  • Acoustic environment: office, car, airport, ...
  • Transmission channel: landline, cellular, VoIP, ...
• Three compensation approaches: feature-based, model-based, score-based
[Chart, NIST ’96 vs. ’99: error rates on mismatched handsets were a factor of 20 worse than matched in ’96, but only a factor of 2.5 worse by ’99; compensation techniques help reduce error]
NIST Speaker Verification Evaluations
• Annual NIST evaluations of speaker verification technology (since 1996)
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
[Diagram: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) drive an evaluate/improve cycle with technology developers, comparing technologies on a common task]
The NIST Evaluation Task
• Conversational telephone speech, interview
• Landline, cellular, hands-free, multiple mics in room
• 5 min of conversation between two speakers
• Various conditions, e.g.,
  • Training: 8, 1, or other number of conversation sides
  • Test: 1 conversation side, 30 secs, etc.
• Evaluation:
  • Equal Error Rate (EER)
  • Decision Cost Function (DCF), with (C_Miss, C_FalseAlarm, P_Target) = (10, 1, 0.01)
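The DCF with those parameters can be written down directly; note that with P_Target = 0.01, a trivial system that rejects every trial costs C_Miss · P_Target = 0.1, which is often used as a normalization reference:

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST detection cost function with the slide's parameters
    (C_Miss, C_FalseAlarm, P_Target) = (10, 1, 0.01)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# A system that rejects everything: misses all targets, no false alarms
print(dcf(p_miss=1.0, p_fa=0.0))  # 0.1
```

Because targets are rare (1%), false alarms are weighted heavily relative to their raw rate, which pushes systems toward high-security operating points on the DET curve.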
The End
• What’s one interesting thing you learned today that you might share with a friend over dinner?
Word Conditional Models -- example
• Boakye et al. (2004)
• 19 words and bigrams
  • Discourse markers:
{actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
• Filled pauses: {um, uh}
• Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know }
•Trained whole-word HMMs, instead of GMMs, to model evolution of speech in time
•Combines well with low-level (i.e., cepstral GMM) system, especially with more training data
Phone N-Grams -- example
• Idea (Hatch et al., ’05): model the pattern of phone usage or “short-term pronunciation” for a speaker
• Use open-loop phone recognition to obtain phone hypotheses
•Create models of relative frequencies of phone n-grams of the speaker vs. “others”
•Use SVM for modeling
•Combines well, esp. with increased data
•Works across languages