Centre for Vision, Speech and Signal Processing
Speaker and Speech Recognition: Speaker Recognition and Verification
Josef Kittler
Centre for Vision, Speech and Signal Processing
University of Surrey, Guildford GU2 7XH
[email protected]
www.ee.surrey.ac.uk/Personal/J.Kittler/lecturenotes

OUTLINE
• Introduction
• Terminology
• Problem formulation
• Speech representation
• Text independent methods
• Text dependent methods
Source: info.ee.surrey.ac.uk/Teaching/Courses/eem.ssr/Speaker1_JK09.pdf
Biometric functionalities
• Verification
• Identification
• Screening (watch list)
• Retrieval (detection of multiple identities)
• Negation of denial
Historical notes
• Bertillon system
Advantages of biometrics
• Convenience
• Fraud reduction
Disadvantages
• Output is a score
• Cannot be replaced
• Open to attack
• Privacy concerns
Application Characteristics
• Cooperative / non-cooperative
• Scalable / non-scalable
• Private / public
• Closed / open
• High-level / low-level security
• Habituated / non-habituated
Selection criteria
• Universality (all users possess this biometric)
• Permanence
• Uniqueness
• Collectability
• Acceptability
• Open to attack
• Failure to enroll
Introduction
Voice-based speaker recognition/verification: why?
• Voice is one of the main biometric modalities used by humans for personal identity recognition and verification
• User friendly
• Natural interface
• Conveys emotion
• Can be used over the telephone
Terminology
Speaker identification -- using utterances from a speaker, determine who he/she is out of a set of known speakers
Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)
Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify a speaker. Applies to both SI/SV.
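The difference between the two tasks can be sketched in a few lines. This is a minimal illustration, assuming that each enrolled voiceprint has already produced a match score for the test utterance; the speaker names, the scores and the threshold are hypothetical.

```python
from typing import Dict


def identify(scores: Dict[str, float]) -> str:
    """Closed-set identification: return the enrolled speaker with the best score."""
    return max(scores, key=scores.get)


def verify(scores: Dict[str, float], claimed_id: str, threshold: float) -> bool:
    """Verification: accept only if the claimed speaker's score reaches the threshold."""
    return scores[claimed_id] >= threshold


# Hypothetical match scores of one utterance against three enrolled voiceprints.
scores = {"alice": 0.82, "bob": 0.35, "carol": 0.41}
best_match = identify(scores)                               # no identity claim needed
accepted = verify(scores, claimed_id="bob", threshold=0.5)  # requires an identity claim
```

Identification compares against every enrolled speaker, whereas verification only inspects the model of the claimed identity.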
Voice biometrics properties
Biometric signal (voice)
Speaker verification is very compelling:
• Voice is convenient.
• Voice is ubiquitous.
• Voice is inexpensive.
• Voice provides challenge-response security.
Unfortunately:
• It is sometimes inconvenient because it does not work well ubiquitously, and it requires a backup solution.
Voice Recognition Applications
Voice based identity verification for
• Web banking
• Physical access control
• Border control
• Identity cards/driver’s licence
• Personalisation
Entertainment
• Robotic toys
Law enforcement/surveillance
• Telephone surveillance
Application to web banking
Figure: system diagram. Labelled components: Client, Microphone, Card Reader, Smart Card, Internet, Server, Services Provider, Private Network.
Biometric Applications
Forensic
• Criminal investigation
Government
• Identity card/passport
• Driver’s licence
• Social security
Commercial
• ATM
• E-commerce/banking
• Nomadic working
• Mobile phone
• Personalisation
Market Outlook for the Future
* Voice ID: Applications and Markets for the New Millennium 1999 J. Markowitz, Consultants
Figure: speech processing taxonomy. Labels: Speech Processing, Speech Recognition, Speech Synthesis, Digitized Speech, Input, Voice Biometrics, Speaker Verification, Speaker Identification, …
Voice template
Set of features extracted from the raw biometric data during enrolment
Represents typical values of voice biometric
Multiple templates may be stored to account for intra class variability
Template issues
• Aging (maintenance)
• Central/distributed storage
• Privacy and protection
Deficiencies of existing commercial solutions
• Sensitive to speech acquisition conditions
• Sensitive to background noise
• Sensitive to emotional state
• Sensitive to physical state of the user
• Quite effective for closed set applications under tightly controlled voice acquisition conditions

• Large number of classes
• Segmentation
• Noise and distortion
• Variability of deployed microphones (interoperability)
• Population coverage and scalability
• System performance
• System attacks
• Aging
• Non-uniqueness of biometrics
• Privacy concerns (smart cards)
System Attack
Figure: points of attack in the verification chain. Labels: impersonation, speech data, features, template, decision, MFCC, GMM.
Generic Architecture
Figure: processing blocks. Labels: Silence detection, Energy normalisation, Feature extraction, Recognition.
Related processes
• Background noise removal
• Phoneme/speech recognition
Privacy issues
• Template protection
• Biometrics can be used to track people (secretly), a violation of their right to privacy (big brother)
• Biometric data may be used for other than intended purposes
• Biometric database linking
Evaluation protocol
Test data and procedures adopted to evaluate a biometric system
Evaluation should be conducted by an independent body
Test and biometric data used should not have previously been seen by the system
Data use cases
• Training set
• Evaluation set
• Test set
Data set sizes should allow statistically significant evaluation
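A minimal sketch of the three data roles named above, assuming recordings are referenced by integer identifiers and partitioned at random. The function name, the split sizes and the seed are hypothetical; a published protocol fixes the partition explicitly rather than drawing it at random.

```python
import random


def split_recordings(recording_ids, n_train, n_eval, seed=0):
    """Partition recordings into disjoint training, evaluation and test sets."""
    ids = list(recording_ids)
    random.Random(seed).shuffle(ids)           # reproducible shuffle
    train = ids[:n_train]                      # used to build client models
    evaluation = ids[n_train:n_train + n_eval] # used to tune thresholds
    test = ids[n_train + n_eval:]              # never seen before the final test
    return train, evaluation, test


train_ids, eval_ids, test_ids = split_recordings(range(10), n_train=6, n_eval=2)
```

Keeping the three sets disjoint is what guarantees that the reported test error comes from data the system has not previously seen.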
XM2VTS database
• Face images and speech recordings of 295 people
• Subjects recorded in 4 sessions
• Lausanne Protocol
Performance Evaluation
Performance criteria
• Failure to enrol
• Accuracy
• Speed
• Storage
• Costs
• Ease of use
• Failure to acquire
Accuracy
Measured in terms of
• False rejections/identification
• False acceptances
Falsely accepted users are impostors
Performance characterisation issues
• Genuine ambiguity
• Confidence
• Competence
Performance characterisation (verification)
• False rejection
• False acceptance
• Total error rate/Half total error rate
• Operating point
  • Equal error rate (civilian)
  • Zero false acceptance (high security, forensic)
• Test set/evaluation set
• Receiver operating characteristic
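These quantities can be illustrated with a short sketch. It assumes that higher scores mean a better match, that a claim is accepted when the score is at or above the threshold, and that the equal error rate operating point is chosen on the evaluation set and then applied to the test set. All score values are made up for illustration.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far


def half_total_error_rate(genuine, impostor, threshold):
    """HTER: the average of the false rejection and false acceptance rates."""
    frr, far = error_rates(genuine, impostor, threshold)
    return 0.5 * (frr + far)


def equal_error_threshold(genuine, impostor):
    """Pick the observed score at which |FAR - FRR| is smallest."""
    def gap(t):
        frr, far = error_rates(genuine, impostor, t)
        return abs(far - frr)
    return min(sorted(set(genuine) | set(impostor)), key=gap)


# Hypothetical evaluation-set scores: used only to fix the operating point.
eval_genuine = [0.9, 0.8, 0.7, 0.4]
eval_impostor = [0.1, 0.2, 0.3, 0.6]
threshold = equal_error_threshold(eval_genuine, eval_impostor)

# Hypothetical test-set scores: used only to report performance.
test_hter = half_total_error_rate([0.85, 0.75, 0.3], [0.15, 0.5, 0.05], threshold)
```

Sweeping the threshold over all observed scores and plotting the resulting pairs of error rates gives the receiver operating characteristic.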
Performance characterisation (identification)
Confusion matrix
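A minimal sketch of a confusion matrix for closed-set identification, assuming each test utterance has a true speaker label and a predicted one. The labels are hypothetical; rows index the true speaker and columns the predicted speaker.

```python
from collections import Counter


def confusion_matrix(true_ids, predicted_ids):
    """Count how often each true speaker (row) is labelled as each speaker (column)."""
    speakers = sorted(set(true_ids) | set(predicted_ids))
    counts = Counter(zip(true_ids, predicted_ids))
    matrix = [[counts[(t, p)] for p in speakers] for t in speakers]
    return speakers, matrix


def identification_rate(matrix):
    """Fraction of utterances on the diagonal, i.e. correctly identified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


true_ids = ["a", "a", "b", "b", "c", "c"]
predicted_ids = ["a", "b", "b", "b", "c", "a"]
speakers, matrix = confusion_matrix(true_ids, predicted_ids)
rate = identification_rate(matrix)
```

Off-diagonal entries show which speakers are confused with which, information that a single error rate hides.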
Reference
1. Douglas A. Reynolds, Thomas F. Quatieri, Robert B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10 (2000), 19-41.
2. A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki. The DET curve in assessment of detection task performance. Eurospeech 97, 1895-1898.
Preprocessing and feature extraction
Figure: block diagram. Labelled stages: A/D Converter, Silence Detector, Hamming Window, LPC Analysis; output: cepstrum and delta cepstrum coefficients.
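The early stages of such a front end can be sketched with NumPy. This is an illustration under stated assumptions, not the system described in the slides: the frame length, the hop size, the relative energy threshold of -30 dB and the synthetic test signal are all hypothetical, and the LPC and cepstral analysis stages are omitted.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])


def drop_silence(frames, rel_threshold_db=-30.0):
    """Keep frames whose energy is within rel_threshold_db of the loudest frame."""
    energy = np.sum(frames ** 2, axis=1) + 1e-12  # small floor avoids log(0)
    energy_db = 10.0 * np.log10(energy / energy.max())
    return frames[energy_db > rel_threshold_db]


def window_frames(frames):
    """Apply a Hamming window to every frame."""
    return frames * np.hamming(frames.shape[1])


# Synthetic input: 0.1 s of silence followed by 0.1 s of a 200 Hz tone at 8 kHz.
fs = 8000
t = np.arange(800) / fs
signal = np.concatenate([np.zeros(800), np.sin(2 * np.pi * 200 * t)])

frames = frame_signal(signal, frame_len=200, hop=100)
voiced = drop_silence(frames)
windowed = window_frames(voiced)
```

Each row of `windowed` would then be passed on to spectral or LPC analysis.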
Speech input and spectra
Figure: speech input and spectra for a client and an impostor.
Speech representation
• MFCC feature vectors (24-channel filterbank analysis), with delta coefficients and delta log-energy appended (2-coefficient window)
• 33-component feature vectors
• Energy normalisation
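Appending delta coefficients can be sketched as follows. The sketch assumes that the window refers to two frames on each side of the current frame (K = 2), uses the common regression formula for deltas, and pads the sequence by repeating its first and last frames. These choices and the toy input are assumptions, not details confirmed by the slides.

```python
import numpy as np


def delta(features, K=2):
    """Regression-based delta coefficients over K frames on each side.

    features has shape (frames, coefficients).
    """
    T = features.shape[0]
    padded = np.pad(features, ((K, K), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = np.zeros(features.shape, dtype=float)
    for k in range(1, K + 1):
        # padded[K + t] is features[t], so these slices give c[t+k] and c[t-k]
        out += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return out / denom


# Toy static features: two coefficients that grow linearly over six frames.
static = np.outer(np.arange(6.0), [1.0, 2.0])
deltas = delta(static)
augmented = np.hstack([static, deltas])  # static and delta coefficients per frame
```

For a linearly increasing coefficient the interior deltas equal its slope, which gives a quick sanity check of the implementation.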
Cepstral Processing
Figure: comparison of the FFT-based signal spectrum, the LP spectrum, and the spectrum derived from the LP-cepstrum. Axes: frequency (Hz), amplitude (dB).
Speaker verification problem
Consider that the system has been trained using samples of the input waveform provided by the client.
Each sample is represented by a feature vector $x$
The training speech segment is long enough to create a representative model $p(x \mid \omega_i)$ for each client $i$ and $p(x \mid \omega_0)$ for all speakers
Hypothesis testing
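Now a test speech segment is acquired from a speaker claiming to be client $i$
Given a feature vector $x_j$ corresponding to waveform sample $j$, the probability that the claim is true is given as
$$P(\omega_i \mid x_j) = \frac{p(x_j \mid \omega_i)\,P(\omega_i)}{p(x_j \mid \omega_i)\,P(\omega_i) + p(x_j \mid \omega_0)\,P(\omega_0)}$$
where $p(x \mid \omega_i)$ is the model of client $i$, $p(x \mid \omega_0)$ is the model for all speakers, and $P(\omega_i)$, $P(\omega_0)$ are the corresponding priors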
Likelihood Ratio
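The claim will be accepted if
$$P(\omega_i \mid x_j) > P(\omega_0 \mid x_j)$$
with $\omega_i$ denoting the claimed client and $\omega_0$ all speakers
Assuming the priors are equal, the test becomes
$$\frac{p(x_j \mid \omega_i)}{p(x_j \mid \omega_0)} > 1$$
This is also referred to as the likelihood ratio
We base the decision on more than one sample, hence on
$$\frac{p(x_1, \dots, x_N \mid \omega_i)}{p(x_1, \dots, x_N \mid \omega_0)} > 1$$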
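Assuming the samples $x_1, \dots, x_N$ are independent and identically distributed, we can express the joint probability density (i.e. likelihood) in terms of marginals as
$$p(x_1, \dots, x_N \mid \omega) = \prod_{j=1}^{N} p(x_j \mid \omega)$$
Taking a log we find the loglikelihood
$$\log p(x_1, \dots, x_N \mid \omega) = \sum_{j=1}^{N} \log p(x_j \mid \omega)$$
And the loglikelihood ratio as
$$\sum_{j=1}^{N} \log p(x_j \mid \omega_i) - \sum_{j=1}^{N} \log p(x_j \mid \omega_0) > 0$$
where $\omega_i$ denotes the claimed client and $\omega_0$ all speakers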
Discussion
The independence assumption may not be satisfied in practice
The log likelihood is a function of the number of samples $N$. It may be desirable to perform normalisation by the factor $1/N$
For a large (infinite) sample, the summation will asymptotically become ‘integration’
$$\frac{1}{N}\sum_{j=1}^{N} \log p(x_j \mid \omega) \to \int q(x)\,\log p(x \mid \omega)\,dx$$
where $q(x)$ is the test sample density
Loglikelihood has a meaning only in relative sense
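The log-likelihood ratio test and its normalisation by the number of samples can be sketched as follows. For brevity the sketch assumes diagonal-covariance Gaussian models for the client and for all speakers, each given as a (mean, variance) pair. The model parameters and probe vectors are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_likelihood_ratio(samples, client, world, normalise=True):
    """Sum of per-sample log-likelihood ratios, optionally divided by N."""
    total = sum(log_gauss_diag(x, *client) - log_gauss_diag(x, *world)
                for x in samples)
    return total / len(samples) if normalise else total


# Hypothetical models given as (mean, variance) per feature dimension.
client_model = ([1.0, 1.0], [1.0, 1.0])
world_model = ([0.0, 0.0], [1.0, 1.0])
probe = [[1.0, 1.0], [0.8, 1.2], [1.1, 0.9]]

raw = log_likelihood_ratio(probe, client_model, world_model, normalise=False)
normalised = log_likelihood_ratio(probe, client_model, world_model)
normalised_long = log_likelihood_ratio(probe * 2, client_model, world_model)
accept = normalised > 0.0  # equal priors: accept when the ratio exceeds 1
```

Doubling the probe doubles the raw score but leaves the normalised score unchanged, which is the reason for dividing by the number of samples.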
Gaussian model
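Assuming
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$
The feature vectors are assumed to be normally distributed with mean $\mu_i$ and covariance matrix $\Sigma_i$, where $d$ is the dimension of the feature vector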
Substituting and taking natural log of the client density we find
The left hand side of the inequality can be expanded as
The first term can be rewritten as
Thus the decision rule finally becomes
Its symmetric form
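As a hedged sketch of the kind of expansion this derivation relies on, the following standard identities express the normalised Gaussian log-likelihood of a probe through its sample statistics. The notation is assumed here and is not necessarily that of the slides: $x_1, \dots, x_N$ are the probe feature vectors, $m$ and $S$ their sample mean and maximum-likelihood covariance, $\mu_i$ and $\Sigma_i$ the model parameters, and $d$ the feature dimension.

```latex
\begin{align}
m &= \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad
S = \frac{1}{N}\sum_{j=1}^{N} (x_j - m)(x_j - m)^{T} \\
\frac{1}{N}\sum_{j=1}^{N} \ln p(x_j \mid \omega_i)
  &= -\frac{1}{2}\Big[\, d \ln 2\pi + \ln|\Sigma_i|
     + \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu_i)^{T}\Sigma_i^{-1}(x_j - \mu_i) \Big] \\
\frac{1}{N}\sum_{j=1}^{N} (x_j - \mu_i)^{T}\Sigma_i^{-1}(x_j - \mu_i)
  &= \operatorname{tr}\big(\Sigma_i^{-1} S\big)
     + (m - \mu_i)^{T}\Sigma_i^{-1}(m - \mu_i)
\end{align}
```

The probe therefore enters the score only through $m$ and $S$, so matching reduces to comparing second-order statistics of the probe with those of the model.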
Notes
The number of samples for both model and probe should be as large as possible
Ignoring the means, we have
If model and probe match, i.e. their covariance matrices are equal, the product of the inverse model covariance matrix and the probe covariance matrix is the identity matrix, i.e. an isotropic distribution. Hence the matching criterion measures ‘sphericity’.
It is a sphericity measure
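One way to make the idea concrete is sketched below. It is a commonly used second-order statistic and not necessarily the expression on the slide: with M the product of the inverse model covariance and the probe covariance, the quantity (tr(M) - ln det(M)) / d - 1 is zero when M is the identity and positive otherwise. The function name and the example matrices are hypothetical.

```python
import numpy as np


def sphericity(model_cov, probe_cov):
    """Return (tr(M) - ln det(M)) / d - 1 with M = inv(model_cov) @ probe_cov.

    The value is zero when M is the identity matrix and positive otherwise.
    """
    d = model_cov.shape[0]
    m = np.linalg.solve(model_cov, probe_cov)  # inv(model_cov) @ probe_cov
    _, logdet = np.linalg.slogdet(m)
    return (np.trace(m) - logdet) / d - 1.0


model_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
other_cov = np.array([[1.0, 0.0], [0.0, 3.0]])

matched = sphericity(model_cov, model_cov)     # probe statistics equal the model
mismatched = sphericity(model_cov, other_cov)  # probe statistics differ
```

The measure depends only on the eigenvalues of M: it compares their arithmetic mean with the logarithm of their geometric mean, and the two coincide only when every eigenvalue equals one.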
Other sphericity measures
In essence, in matching a probe and a model, we are measuring the distance between two Gaussian probability densities
Any feature selection criterion could be used for that purpose
The derived sphericity measure resembles divergence
Bhattacharyya measure
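For two Gaussian densities the Bhattacharyya distance has a closed form, sketched below with NumPy. The function name and the example parameters are hypothetical. The first term reflects the difference between the means and the second the difference between the covariance matrices.

```python
import numpy as np


def bhattacharyya_gauss(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)  # average covariance
    diff = mu1 - mu2
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return mean_term + cov_term


identity = np.eye(2)
same = bhattacharyya_gauss(np.zeros(2), identity, np.zeros(2), identity)
shifted = bhattacharyya_gauss(np.array([2.0, 0.0]), identity, np.zeros(2), identity)
```

With equal covariance matrices only the mean term remains. With equal means only the covariance term remains, which makes it another way of comparing the second-order statistics of probe and model.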
Decision threshold
The matching process maps multidimensional speech data into 1D space
In theory, the decision threshold could be derived from the known parameters
In practice
• The distributions will not be exactly Gaussian
• The parameters are estimates subject to error
• We may wish to control the trade-off between false