APSIPA (Asia-Pacific Signal and Information Processing Association) Distinguished Lecture
Speaker Verification – The present and future of voiceprint based security
Prof. Eliathamby Ambikairajah, Head of School of Electrical Engineering & Telecommunications, University of New South Wales, Australia
30 June 2014
APSIPA Distinguished Lecture Series @ HCMUT, Vietnam
APSIPA Distinguished Lecturer Programs
APSIPA was founded in 2009. The APSIPA Distinguished Lecturer Program (DLP) is an educational program that:
• Serves its communities by organising lectures given by distinguished experts.
• Promotes the research and development of signal and information processing in the Asia-Pacific region.
• An APSIPA distinguished lecturer is an ambassador of APSIPA, promoting APSIPA's image and reaching out to new members.
APSIPA Distinguished Lecturer Program – http://www.apsipa.org/edu.htm
Introduction
What can you infer from speech?
• What is being said
• What language is being spoken
• Who is speaking
• Gender of the speaker
• Age of the speaker
• Emotion
• Stress level
• Cognitive load level
• Depression level
• Is the person sleepy?
• Is the person inebriated?
Introduction
• Speech conveys several types of information:
– Linguistic: message and language information
– Paralinguistic: emotional and physiological characteristics
[Figure: the utterance "How are you?" fed to five systems – Speech Recognition ("How are you?"), Language Recognition (English), Speaker Recognition (Hsing Ming), Emotion Recognition (Happy) and Cognitive Load Estimation (Low load). The first two outputs are linguistic; the rest are paralinguistic.]
Speaker Identification: determines who is speaking, given a set of enrolled speakers.
Speaker Verification: determines whether an unknown voice is from the claimed speaker.
Speaker Diarization: partitions an input audio stream into homogeneous segments according to speaker identity.
[Figure: diarization of an audio stream into speaker-homogeneous segments – Speaker 1, Speaker 2, Speaker 1.]
Speaker Verification Applications – Biometrics
• Physical facilities – access control
• Telephone credit card purchases – transaction authentication
References: I2R @ Singapore; YouTube
Speaker Verification System – Basic Overview
• In automatic speaker verification:
– The front-end converts the speech signal into a more convenient representation (typically a set of feature vectors).
– The back-end compares this representation to a model of a speaker to determine how well they match.
[Figure: block diagram – Speech → Feature Extraction (front-end) → Classification against a Speaker Model → Decision Making → Accept/Reject (back-end).]
UBM: represents a general, speaker-independent model that is compared against a person-specific model when making an accept or reject decision.
[Figure: verification of the claim "I am John" – features are extracted and the level of match is determined against both John's model and a generic male model, yielding the likelihood of John and the likelihood of a generic male; likelihood-ratio decision making outputs NOT JOHN (i.e. reject).]
Speaker Verification System – Speaker Enrolment
Detailed Speaker Verification System
Front-end: Feature Extraction
[Figure: the speech waveform is split into short frames; each frame yields a vector of cepstral coefficients c0, c1, c2, …, cn.]
*n is usually 13 for speaker verification
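The per-frame cepstral extraction sketched above can be illustrated with a deliberately simplified front-end. This is a sketch only: a real MFCC front-end would insert a mel filterbank before the log and apply liftering; the function name and parameter values here are illustrative, not from the lecture.

```python
import numpy as np

def extract_cepstra(signal, frame_len=400, hop=160, n_coeffs=13):
    """Simplified cepstral front-end: frame the signal, window it,
    take the log power spectrum, then a DCT to decorrelate.
    (A production MFCC front-end adds a mel filterbank before the log.)"""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        log_power = np.log(power + 1e-10)            # avoid log(0)
        # DCT-II as an explicit matrix multiply (keeps the sketch dependency-free)
        k = np.arange(len(log_power))
        basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1)
                       / (2 * len(log_power)))
        frames.append(basis @ log_power)             # c0 .. c12 for this frame
    return np.array(frames)

# Example: 1 second of synthetic "speech" at 16 kHz, 25 ms frames, 10 ms hop
rng = np.random.default_rng(0)
features = extract_cepstra(rng.standard_normal(16000))
print(features.shape)    # one 13-dimensional vector per frame
```

Each row of `features` is one frame's (c0, c1, …, c12) vector, matching the per-frame coefficient vectors shown on the slide.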
Speaker Modelling
[Figure: an example one-dimensional probability density function over feature dimension 1 (c0), approximated by a 3-component Gaussian mixture model.]
Each Gaussian mixture component consists of a mean (μ), a covariance (Σ) and a weight (w).
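A minimal sketch of the 3-component mixture idea in one dimension; the component parameters below are made up for illustration, not taken from the figure.

```python
import numpy as np

# 3-component 1-D GMM: each component has a mean, a variance
# (the covariance in 1-D) and a weight; the weights sum to 1.
weights   = np.array([0.3, 0.5, 0.2])
means     = np.array([-2.0, 0.0, 3.0])
variances = np.array([1.0, 0.5, 2.0])

def gmm_pdf(x):
    """Probability density of the mixture at point(s) x."""
    x = np.atleast_1d(x)[:, None]
    comp = (np.exp(-0.5 * (x - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))
    return (comp * weights).sum(axis=1)   # weighted sum of component densities

# Sanity check: the mixture density integrates to 1 (numerically, on a grid)
grid = np.linspace(-10, 10, 2001)
area = gmm_pdf(grid).sum() * (grid[1] - grid[0])
```

Because the weights sum to one and each component is itself a valid density, the mixture is a valid density too, which is what lets a GMM approximate an arbitrary feature distribution.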
Database for creating a UBM (example)
• Training set – 56 male speakers (2 minutes of active speech each) for creating the UBM
• Target set – 20 male speakers (2 minutes of active speech each) for the speaker-specific models
• Test set – 250 male utterances of known identity (each speaker has many test utterances)
Databases
• The data involved in building an automatic speaker verification system is split into three sets:
– Background set: a huge dataset containing speech from many speakers from the expected target population; used to train the UBM, subspace models and/or to draw impostor models for score normalisation.
– Development set: an evaluation database used for tuning system parameters to ensure classification performance is maximised for the expected conditions of audio acquisition.
– Evaluation set: a second, independent evaluation database used to evaluate the final system (that was optimised on the development set).
[Figure: the Universal Background Model (UBM) consists of 1024 Gaussian mixtures over the feature space; the target speaker model, derived from the UBM using the target speaker's data, also consists of 1024 Gaussian mixtures. Each Gaussian mixture consists of a mean (μ), covariance (Σ) and weight (w); the figure annotates example components with means 0.9 and 0.8, covariances 0.9 and 0.5, and weights 0.2 and 0.3, plotted over feature dimensions 1 and 2.]
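In classical GMM-UBM systems the target speaker model is typically obtained by MAP-adapting the UBM toward the target speaker's data. The lecture figure does not spell out the update, so the mean-only sketch below is an assumption about the method; the relevance factor and the toy numbers are illustrative.

```python
import numpy as np

def map_adapt_means(ubm_means, frames, responsibilities, r=16.0):
    """MAP-adapt UBM component means toward a target speaker's data.
    responsibilities[t, k] = posterior probability that frame t came from
    component k (computed by an E-step against the UBM); r is the
    relevance factor controlling how fast components move."""
    n_k = responsibilities.sum(axis=0)                 # soft frame counts per component
    ex_k = (responsibilities.T @ frames) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + r))[:, None]                 # per-component adaptation weight
    return alpha * ex_k + (1 - alpha) * ubm_means      # interpolate data mean and UBM mean

# Toy example: 2 components in 2-D, both frames assigned to component 0
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
frames    = np.array([[1.0, 1.0], [1.0, 1.0]])
resp      = np.array([[1.0, 0.0], [1.0, 0.0]])
adapted = map_adapt_means(ubm_means, frames, resp, r=2.0)
# Component 0 moves toward the data; component 1 (no data) stays at the UBM mean
```

This is why components of the target model that saw little target speech remain close to their UBM counterparts, as the figure suggests.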
Representing GMMs
The UBM and each speaker model is a GMM. Each is represented by a vector of weights, a matrix of means and a matrix of covariances – e.g. for a 1024-mixture GMM over 39-dimensional features:
• 1×1024 weight vector
• 39×1024 means matrix
• 39×1024 covariances matrix
Model Representation – Supervectors
• Each GMM (speaker model) can be represented by a supervector.
• A supervector is formed by concatenating all the individual mixture means, scaled by the corresponding weights and covariances.
• Speech of different durations can thus be represented by supervectors of fixed size.
• Model normalisation is typically carried out on supervectors.
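The construction above can be sketched directly. The exact scaling varies between systems; the sqrt(weight)/sqrt(covariance) form below is one common choice and is an assumption, not specified by the slide.

```python
import numpy as np

def supervector(weights, means, covariances):
    """Concatenate the mixture means, each scaled by the square root of its
    weight and the inverse square root of its (diagonal) covariance — one
    common scaling for GMM mean supervectors.
    means/covariances have shape (dim, n_mix); weights has shape (n_mix,)."""
    scaled = np.sqrt(weights) * means / np.sqrt(covariances)
    return scaled.T.reshape(-1)          # fixed length: n_mix * dim

# A 39-dimensional, 1024-mixture model always yields the same supervector
# length, no matter how much speech was used to train the GMM
w = np.full(1024, 1.0 / 1024)
m = np.zeros((39, 1024))
c = np.ones((39, 1024))
sv = supervector(w, m, c)
print(sv.shape)   # (39936,) = 1024 mixtures x 39 dimensions
```

The fixed 39936-dimensional output is what makes supervectors convenient: utterances of any duration map to points in one common vector space.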
Decision Making
[Figure: features extracted from the test utterance S are scored against the claimed speaker's model (e.g. John's model, giving the likelihood of John) and against the UBM (giving the likelihood of a generic male).]

Score, L = log [ (likelihood S came from the speaker model) / (likelihood S did not come from the speaker model) ] ≷ θ

If L exceeds the threshold θ the claim is accepted; otherwise it is rejected.
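The log-likelihood-ratio decision above can be sketched with single-Gaussian stand-ins for the speaker model and the UBM (a real system would use the full GMMs; the models and threshold below are toy values).

```python
import numpy as np

def log_gauss(x, mean, var):
    """Per-frame log density of a 1-D Gaussian — a stand-in for the
    GMM log-likelihoods a real system would compute."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def llr_score(frames, spk_model, ubm_model):
    """Average log-likelihood ratio between the claimed speaker model and
    the UBM; averaging over frames makes the score length-independent."""
    return np.mean(log_gauss(frames, *spk_model) - log_gauss(frames, *ubm_model))

THRESHOLD = 0.0
spk = (1.0, 1.0)    # (mean, variance) of the claimed speaker's model
ubm = (0.0, 4.0)    # broader, speaker-independent background model

genuine = np.full(50, 1.0)           # frames close to the speaker's mean
score = llr_score(genuine, spk, ubm)
decision = "accept" if score > THRESHOLD else "reject"
```

Because frames near the speaker's mean are more likely under the speaker model than under the broad UBM, the ratio is positive and the claim is accepted.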
Score Normalisation
Different systems perform speaker verification in parallel, and their raw scores may not fall in the same range, i.e. they are NOT directly comparable. Normalised scores (L) are comparable.
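One widely used normalisation is Z-norm, which standardises each system's scores using statistics gathered from impostor trials; the slide does not name a specific method, so treat this as one representative choice, with made-up scores.

```python
import numpy as np

def z_norm(score, impostor_scores):
    """Z-norm: shift and scale a raw score by the mean and standard
    deviation of scores the same system assigns to impostor trials,
    so scores from different systems land on a comparable scale."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (score - mu) / sigma

# Two systems with very different raw score ranges...
imp_a = np.array([10.0, 12.0, 11.0, 9.0])
imp_b = np.array([0.10, 0.12, 0.11, 0.09])
# ...give the same normalised score to equally "surprising" trials
norm_a = z_norm(14.0, imp_a)
norm_b = z_norm(0.14, imp_b)
```

After normalisation the two systems' scores can be meaningfully compared or combined, which is exactly what the fusion stage that follows requires.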
Fusion
Several speaker verification systems (1 … N) process the same speech in parallel; each system's score is normalised (s1, s2, …, sN) and the final score is a weighted sum of the scores from each system:

Final Score = w1·s1 + w2·s2 + … + wN·sN
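The weighted sum above as a sketch; the weights here are hypothetical, and in practice they would be tuned on the development set (e.g. with logistic regression).

```python
import numpy as np

def fuse(scores, weights):
    """Weighted-sum fusion of normalised subsystem scores."""
    scores, weights = np.asarray(scores), np.asarray(weights)
    return float(np.dot(weights, scores))

s = [1.2, 0.8, 2.0]     # normalised scores from 3 subsystems
w = [0.5, 0.3, 0.2]     # illustrative dev-set-tuned weights
final = fuse(s, w)      # 0.5*1.2 + 0.3*0.8 + 0.2*2.0 = 1.24
```

The fused score then feeds the same threshold decision as a single system's score would.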
Performance measure
• Types of error:
– Misses: a valid identity is rejected.
Probability of miss: the ratio of the number of falsely rejected true speakers to the total number of true-speaker trials.
– False acceptances: an invalid identity is accepted.
Probability of false acceptance: the ratio of the number of falsely accepted impostors to the total number of impostor trials.
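The two error probabilities defined above can be computed directly from trial scores; the scores and threshold below are a toy example.

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Miss and false-acceptance probabilities at a given threshold,
    following the definitions above: misses are genuine trials scored
    at or below the threshold; false acceptances are impostor trials
    scored above it."""
    misses = sum(s <= threshold for s in true_scores)
    false_accepts = sum(s > threshold for s in impostor_scores)
    p_miss = misses / len(true_scores)
    p_fa = false_accepts / len(impostor_scores)
    return p_miss, p_fa

# 4 genuine trials and 5 impostor trials, threshold 0.5:
# one genuine trial falls below the threshold, one impostor above it
p_miss, p_fa = error_rates([0.9, 0.7, 0.4, 0.8],
                           [0.2, 0.6, 0.1, 0.3, 0.4], 0.5)
```

Raising the threshold trades false acceptances for misses and vice versa, which is why both probabilities must be reported together (e.g. as a DET curve or an equal error rate).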