Page 1
Speech course, 9 March 2005
Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet
Speaker Recognition
1. Introduction, history, application domains
2. Cues to identity in speech
3. Speaker verification
   1. Decision theory
   2. Text-dependent / text-independent
4. Voice imposture
5. Audio-visual identity verification
6. Evaluations
7. Conclusions
Page 2
Why should a computer recognize who is speaking?
• Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...)
• Limited access (secured areas, databases)
• Personalization (only respond to its master’s voice)
• Locate a particular person in an audio-visual document (information retrieval)
• Who is speaking in a meeting?
• Is a suspect the criminal? (forensic applications)
Page 3
Tasks in Automatic Speaker Recognition
• Speaker verification (voice biometrics): Are you really who you claim to be?
• Identification (speaker ID): Is this speech segment coming from a known speaker? How large is the set of speakers (population of the world)?
• Speaker detection, segmentation, indexing, retrieval, tracking: looking for recordings of a particular speaker
• Combining speech and speaker recognition: adaptation to a new speaker, speaker typology, personalization in dialogue systems
Page 4
Applications
• Access control: physical facilities, computer networks, websites
• Transaction authentication: telephone banking, e-commerce
• Speech data management: voice messaging, search engines
• Law enforcement: forensics, home incarceration
Page 5
Voice Biometrics
• Advantages
  - Often the only modality over the telephone
  - Low cost (microphone, A/D), ubiquity
  - Possible integration on a smart (SIM) card
  - Natural bimodal fusion: speaking face
• Disadvantages
  - Lack of discretion
  - Possibility of imitation and electronic imposture
  - Lack of robustness to noise, distortion, ...
  - Temporal drift
Page 6
Speaker Identity in Speech
• Differences in
  - Vocal tract shapes and muscular control
  - Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  - Glottal waveform
  - Phonotactics
  - Lexical usage
• The difference between the voices of twins is a limit case
• Voices can also be imitated or disguised
Page 7
Speaker Identity
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes: frequency f, amplitude A]
• Segmental factors (~30 ms)
  - glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  - vocal tract: characterized by its transfer function and represented by MFCCs (Mel-Frequency Cepstral Coefficients)
• Suprasegmental factors
  - speaking speed (timing and rhythm of speech units)
  - intonation patterns
  - dialect, accent, pronunciation habits
Page 8
What are the sources of difficulty?
• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions,…)
• Recording conditions (filtering, noise,…)
• Channel mismatch between enrolment and testing
• Temporal drift
• Intentional imposture
• Voice disguise
Page 9
Acoustic features
• Short-term spectral analysis
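As a rough illustration of short-term spectral analysis, the sketch below cuts a signal into overlapping windowed frames and computes a log-magnitude spectrum per frame. The 30 ms frame length follows the segmental time scale quoted earlier in the deck; the 10 ms hop, the Hamming window, the naive DFT and all function names are assumptions made for the example, not the course's actual front-end.

```python
import cmath
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def log_magnitude_spectrum(frame):
    """Naive DFT of one frame; log magnitudes for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.log(abs(acc) + 1e-10))  # small offset avoids log(0)
    return spectrum

def short_term_spectra(signal, sample_rate, frame_ms=30.0, hop_ms=10.0):
    """One log-magnitude spectrum per overlapping, Hamming-windowed frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        spectra.append(log_magnitude_spectrum(frame))
    return spectra

# Synthetic test signal (assumption): 0.1 s of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * t / sample_rate) for t in range(800)]
spectra = short_term_spectra(tone, sample_rate)
frame_len = 240  # 30 ms at 8 kHz
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / frame_len
```

A real front-end would use an FFT and usually continue to cepstral coefficients (MFCCs); the naive DFT here only keeps the example dependency-free.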
Page 10
Intra- and Inter-speaker variability
Page 11
Speaker Verification
• Typology of approaches (EAGLES Handbook)
  - Text dependent: public password, private password, customized password, text prompted
  - Text independent
  - Incremental enrolment
  - Evaluation
Page 12
History of Speaker Recognition
Page 13
Current approaches
Page 14
Dynamic Time Warping (DTW)
$$D(X,Y) = \sum_{\text{best path}} d^2(x_i, y_j)$$
[Diagram: the utterance “Bonjour” of test speaker Y is aligned against the reference “Bonjour” of each speaker X (speaker 1, speaker 2, ..., speaker n)]
Doddington 1974, Rosenberg 1976, Furui 1981, etc.
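A minimal sketch of the DTW score, assuming frames are plain feature vectors, the local distance is the squared Euclidean distance, and the path may move one step along either sequence or both. The step pattern and the toy data are assumptions; the systems cited on the slide may use other constraints.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dtw_distance(X, Y):
    """Accumulated squared distance along the best alignment path of X and Y."""
    n, m = len(X), len(Y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = squared_distance(X[i - 1], Y[j - 1])
            # Best predecessor: advance in X only, in Y only, or in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Toy one-dimensional "utterances" (hypothetical data).
reference = [[0.0], [1.0], [2.0], [1.0]]
same_shape = [[0.0], [1.0], [1.0], [2.0], [1.0]]  # same contour, one frame held longer
different = [[3.0], [3.0], [3.0], [3.0]]
d_same = dtw_distance(reference, same_shape)
d_diff = dtw_distance(reference, different)
```

For identification, the test utterance would be scored against each speaker's reference and the smallest distance kept; for verification, the distance to the claimed speaker's reference is compared with a threshold.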
Page 15
Vector Quantization (VQ)
$$D(X,Y) = \sum_{\text{best quant.}} d^2(y_j, C_i^X)$$
[Diagram: the utterance “Bonjour” of test speaker Y is quantized with the dictionary (codebook) of each speaker X (dictionary of speaker 1, speaker 2, ..., speaker n)]
Soong, Rosenberg 1987
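The VQ score can be sketched as the total quantization distortion of the test frames against one speaker's dictionary. The sketch assumes the dictionaries are already trained (training, e.g. by k-means, is not shown), and the speaker names and vectors are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def vq_distortion(codebook, frames):
    """Sum over test frames of the squared distance to the nearest codeword."""
    return sum(min(squared_distance(y, c) for c in codebook) for y in frames)

def identify(codebooks, frames):
    """Return the speaker whose codebook quantizes the frames with least distortion."""
    return min(codebooks, key=lambda name: vq_distortion(codebooks[name], frames))

# Hypothetical two-codeword dictionaries for two speakers.
codebooks = {
    "speaker_1": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_2": [[5.0, 5.0], [6.0, 4.0]],
}
test_frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1]]
best = identify(codebooks, test_frames)
distortion_1 = vq_distortion(codebooks["speaker_1"], test_frames)
```

Unlike DTW, this score ignores the order of the frames, which is why the approach also suits text-independent operation.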
Page 16
Hidden Markov Models (HMM)
$$D(X,Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$$
[Diagram: the utterance “Bonjour” of test speaker Y is aligned against the “Bonjour” HMM of each speaker X (speaker 1, speaker 2, ..., speaker n)]
Rosenberg 1990, Tseng 1992
Page 17
Ergodic HMM
$$D(X,Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$$
[Diagram: the utterance “Bonjour” of test speaker Y is scored against the ergodic HMM of each speaker X (HMM of speaker 1, speaker 2, ..., speaker n)]
Poritz 1982, Savic 1990
Page 18
Gaussian Mixture Models (GMM)
REYNOLDS 1995
Page 19
HMM structure depends on the application
Page 20
Some issues in text-dependent speaker verification systems: the CAVE and PICASSO projects
• Sequences of digits
  - Speaker-independent HMM of each digit
  - Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment)
  - EER of less than 1 % can be achieved
• Customized password
  - The client chooses his password using some feedback from the system
• Deliberate imposture
Page 21
Gaussian Mixture Model
• Parametric representation of the probability distribution of observations:
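The formula from the slide did not survive extraction. The textbook form of a Gaussian mixture density is given here as a stand-in; the symbols (M components, weights w_k, means μ_k, covariances Σ_k, model λ) are assumed notation, not necessarily the slide's.

```latex
p(x \mid \lambda) = \sum_{k=1}^{M} w_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{M} w_k = 1
```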
Page 22
Gaussian Mixture Models
8 Gaussians per mixture
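A small sketch of how the likelihood of one feature vector under a Gaussian mixture can be evaluated. It assumes diagonal covariances (a common simplification, not stated on the slide) and uses a toy two-component mixture rather than 8 Gaussians; all names are illustrative.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under the mixture, computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy one-dimensional mixture with two equally weighted components.
weights = [0.5, 0.5]
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]
ll_zero = gmm_log_likelihood([0.0], weights, means, variances)
ll_far = gmm_log_likelihood([6.0], weights, means, variances)
```

The log-likelihood of a whole utterance is then the sum of the per-frame values, under the usual assumption that frames are independent.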
Page 23
GMM speaker modeling
[Diagram: front-end → GMM modeling → world GMM model]
[Diagram: front-end → GMM model adaptation → target GMM model]
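The slide does not say how the world model is adapted to the target speaker. One widely used option is MAP adaptation of the component means only, sketched below under that assumption; the relevance factor of 16, the diagonal covariances and all names are illustrative choices.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given frame x."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Shift world-model means toward the client data (weights, variances kept)."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp
    sums = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        for k, p in enumerate(posteriors(x, weights, means, variances)):
            counts[k] += p
            for d in range(dim):
                sums[k][d] += p * x[d]
    adapted = []
    for k in range(n_comp):
        alpha = counts[k] / (counts[k] + relevance)  # data-dependent mixing weight
        data_mean = [s / counts[k] for s in sums[k]] if counts[k] > 0 else means[k]
        adapted.append([alpha * dm + (1 - alpha) * m
                        for dm, m in zip(data_mean, means[k])])
    return adapted

# Toy world model and hypothetical client frames.
world_weights = [0.5, 0.5]
world_means = [[-2.0], [2.0]]
world_vars = [[1.0], [1.0]]
client_frames = [[3.0]] * 16
target_means = map_adapt_means(client_frames, world_weights, world_means, world_vars)
```

Components that see little client data stay close to the world model, which is what makes adaptation workable with only a couple of minutes of enrolment speech.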
Page 24
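Baseline GMM method
[Diagram: test speech → front-end → scoring against the hypothesized target GMM model and the world GMM model → LLR score]
$$\text{LLR score} = \log\left[\frac{P(x \mid \text{target model})}{P(x \mid \text{world model})}\right]$$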
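A sketch of the log-likelihood ratio (LLR) score, assuming one-dimensional GMMs given as lists of (weight, mean, variance) and a score averaged over frames. The averaging, the models and the frame values are assumptions for illustration only.

```python
import math

def gmm_log_pdf(x, components):
    """log p(x) for a 1-D GMM given as a list of (weight, mean, variance)."""
    terms = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in components
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def llr_score(frames, target, world):
    """Average per-frame log[p(x | target) / p(x | world)]."""
    total = sum(gmm_log_pdf(x, target) - gmm_log_pdf(x, world) for x in frames)
    return total / len(frames)

# Hypothetical models: the target differs from the world in one component mean.
world = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
target = [(0.5, -2.0, 1.0), (0.5, 3.0, 1.0)]
client_frames = [2.8, 3.1, 3.3, -2.1]
impostor_frames = [1.2, 1.0, -1.9, 0.8]
client_score = llr_score(client_frames, target, world)
impostor_score = llr_score(impostor_frames, target, world)
```

The claim is accepted when the score exceeds a threshold chosen on development data.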
Page 25
Decision theory for identity verification
• Two types of errors:
  - False rejection (a client is rejected)
  - False acceptance (an impostor is accepted)
• Decision theory: given an observation O and a claimed identity
  - H0 hypothesis: it comes from an impostor
  - H1 hypothesis: it comes from our client
• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as
$$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$$
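The decision rule can be checked numerically. The sketch below applies the likelihood-ratio test and, separately, compares the two posteriors directly; the likelihood and prior values are made up for the example.

```python
def accept_claim(lik_client, lik_impostor, prior_client):
    """Choose H1 iff P(O|H1) / P(O|H0) > P(H0) / P(H1)."""
    prior_impostor = 1.0 - prior_client
    return lik_client / lik_impostor > prior_impostor / prior_client

def posterior_client(lik_client, lik_impostor, prior_client):
    """P(H1|O) from Bayes' law, for cross-checking the ratio test."""
    joint_client = lik_client * prior_client
    joint_impostor = lik_impostor * (1.0 - prior_client)
    return joint_client / (joint_client + joint_impostor)

# Same observation (likelihood ratio 8), two different priors (hypothetical values).
decision_low_prior = accept_claim(0.08, 0.01, prior_client=0.1)   # threshold is 9
decision_high_prior = accept_claim(0.08, 0.01, prior_client=0.2)  # threshold is 4
post_low = posterior_client(0.08, 0.01, prior_client=0.1)
post_high = posterior_client(0.08, 0.01, prior_client=0.2)
```

The same likelihood ratio leads to opposite decisions under the two priors, which is why the threshold has to be set for the intended application rather than fixed once.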
Page 26
Signal detection theory
Page 28
Distribution of scores
Page 29
Detection Error Tradeoff (DET) Curve
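A sketch of how the points of a DET curve can be obtained from client and impostor scores: sweep the threshold, record the false acceptance and false rejection rates, and map both rates to a normal-deviate (probit) scale for plotting. The scores are invented, and the equal error rate is simply taken at the threshold where the two rates are closest.

```python
from statistics import NormalDist

def error_rates(client_scores, impostor_scores, threshold):
    """Return (FA rate, FR rate) when claims with score >= threshold are accepted."""
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def det_points(client_scores, impostor_scores):
    """(threshold, FA, FR) for every observed score used as a threshold."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [(t, *error_rates(client_scores, impostor_scores, t)) for t in thresholds]

def equal_error_rate(points):
    """Approximate EER: the point where FA and FR are closest."""
    t, fa, fr = min(points, key=lambda p: abs(p[1] - p[2]))
    return (fa + fr) / 2.0, t

def probit(p, eps=1e-6):
    """Normal-deviate warping used for DET axes; p is clipped away from 0 and 1."""
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))

# Hypothetical scores.
client_scores = [2.0, 1.5, 1.2, 0.4, 1.8]
impostor_scores = [-1.0, 0.1, 0.6, -0.3, -0.8]
points = det_points(client_scores, impostor_scores)
eer, eer_threshold = equal_error_rate(points)
det_axes = [(probit(fa), probit(fr)) for _, fa, fr in points]
```

On the probit scale, Gaussian score distributions give a straight line, which makes systems easier to compare than on a linear ROC plot.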
Page 30
Evaluation
• Decision cost (FA, FR, priors, costs, ...)
• Receiver Operating Characteristic (ROC) curve
• Reference systems (open software)
• Evaluations (algorithms, field trials, ergonomics, ...)
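A decision cost combines the two error rates with the prior of a genuine claim and the cost of each error. The sketch below uses the usual weighted-sum form; the default prior and costs are illustrative assumptions, since the slide gives no values.

```python
def detection_cost(fr_rate, fa_rate, p_target=0.01, cost_fr=10.0, cost_fa=1.0):
    """Expected cost of a decision: weighted sum of FR and FA rates.

    p_target is the prior probability that a claim is genuine; the default
    values are assumptions chosen for the example.
    """
    return cost_fr * fr_rate * p_target + cost_fa * fa_rate * (1.0 - p_target)

# Two hypothetical operating points of the same system.
cost_a = detection_cost(fr_rate=0.20, fa_rate=0.05)
cost_b = detection_cost(fr_rate=0.40, fa_rate=0.01)
```

With a low target prior, false acceptances dominate the cost, so the operating point with fewer false acceptances is cheaper here even though it rejects more clients.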
Page 31
NIST Speaker Verification Evaluations
• A reference standard to compare algorithms and stimulate new developments
• Distribution (via LDC) of development and test databases with:
  - Increasing difficulty (from land line to mobile)
  - Several hundred speakers (2 min of training data per client)
  - Several thousand test accesses (5 to 50 s per access)
• Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ., ELISA consortium, ...)
• Annual workshop, special issues in journals, ...
Page 32
National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
• Annual evaluation since 1995
• Common paradigm for comparing technologies
Page 33
Speaker Verification (text independent)
• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
Page 34
NIST evaluations : Results
ENST 2003
Page 35
Evaluations: NIST 2004
Page 36
Combining Speech Recognition and Speaker Verification
• Speaker independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
Page 37
ALISP: Automatic Language Independent Speech Processing
Data-driven speech segmentation
Page 38
Searching in client and world speech dictionaries for speaker verification purposes
Page 40
Fusion results
Page 41
Voice Transformations and Forgery (occasional, dedicated)
• Isolated individuals with few resources, or “professional impostors” with a dedicated budget, can threaten the security of speaker recognition systems
• Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
• Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
  - Prevention by predicting many different forgery scenarios
Page 42
Voice Forgery using ALISP
Transformation: a modification of a source speaker's speech to imitate a target speaker
[Diagram: impostor speech and client speech (the same words or not in each case) feed the transformation]
Page 43
Conversion system: ALISP encoder
[Block diagram. Input: speech. Blocks: MFCC analysis (MFCC + delta), HMM recognition (using HMM models), HNM analysis (harmonic envelope, noise envelope), choice of the best representative unit (from a database of HNM representatives). Labelled signals: symbol index, representative index, DTW path, prosody (energy + pitch)]
Page 44
Conversion system: ALISP decoder
[Block diagram. Inputs: symbol index, representative index, DTW path, prosody (pitch, energy, timing). Blocks: concatenation of HNM parameters for each representative, then HNM synthesis. Output: speech signal]
Page 45
Preliminary results: DET curves
• FA before forgery: 16 ± 2.0 % (1700 files)
• FA after forgery: 26 ± 2.0 % (1700 files)
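The slide does not say how the ± 2.0 % margins were obtained. One common way to get margins of this size is the normal approximation to a binomial proportion, sketched below under the assumptions that the 1700 files are independent trials and that a 95 % level (z = 1.96) is intended.

```python
import math

def binomial_half_width(rate, n_trials, z=1.96):
    """Half-width of the normal-approximation confidence interval for a rate."""
    return z * math.sqrt(rate * (1.0 - rate) / n_trials)

# Error rates and trial count taken from the slide; the method itself is an assumption.
margin_before = binomial_half_width(0.16, 1700)
margin_after = binomial_half_width(0.26, 1700)
intervals_overlap = (0.16 + margin_before) >= (0.26 - margin_after)
```

Under these assumptions both margins come out close to 2 %, and the two intervals do not overlap, so the rise in false acceptances after forgery is unlikely to be a sampling effect.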
Page 46
Preliminary results
True distributions
Page 47
Multimodal Identity Verification
• M2VTS (face and speech)
  - front view and profile
  - pseudo-3D with coherent light
• BIOMET (face, speech, fingerprint, signature, hand shape)
  - data collection
  - reuse of the M2VTS and DAVID databases
  - experiments on the fusion of modalities
Page 48
Speaking Faces: Motivations
• In many situations a video sequence is acquired
• Fusion of face and speech increases robustness
• Forgery is more difficult
Page 49
Talking Face Recognition (hybrid verification)
Page 50
Lip features
• Tracking lip movements
Page 51
A talking face model
• Using Hidden Markov Models (HMMs)
Acoustic parameters
Visual parameters
Page 52
Imposture Model
Page 54
Conclusions, Perspectives
• Deliberate imposture is a challenge for speech-only systems
• Verification of identity based on features extracted from talking faces should be developed
• Common databases and evaluation protocols are necessary
• Free access to reference systems will facilitate future developments
Page 55
BioSecure Residential Workshop
• Aug. 1st - 26th, 2005 at ENST, Paris
• Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, ...
• Comparative evaluations on large databases (BIOMET, BANCA, FVC, ...)
• Fusion of modalities
http://www.biosecure.info