Speaker and Speech Recognition: Speaker Recognition and ...info.ee.surrey.ac.uk/Teaching/Courses/eem.ssr/Speaker1_JK09.pdf · 2 3 Introduction Person identification is crucial to


Centre for Vision, Speech and Signal Processing

Speaker and Speech Recognition: Speaker Recognition and Verification

Josef Kittler Centre for Vision, Speech and Signal Processing

University of Surrey, Guildford GU2 7XH

[email protected]

www.ee.surrey.ac.uk/Personal/J.Kittler/lecturenotes

OUTLINE

• Introduction
• Terminology
• Problem formulation
• Speech representation
• Text independent methods
• Text dependent methods

Introduction

• Person identification is crucial to the fabric of society
  - Security
  - Access to services
  - Business transactions
  - Law enforcement
  - Border control

Establishing Identity

• One or more of the following
  - What the entity knows (e.g. password)
  - What the entity has (e.g. badge, smart card)
  - What the entity is (e.g. fingerprints, retinal characteristics)
  - Where the entity is (e.g. in front of a particular terminal)

Authentication Overview

Authenticator type | Proof              | Defense        | Traditional example | Digital example    | Flaw
Secret             | Secrecy, obscurity | Closely kept   | Combo lock          | Computer password  | Less secret with each use
Token              | Possession         | Closely held   | Metal key           | Key-less car entry | Insecure if lost
ID                 | Uniqueness         | Copy-resistant | Driver's license    | Biometric access   | Difficult to replace

Authentication Overview

• Authenticator Subtypes:
1. Secret
   - secrecy, e.g., password
   - obscurity, e.g., mother’s maiden name, SSN
2. Token
   - active, e.g., synchronised password generator
   - passive, e.g., smart card password storage
3. ID
   - inalterable, e.g., fingerprint, face, hand, eye
   - alterable biometric signal, e.g., voice, keystroke, signature

Biometric functionalities

• Biometrics: a means to prevent identity theft
• Biometric functionalities
  - Verification
  - Identification
  - Screening (watch list)
  - Retrieval (detection of multiple identities)
  - Negation of denial

Historical notes

• Bertillon system
• Advantages of biometrics
  - Convenience
  - Fraud reduction
• Disadvantages
  - Output is a score
  - Cannot be replaced
  - Open to attack
  - Privacy concerns

Application Characteristics

• Cooperative / non-cooperative
• Scalable / non-scalable
• Private / public
• Closed / open
• High-level / low-level security
• Habituated / non-habituated

Selection criteria

• Universality (all users possess this biometric)
• Permanence
• Uniqueness
• Collectability
• Acceptability
• Open to attack
• Failure to enroll

Introduction

• Voice based speaker recognition/verification: why?
• Voice is one of the main biometric modalities used by humans for personal identity recognition and verification.
  - User friendly
  - Natural interface
  - Conveying emotion
  - Can be used over telephone

Terminology

  Speaker identification -- using utterances from a speaker, determine who he/she is out of a set of known speakers

  Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)

  Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify a speaker. Applies to both SI/SV.


Voice biometrics properties

• Biometric Signal (Voice)

Speaker verification is very compelling:
• Voice is convenient.
• Voice is ubiquitous.
• Voice is inexpensive.
• Voice provides challenge-response security.

Unfortunately:
• It is sometimes inconvenient because it does not work well ubiquitously, and requires a back up solution

Voice Recognition Applications

• Voice based identity verification for
  - Web banking
  - Physical access control
  - Border control
  - Identity cards/driver’s licence
  - Personalisation
• Entertainment
  - Robotic toys
• Law enforcement/surveillance
  - Telephone surveillance

Application to web banking

[Figure: web banking architecture. Labels: Internet, Client, Smart Card, Card Reader, Microphone, Server, Services Provider, Private Network]

Biometric Applications

Forensic: Criminal investigation
Government: Identity card/passport, Driver’s licence, Social security
Commercial: ATM, E-commerce/banking, Nomadic working, Mobile phone, Personalisation

Market Outlook for the Future

* Voice ID: Applications and Markets for the New Millennium 1999 J. Markowitz, Consultants

[Figure: taxonomy of speech processing. Labels: Speech Processing, Speech Recognition, Speech Synthesis, Digitized Speech Input, Voice Biometrics, Speaker Verification, Speaker Identification, …]

Voice template

• Set of features extracted from the raw biometric data during enrolment
• Represents typical values of voice biometric
• Multiple templates may be stored to account for intra class variability
• Template issues
  - Aging (maintenance)
  - Central/distributed storage
  - Privacy and protection

Deficiencies of existing commercial solutions

• Sensitive to speech acquisition conditions
• Sensitive to background noise
• Sensitive to emotional state
• Sensitive to physical state of the user
• Quite effective for closed set applications under tightly controlled voice acquisition conditions

Main causes of acoustic variation in speech

• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Task/Context: man-machine dialogue, dictation, free conversation
• Phonetic/prosodic context
• Noise: other speakers, background noise, reverberations
• Channel: distortion, noise, echoes, dropouts
• Microphone: distortion, electrical noise, directional characteristics

[Figure: the sources of variation above shown as inputs to the speaker recognition system]

Voice recognition challenge

• Large number of classes
• Segmentation
• Noise and distortion
• Variability of deployed microphones (interoperability)
• Population coverage and scalability
• System performance
• System attacks
• Aging
• Non-uniqueness of biometrics
• Privacy concerns (smart cards)

System Attack

[Figure: points of attack on a verification system. Labels: impersonation, speech data, features, template, decision, MFCC, GMM]

Generic Architecture

[Figure: block diagram of a generic system. Labels: Silence detection, Energy norm., Feature extraction, Recognition. Related processes: Background noise removal, Phoneme/speech recognition]

Privacy issues

• Template protection
• Biometrics can be used to track people (secretly), a violation of their right to privacy (big brother)
• Biometric data may be used for other than intended purposes
• Biometric database linking

Evaluation protocol

• Test data and procedures adopted to evaluate a biometric system
• Evaluation should be conducted by an independent body
• Test and biometric data used should not have previously been seen by the system
• Data use cases
  - Training set
  - Evaluation set
  - Test set
• Data set sizes should allow statistically significant evaluation

XM2VTS database

• XM2VTS database
  - Face images and speech recordings of 295 people
  - Subjects recorded in 4 sessions
• Lausanne Protocol

Performance Evaluation

• Performance criteria
  - Failure to enrol
  - Accuracy
  - Speed
  - Storage
  - Costs
  - Ease of use
  - Failure to acquire

Accuracy

• Measured in terms of
  - False rejections/identification
  - False acceptances (falsely accepted users are impostors)
• Performance characterisation issues
  - Genuine ambiguity
  - Confidence
  - Competence

Performance characterisation (verification)

• False rejection
• False acceptance
• Total error rate/Half total error rate
• Operating point
  - Equal error rate (civilian)
  - Zero false acceptance (high security, forensic)
• Test set/evaluation set
• Receiver operating characteristic
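
The quantities listed on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the course's implementation: the function names and the toy scores are hypothetical, and scores are taken to be distance-type (a claim is accepted when its score is at or below the threshold). The threshold is picked on an evaluation set as the candidate where the false acceptance and false rejection rates are closest, which approximates the equal error rate operating point.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for distance-type scores: accept when score <= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s <= threshold)
    false_rejects = sum(1 for s in client_scores if s > threshold)
    return false_accepts / len(impostor_scores), false_rejects / len(client_scores)


def half_total_error_rate(far, frr):
    """HTER is the average of the two error rates."""
    return 0.5 * (far + frr)


def equal_error_threshold(client_scores, impostor_scores):
    """Pick the observed score at which |FAR - FRR| is smallest."""
    candidates = sorted(set(client_scores) | set(impostor_scores))

    def gap(t):
        far, frr = error_rates(client_scores, impostor_scores, t)
        return abs(far - frr)

    return min(candidates, key=gap)


# Hypothetical evaluation-set scores (distances), for illustration only.
eval_clients = [0.2, 0.4, 0.5, 0.9]
eval_impostors = [0.6, 0.8, 1.1, 1.3]

threshold = equal_error_threshold(eval_clients, eval_impostors)
far, frr = error_rates(eval_clients, eval_impostors, threshold)
hter = half_total_error_rate(far, frr)
```

In a protocol with separate evaluation and test sets, a threshold found this way would be applied unchanged to the test-set scores when reporting the half total error rate.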


Performance characterisation (identification)

Confusion matrix
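
As a small illustration of how a confusion matrix summarises identification results, the sketch below counts how often each true speaker is assigned to each candidate identity. The speaker labels and decisions are made up for the example, and `confusion_matrix` is a hypothetical helper, not something defined in the slides.

```python
from collections import Counter


def confusion_matrix(true_ids, predicted_ids, speakers):
    """Rows are true speakers, columns are the identities returned by the system."""
    counts = Counter(zip(true_ids, predicted_ids))
    return [[counts[(t, p)] for p in speakers] for t in speakers]


speakers = ["A", "B", "C"]
true_ids = ["A", "A", "B", "B", "C", "C"]
predicted_ids = ["A", "B", "B", "B", "C", "A"]

matrix = confusion_matrix(true_ids, predicted_ids, speakers)
# Correct identifications lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(speakers))) / len(true_ids)
```

Off-diagonal entries show which pairs of speakers the system confuses, which a single accuracy figure hides.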


Reference

1. Douglas A. Reynolds, Thomas F. Quatieri, Robert B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10 (2000), 19-41.
2. Martin A., Doddington G., Kamm T., Ordowski M., Przybocki M., The DET curve in assessment of detection task performance. Eurospeech 97, 1895-1898.

Preprocessing and feature extraction

[Figure: block diagram. Labels: A/D Converter, Silence Detector, Hamming Window, LPC Analysis. Output: cepstrum and delta cepstrum coefficients]

Speech input and spectra

[Figure: speech input and spectra for a client and an impostor]

Speech representation

• MFCC feature vectors (24 filterbank analysis), with delta coefficients and delta log-energy appended (2 coefficient-window)
• 33 component feature vectors
• Energy normalisation:
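
The delta coefficients mentioned above are commonly computed by a regression over neighbouring frames. The sketch below is one such computation, assuming that the "2 coefficient-window" means a window of two frames on either side and that edge frames are handled by repeating the first or last frame; both are assumptions, as are the function name and the toy data.

```python
def delta(frames, window=2):
    """Regression-based delta over +/- `window` frames.

    frames: list of per-frame coefficient lists.
    Edge frames are handled by repeating the first/last frame (an assumption).
    """
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for k in range(dim):
            num = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Toy input: 5 frames of 2 coefficients that grow linearly with time.
frames = [[float(t), 2.0 * t] for t in range(5)]
d = delta(frames)

# Append the deltas to the static coefficients, frame by frame.
features = [static + dyn for static, dyn in zip(frames, d)]
```

For a coefficient that changes linearly over time, the delta in the interior of the sequence equals its slope, which is a quick sanity check on the regression weights.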

[Figure: comparison of spectra. Curves: FFT-based signal spectrum, LP spectrum, spectrum derived from LP-cepstrum, cepstral processing spectrum. Axes: Hz (horizontal), Amplitude (dB) (vertical)]

Speaker verification problem

  Consider that the system has been trained using samples of the input waveform provided by the client.

  Each sample is represented by a feature vector

  The training speech segment is long enough to create a representative model for each client i and for all speakers

Hypothesis testing

  Now a test speech segment is acquired from a speaker claiming to be client

  Given a feature vector corresponding to waveform sample , the probability that the claim is true is given as
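
The expression that followed on the slide did not survive extraction. Under assumed notation (not taken from the slides), with $x$ the feature vector, $\omega_i$ the hypothesis that the speaker is client $i$, and $\bar{\omega}_i$ the hypothesis that the speaker is someone else, Bayes' rule gives a posterior of the following standard form. In practice the density under $\bar{\omega}_i$ would be approximated by the model trained on all speakers.

```latex
P(\omega_i \mid x) =
  \frac{p(x \mid \omega_i)\, P(\omega_i)}
       {p(x \mid \omega_i)\, P(\omega_i) + p(x \mid \bar{\omega}_i)\, P(\bar{\omega}_i)}
```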

 


Likelihood Ratio

  The claim will be accepted if

  Assuming the priors are equal, the test becomes

  This is also referred to as the likelihood ratio

  We base the decision on more than one sample, hence on
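
The inequalities on this slide are missing from this copy. A plausible reading, in assumed notation where $\omega_i$ is the client hypothesis, $\bar{\omega}_i$ the impostor hypothesis and $X = \{x_1, \dots, x_N\}$ the set of test samples, is the usual Bayes decision and its equal-prior likelihood ratio form:

```latex
% Accept the claim when the client posterior is the larger one
P(\omega_i \mid x) > P(\bar{\omega}_i \mid x)

% With equal priors this reduces to a likelihood ratio test
\frac{p(x \mid \omega_i)}{p(x \mid \bar{\omega}_i)} > 1

% Decision based on the whole set of samples
\frac{p(X \mid \omega_i)}{p(X \mid \bar{\omega}_i)} > 1
```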


  Assuming the samples are independent and identically distributed, we can express the joint probability density (i.e. likelihood) in terms of marginals as

  Taking a log we find the loglikelihood

  And the loglikelihood ratio as
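
The three expressions referred to above are missing here. Under the stated i.i.d. assumption they would take the following standard form; the symbols ($x_j$ for the $j$-th sample, $N$ for the number of samples, $\omega_i$ and $\bar{\omega}_i$ for the client and impostor hypotheses, $\Lambda$ for the ratio) are assumed notation rather than the slide's own.

```latex
% Joint likelihood as a product of marginals
p(X \mid \omega_i) = \prod_{j=1}^{N} p(x_j \mid \omega_i)

% Log-likelihood
\log p(X \mid \omega_i) = \sum_{j=1}^{N} \log p(x_j \mid \omega_i)

% Log-likelihood ratio; with equal priors the claim is accepted when it is positive
\Lambda(X) = \sum_{j=1}^{N} \log p(x_j \mid \omega_i)
           - \sum_{j=1}^{N} \log p(x_j \mid \bar{\omega}_i)
```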


Discussion

  The independence assumption may not be satisfied in practice

  The log likelihood is a function of the number of samples. It may be desirable to perform normalisation by the factor 1/N, where N is the number of samples

  For a large (infinite) sample, the summation will asymptotically become ‘integration’ where is the test sample density

  Loglikelihood has a meaning only in relative sense
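
The limit referred to above is not shown in this copy. In assumed notation, with $N$ the number of test samples and $p(x)$ the density of the test samples, the normalised log-likelihood would tend to an expectation under $p(x)$:

```latex
\frac{1}{N} \sum_{j=1}^{N} \log p(x_j \mid \omega_i)
  \;\longrightarrow\;
  \int p(x)\, \log p(x \mid \omega_i)\, dx
  \qquad (N \to \infty)
```

Read this way, the score depends on how well the model density matches the test density, which is one reason its absolute value is hard to interpret on its own.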


Gaussian model

  Assuming

   

  The feature vectors are assumed to be normally distributed with mean and the covariance matrix
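
The density itself is missing from this copy of the slide. For reference, the multivariate normal density in assumed notation, with $d$ the feature dimension, $\mu_i$ the mean and $\Sigma_i$ the covariance matrix of client $i$, is:

```latex
p(x \mid \omega_i) =
  \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}
  \exp\!\left\{ -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right\}
```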


  Substituting and taking natural log of the client density we find

  The left hand side of the inequality can be expanded as


  The first term can be rewritten as

  Thus the decision rule finally becomes

  Its symmetric form
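
The slide's own expressions are not recoverable here, so the exact decision rule and its symmetric form cannot be restated. The identity that a derivation of this kind typically relies on can be sketched as follows, in assumed notation: $m$ and $S$ are the mean and covariance of the $N$ probe samples, and $\mu_i$, $\Sigma_i$ are the model parameters.

```latex
% Log of the Gaussian client density for one sample
\log p(x_j \mid \omega_i) =
  -\tfrac{1}{2} (x_j - \mu_i)^{T} \Sigma_i^{-1} (x_j - \mu_i)
  -\tfrac{1}{2} \log |\Sigma_i| - \tfrac{d}{2} \log 2\pi

% Probe statistics
m = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad
S = \frac{1}{N} \sum_{j=1}^{N} (x_j - m)(x_j - m)^{T}

% The averaged quadratic term splits into a covariance part and a mean part
\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu_i)^{T} \Sigma_i^{-1} (x_j - \mu_i)
  = \operatorname{tr}\!\left( \Sigma_i^{-1} S \right)
  + (m - \mu_i)^{T} \Sigma_i^{-1} (m - \mu_i)
```

The split follows from writing $x_j - \mu_i = (x_j - m) + (m - \mu_i)$; the cross terms sum to zero because the deviations from $m$ sum to zero.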


Notes

  The number of samples for both model and probe should be as large as possible

  Ignoring means we have

  If model and probe match, , the product of the two matrices is an identity matrix, i.e. isotropic distribution. Hence the matching criterion measures ‘sphericity’.

  It is a sphericity measure
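
To make the sphericity idea concrete, here is a small Python sketch of one covariance-only, symmetric mismatch measure. It is an assumption about the general shape of such a criterion, not the formula from the slide: it sums the traces of the two "whitened" products and subtracts the value that sum takes when both products are the identity. The 2x2 helpers and the example matrices are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def trace(m):
    return m[0][0] + m[1][1]


def symmetric_covariance_mismatch(model_cov, probe_cov):
    """tr(M^-1 P) + tr(P^-1 M) - 2d: zero when the covariances match."""
    dim = len(model_cov)
    forward = trace(matmul2(inv2(model_cov), probe_cov))
    backward = trace(matmul2(inv2(probe_cov), model_cov))
    return forward + backward - 2 * dim


matching = symmetric_covariance_mismatch([[2.0, 0.5], [0.5, 1.0]],
                                         [[2.0, 0.5], [0.5, 1.0]])
different = symmetric_covariance_mismatch([[2.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
```

When the model and probe covariances are equal, both products are the identity and the measure is zero; any departure from the identity makes it positive.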


Other sphericity measures

  In essence, in matching a probe and a model, we are measuring the distance between two Gaussian probability densities

  Any feature selection criterion could be used for that purpose

  The derived sphericity measure resembles divergence

  Bhattacharyya measure
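
For reference, the textbook closed forms of the two measures named above for a pair of Gaussian densities are given below. The notation ($\mu_1, \Sigma_1$ and $\mu_2, \Sigma_2$ for the two densities, $\Delta$ for the mean difference) is assumed here and may differ from the lecture's.

```latex
\Delta = \mu_1 - \mu_2

% Divergence (symmetric Kullback-Leibler)
J_D = \tfrac{1}{2} \operatorname{tr}\!\left( \Sigma_1^{-1} \Sigma_2 + \Sigma_2^{-1} \Sigma_1 - 2I \right)
    + \tfrac{1}{2}\, \Delta^{T} \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right) \Delta

% Bhattacharyya distance
J_B = \tfrac{1}{8}\, \Delta^{T} \left[ \tfrac{1}{2} (\Sigma_1 + \Sigma_2) \right]^{-1} \Delta
    + \tfrac{1}{2} \ln \frac{\left| \tfrac{1}{2} (\Sigma_1 + \Sigma_2) \right|}
                            {\sqrt{|\Sigma_1|\, |\Sigma_2|}}
```

With equal means, the divergence reduces to its trace term, which depends only on the two covariance matrices.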


Decision threshold

• The matching process maps multidimensional speech data into 1D space
• In theory, the decision threshold could be derived from the known parameters
• In practice
  - The distributions will not be exactly Gaussian
  - The parameters are estimates subject to error
  - We may wish to control the trade-off between false acceptances and false rejections
• Hence, the decision threshold is determined
  - Experimentally
  - By modelling score distributions

If s > Threshold: reject the claimant
If s ≤ Threshold: accept the claimant

[Figure: score axis s, with the accept region below the threshold and the reject region above it]

• The selected threshold defines an operating point

ROC curve

• ROC: receiver operating characteristic
• Defines a relationship between the operating point, false acceptances and false rejections in verification
• DET curve: the same trade-off replotted with both error rates on a normal-deviate scale (it resembles a log-scale ROC)
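
The relationship between operating point and error rates can be traced by sweeping the threshold. The sketch below does this for hypothetical distance-type scores (accept when the score is at or below the threshold) and then maps each error rate through the inverse standard normal CDF, the normal-deviate warping used for DET plots. The function names, the clipping constant and the data are assumptions for illustration.

```python
from statistics import NormalDist


def roc_points(client_scores, impostor_scores):
    """Return (threshold, FAR, FRR) for every observed score used as a threshold."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    points = []
    for t in thresholds:
        far = sum(s <= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s > t for s in client_scores) / len(client_scores)
        points.append((t, far, frr))
    return points


def probit(p, eps=1e-6):
    """Normal-deviate transform; rates of exactly 0 or 1 are clipped so it stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


clients = [0.2, 0.4, 0.5, 0.9]
impostors = [0.6, 0.8, 1.1, 1.3]

roc = roc_points(clients, impostors)
# DET coordinates: both axes are warped error rates.
det = [(probit(far), probit(frr)) for _, far, frr in roc]
```

Each tuple in `roc` is one operating point; plotting the `det` pairs instead of the raw rates spreads out the low-error region where systems are usually compared.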


Score normalisation

• It may be desirable to normalise the scores, e.g. for the purposes of fusion or threshold determination
• Possibilities
  - Map to posterior probabilities
  - Map to designated means, e.g. so that the client and impostor means coincide with –1 and +1 respectively
  - Map so that the variance of the score is normalised
  - Others

Score normalisation (cont)

 Min-max

 Scaling

 Z-score



  Median

  Double sigmoid

  Tanh

  Min-max, Z-score and tanh are efficient; median, double-sigmoid and tanh are robust
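
The formulas for these normalisations are not present in this copy, so the sketch below uses their commonly cited forms as an assumption. The tanh version here uses the plain mean and standard deviation rather than robust estimates, the double-sigmoid parameters `t`, `r1`, `r2` are hypothetical tuning values, and the "Scaling" item is left out because its definition cannot be recovered from the slide.

```python
import math
from statistics import mean, median, pstdev


def min_max(scores):
    """Map scores linearly onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def z_score(scores):
    """Zero mean, unit standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def median_mad(scores):
    """Centre on the median, scale by the median absolute deviation."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return [(s - med) / mad for s in scores]


def tanh_norm(scores):
    """Squash scores into (0, 1) with a shallow tanh around the mean."""
    mu, sigma = mean(scores), pstdev(scores)
    return [0.5 * (math.tanh(0.01 * (s - mu) / sigma) + 1.0) for s in scores]


def double_sigmoid(scores, t, r1, r2):
    """Sigmoid centred at t with different slopes below (r1) and above (r2)."""
    out = []
    for s in scores:
        r = r1 if s < t else r2
        out.append(1.0 / (1.0 + math.exp(-2.0 * (s - t) / r)))
    return out


# Toy scores with one outlier, to show how the methods differ.
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
normalised = {
    "min_max": min_max(scores),
    "z_score": z_score(scores),
    "median_mad": median_mad(scores),
    "tanh": tanh_norm(scores),
    "double_sigmoid": double_sigmoid(scores, t=3.0, r1=1.0, r2=2.0),
}
```

With the outlier at 10.0, min-max compresses the other four scores into the lower third of the range, while the median-based version leaves their spacing unchanged, which illustrates the robustness remark above.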



 Mapping to designated means (for verification)
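
The mapping itself is not shown in this copy. One linear map that would send the client mean to $-1$ and the impostor mean to $+1$ is given below; $m_C$ and $m_I$ denote the client and impostor score means and are assumed notation.

```latex
s' = \frac{2\,(s - m_C)}{m_I - m_C} - 1
```

Substituting $s = m_C$ gives $s' = -1$ and $s = m_I$ gives $s' = +1$, so a threshold at $s' = 0$ sits midway between the two means.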


Score normalisation: a posteriori class probabilities

• A posteriori class probabilities are automatically normalised to [0,1]
• Some systems compute a matching score , rather than
• Scores have to be normalised to facilitate fusion by simple rules
  - a posteriori probability estimate

Score distribution modelling

• Probability density function of authentic claims and impostors can be estimated
  - Parametric/nonparametric pdfs
  - e.g. Gaussian pdf
• Standard deviation for true claims is likely to be smaller than for impostors
• For distance type scores, the mean of true claim scores is lower than the mean of impostors
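
A parametric version of this idea can be sketched with the Python standard library: fit one Gaussian to the true-claim scores and one to the impostor scores, then turn a new score into a posterior probability of a true claim. The score values, the equal prior and the function names are assumptions for illustration.

```python
from statistics import NormalDist


def fit_gaussian(scores):
    """Estimate a Gaussian score model from samples (mean and sample std. dev.)."""
    return NormalDist.from_samples(scores)


def posterior_client(score, client_pdf, impostor_pdf, prior_client=0.5):
    """Posterior probability that a score comes from a true claim (Bayes' rule)."""
    pc = client_pdf.pdf(score) * prior_client
    pi = impostor_pdf.pdf(score) * (1.0 - prior_client)
    return pc / (pc + pi)


# Hypothetical distance-type scores: true claims score lower and vary less.
client_scores = [0.2, 0.3, 0.4, 0.5, 0.6]
impostor_scores = [0.7, 1.0, 1.3, 1.6, 1.9]

client_pdf = fit_gaussian(client_scores)
impostor_pdf = fit_gaussian(impostor_scores)

p_low = posterior_client(0.4, client_pdf, impostor_pdf)
p_high = posterior_client(1.3, client_pdf, impostor_pdf)
```

The posterior lies in [0, 1] by construction, so it can serve as a normalised score, and a threshold on it corresponds to a threshold on the raw distance.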


Example

 
