Page 1: Bottleneck Features for Speaker Recognition

Sibel Yaman1, Jason Pelecanos1, and Ruhi Sarikaya2

1 IBM T. J. Watson Research Labs, Yorktown Heights, NY
2 Microsoft Corporation, Redmond, WA

Odyssey 2012: The Speaker and Language Recognition Workshop

Page 2: Roadmap

Introduction

Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training

Experiments

Summary

Page 3: The Big Picture

In the speech recognition literature:
– Deep networks are shown to outperform HMMs (Seide 2012, etc.).

In the speaker recognition literature:
– Many sites report ever-improving performance figures (Konig 1998, Garimella 2012).

Page 4: Bottleneck Network Architecture

[Figure: a feed-forward network with a narrow bottleneck layer. Input feature statistics (stacked raw features from 0.5 seconds of speech) are fed forward, and speaker information is back-propagated from the speaker targets at the output.]

An information bottleneck acts as a feature compressor (Konig 1998).

Page 5: Using Neural Networks for Speaker Recognition

Feature extraction with neural networks has traditionally performed relatively poorly.

We investigate approaches that turn this comparison the other way around.

[Figure: EER (%) and 1000 x min DCF (2008) bar charts for the same-mic and different-mic conditions, comparing MFCC features with traditional NN features; relative differences of 40.4%, 45.1%, 57.5%, and 46.1% are marked.]

Page 6: An Overview

We demonstrate two ways of exploiting the expressive power of deep networks:

1) The training is adjusted to the targeted performance evaluation metric.

2) Information from a separate system is incorporated in training.

[Figure: speech signal → spectral feature extraction → raw MFCCs → linear transformation via differentiation → MFCCs, Δ, ΔΔ, ... → nonlinear transformation via bottleneck networks → BN features]
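As a sketch of the linear differentiation stage above, here is the standard delta-regression formula applied to MFCCs; the window parameter N and the placeholder data are illustrative assumptions, not values from the paper:

```python
import numpy as np

def deltas(feats: np.ndarray, N: int = 2) -> np.ndarray:
    """Standard delta regression over a (frames x dims) feature matrix:

        d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)

    Boundary frames are handled by repeating the first and last frames."""
    T = feats.shape[0]
    padded = np.concatenate([np.repeat(feats[:1], N, axis=0),
                             feats,
                             np.repeat(feats[-1:], N, axis=0)], axis=0)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

# Static MFCCs plus first- and second-order deltas, as in the pipeline above.
mfcc = np.random.randn(300, 13)   # placeholder: 300 frames x 13 coefficients
features = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```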

Page 7: Roadmap

Introduction

Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training

Experiments

Summary

Page 8: Frame vs. Conversation Level Training

Frame level training has limitations:
– Learning the speaker is constrained to the context around the current frame.
– A long context would explode the number of free parameters.

Conversation level training offers solutions:
– The frames coming from one conversation are tied together so that a single decision is made.
– The network size can be kept relatively small.

Page 9: (1) A Speaker Recognition Training Criterion

A log-likelihood ratio-based training criterion (Brummer 2005) is optimized.

There is one target score and (S-1) nontarget scores at the output layer.

$$J_{LLR}(\Theta) = \frac{\alpha}{T} \sum_{u:\,\mathrm{target}} \log\left(1 + e^{-c\,u}\right) + \frac{\beta}{N} \sum_{u:\,\mathrm{nontarget}} \log\left(1 + e^{+c\,u}\right)$$

The first sum is the cost associated with target trials and the second the cost associated with nontarget trials; T and N are the numbers of target and nontarget trials, and α, β, and c are weighting and scale constants.
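A minimal numpy sketch of this cost; the defaults α = β = c = 1 and the example scores are illustrative assumptions:

```python
import numpy as np

def j_llr(target_scores, nontarget_scores, alpha=1.0, beta=1.0, c=1.0):
    """LLR-based cost: target trials are penalized for low scores and
    nontarget trials for high scores (a softplus on either side)."""
    t = np.asarray(target_scores, dtype=float)
    n = np.asarray(nontarget_scores, dtype=float)
    target_cost = (alpha / len(t)) * np.sum(np.log1p(np.exp(-c * t)))
    nontarget_cost = (beta / len(n)) * np.sum(np.log1p(np.exp(c * n)))
    return target_cost + nontarget_cost

# Well-separated scores give a small cost; overlapping scores a large one.
print(j_llr(target_scores=[4.0, 3.0], nontarget_scores=[-3.5, -5.0, -2.0]))
```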

Page 10: Conversation Level Training

We need a global constraint on the decision for the entire recording.

The scores are averaged at the output layer before the nonlinearity.

[Figure: the output-layer activations u_1^(L), u_2^(L), u_3^(L), ..., u_H^(L) computed at frames t = 0, 1, ..., T are averaged into a single set of activations covering t = 1:T, to which the output nonlinearity is then applied.]
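A minimal numpy sketch of this pooling step, assuming per-frame pre-nonlinearity activations stacked in a (frames x speakers) array and a sigmoid output nonlinearity (both shapes and the nonlinearity are illustrative assumptions):

```python
import numpy as np

def conversation_scores(frame_logits: np.ndarray) -> np.ndarray:
    """Average the pre-nonlinearity output activations over all frames of a
    conversation, then apply the output nonlinearity once, so a single
    decision is made per conversation rather than per frame."""
    pooled = frame_logits.mean(axis=0)          # (speakers,)
    return 1.0 / (1.0 + np.exp(-pooled))        # sigmoid after pooling

frame_logits = np.random.randn(500, 173)        # 500 frames x 173 speakers
print(conversation_scores(frame_logits).shape)  # (173,)
```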

Page 11: (2) Using a Separate System in Training

Scores from a separate system are incorporated in training.

[Figure: a standard MFCC system produces additional scores, which pass through calibration and feed into BN score generation.]

The term

$$u^{(l)}(\Theta) = \sigma\left(\mathbf{W}^{(l-1)} \mathbf{u}^{(l-1)}\right)$$

in the training objective is replaced with

$$u'_n(\Theta) = \omega_1\, \sigma\left(\mathbf{W}^{(l-1)} \mathbf{u}_n^{(l-1)}\right) + \omega_2\, u_n^{M} + \kappa,$$

where $u_n^{M}$ is the calibrated score from the separate MFCC system for trial n.
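A minimal numpy sketch of the modified output unit; the array shapes and example values are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combined_output(W, u_prev, u_mfcc, w1, w2, kappa):
    """Replace the plain output unit sigma(W u) with a weighted combination
    of the network's own score and a calibrated score u_mfcc from the
    separate MFCC system, plus an offset kappa."""
    return w1 * sigmoid(W @ u_prev) + w2 * u_mfcc + kappa

W = np.random.randn(1, 500)     # last-layer weights (1 score x 500 hidden)
u_prev = np.random.randn(500)   # previous-layer activations
print(combined_output(W, u_prev, u_mfcc=0.7, w1=0.8, w2=0.2, kappa=0.0))
```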

Page 12: Score Calibration

The additional scores should have a log-likelihood ratio interpretation.

The score calibration is achieved by solving

$$\{\omega_1^*, \omega_2^*, \kappa^*\} = \arg\min_{\omega_1, \omega_2, \kappa} J_{LLR}(\omega_1, \omega_2, \kappa \mid \Theta\ \mathrm{fixed}).$$

The network is trained by solving

$$\Theta^* = \arg\min_{\Theta} J_{LLR}(\Theta \mid \omega_1^*, \omega_2^*, \kappa^*\ \mathrm{fixed}).$$
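A minimal sketch of this two-stage alternation, using toy trial scores and a linear score combination as a stand-in for the full network term; the scores, the stand-in, and the choice of scipy's Nelder-Mead optimizer are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in scores; in the real system these come from the network
# (Theta fixed) and from the separate MFCC system, per trial.
net_t, mfcc_t = np.array([2.0, 1.5]), np.array([1.8, 2.2])      # target
net_n, mfcc_n = np.array([-1.0, -2.5]), np.array([-1.2, -2.0])  # nontarget

def j_llr_calib(params):
    """J_LLR as a function of (w1, w2, kappa), network held fixed."""
    w1, w2, kappa = params
    t = w1 * net_t + w2 * mfcc_t + kappa
    n = w1 * net_n + w2 * mfcc_n + kappa
    return np.mean(np.log1p(np.exp(-t))) + np.mean(np.log1p(np.exp(n)))

# Stage 1: solve for the calibration parameters with Theta fixed.
res = minimize(j_llr_calib, x0=[1.0, 1.0, 0.0], method="Nelder-Mead")
w1, w2, kappa = res.x
# Stage 2 (not shown): retrain the network parameters Theta by
# back-propagating J_LLR with (w1, w2, kappa) held fixed.
print(res.x)
```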

Page 13: The Back-End System

Bottleneck feature extraction feeds a state-of-the-art speaker recognition system:

Universal Background Model (UBM) training → MAP-adapted speaker modeling → UBM supervectors → dimension reduction → i-vectors → Probabilistic Linear Discriminant Analysis (PLDA) → recognition scores

Page 14: Roadmap

Introduction

Bottleneck feature extraction
1) A conversation level training criterion
2) Incorporating a separate system in training

Experiments

Summary

Page 15: Experiments

We ran experiments on the same and different microphone tasks of NIST SRE 2010.

Microphone recordings were used in bottleneck network training:
– 173 speakers in the training and validation sets
– 4341 recordings in training and 865 recordings in validation

Network architecture: 294-dimensional input → 1000 x 42 x 500 → 173 speakers
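A minimal PyTorch sketch of a network with these layer sizes, reading 1000 x 42 x 500 as three hidden layers with a 42-unit bottleneck; the sigmoid activations and class names are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """294-dim stacked input -> 1000 -> 42 (bottleneck) -> 500 -> 173 speakers."""
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(294, 1000)
        self.bottleneck = nn.Linear(1000, 42)   # narrow compression layer
        self.hidden2 = nn.Linear(42, 500)
        self.output = nn.Linear(500, 173)

    def forward(self, x):
        h = torch.sigmoid(self.hidden1(x))
        b = self.bottleneck(h)                  # linear bottleneck activations
        h2 = torch.sigmoid(self.hidden2(torch.sigmoid(b)))
        return self.output(h2)                  # speaker logits

    def extract(self, x):
        """Bottleneck features, taken before the nonlinearity (see Page 16)."""
        return self.bottleneck(torch.sigmoid(self.hidden1(x)))

net = BottleneckNet()
frames = torch.randn(8, 294)                    # 8 stacked-frame inputs
print(net(frames).shape, net.extract(frames).shape)  # (8, 173) (8, 42)
```

The extract method returns the 42-dimensional activations before the nonlinearity, matching the extraction point described on the next page.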

Page 16: Processing of the Input and Output Features of the Network

● Input features are mean and variance normalized to better condition the network.

● The bottleneck features are decorrelated for modeling with diagonal covariance GMMs.

[Figure: bottleneck network → bottleneck feature extraction (before the nonlinearity) → decorrelation with PCA → decorrelated bottleneck features]
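A minimal numpy sketch of the PCA decorrelation step; the fit/apply split and the placeholder data are illustrative:

```python
import numpy as np

def fit_pca(feats: np.ndarray):
    """Estimate a PCA rotation from (frames x dims) bottleneck features."""
    mean = feats.mean(axis=0)
    cov = np.cov(feats - mean, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)      # orthonormal eigenvectors
    return mean, eigvecs

def decorrelate(feats, mean, eigvecs):
    """Rotate features so their covariance is diagonal, which suits
    diagonal-covariance GMM modeling in the back-end."""
    return (feats - mean) @ eigvecs

bn = np.random.randn(10000, 42)           # placeholder 42-dim BN features
mean, rot = fit_pca(bn)
decorrelated = decorrelate(bn, mean, rot)
cov = np.cov(decorrelated, rowvar=False)
print(np.allclose(cov, np.diag(np.diag(cov)), atol=1e-8))  # True
```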

Page 17: Effect of the Training Criterion

[Figure: EER (%) and 1000 x min DCF (2008) bar charts for the same-mic and different-mic conditions, comparing MFCC, frame level, and conversation level systems; relative reductions of 30.0% and 34.2% (EER) and 36.4% and 30.0% (min DCF) are marked.]

Page 18: Dependence on Feature Size

[Figure: EER (%) and 1000 x min DCF (2008) bar charts for the same-mic and different-mic conditions, comparing bottleneck features of 42, 60, and 100 dimensions.]

Page 19: Performance when Trained with Information from a Separate System

[Figure: EER (%) and 1000 x min DCF (2008) bar charts for the same-mic and different-mic conditions, comparing MFCC, linear score combination, and incorporating a separate system; relative gains of 14.0% and 18.0% (EER) and 12.0% and 11.0% (min DCF) are marked.]

Page 20: Summary

1) We showed how to train a neural network for use in the front-end of a speaker recognition system.
– A conversation level training criterion that minimizes a log-likelihood ratio score-based cost function was developed.

2) We also showed how to use neural networks to exploit information from a separate system.

Page 21: Thank you!