Introduction Mapping Modeling Speaker Diarization Summary H. Aronowitz (IBM) Intra-Class Variability Modeling for Speech Processing June 08 1 Dr. Hagai.

Post on 22-Dec-2015



Dr. Hagai Aronowitz

IBM Haifa Research Lab

Presentation is available online at: http://aronowitzh.googlepages.com/

Intra-Class Variability Modeling for Speech Processing


Speech Classification: Proposed framework

Given labeled training segments from class + and class –, classify unlabeled test segments.

Classification framework

1. Represent speech segments in segment-space

2. Learn a classifier in segment-space
• SVMs
• NNs
• Bayesian classifiers
• …


Outline: Intra-Class Variability Modeling for Speech Processing

1 Introduction to GMM based classification

2 Mapping speech segments into segment space

3 Intra-class variability modeling

4 Speaker diarization

5 Summary


Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]

GMM based speaker recognition

Pr(Y|S) = ?

Estimate Pr(y_t|S):

1. Train a universal background model (UBM) GMM using EM

2. For every target speaker S: train a GMM G_S by applying MAP-adaptation

Assuming frame independence:

$$\Pr(y_1,\ldots,y_T \mid S) = \prod_{t=1}^{T} \Pr(y_t \mid S)$$

[Figure: the UBM and the adapted models Q1 (speaker #1) and Q2 (speaker #2), with component means μ1, μ2, μ3, drawn in the R^26 MFCC feature space]
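As an illustration of scoring under the frame independence assumption, the following is a minimal sketch assuming diagonal-covariance GMMs. The function names and array layout are hypothetical, and MAP-adaptation and UBM training are not shown.

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log Pr(y_t | GMM) for a diagonal-covariance GMM.

    frames: (T, D); weights: (G,); means, variances: (G, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, G, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (G,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, G)
    log_joint = np.log(weights) + log_comp
    # Stable log-sum-exp over the G components
    m = log_joint.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True)))[:, 0]

def session_loglik(frames, weights, means, variances):
    """log Pr(y_1..y_T | S): under frame independence this is a sum over frames."""
    return float(np.sum(gmm_frame_loglik(frames, weights, means, variances)))

# Toy example: log-likelihood ratio of a speaker model against the UBM
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 2))
ubm = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [3.0, 3.0]]), np.ones((2, 2)))
spk = (np.array([0.5, 0.5]), np.array([[0.1, -0.1], [3.0, 3.0]]), np.ones((2, 2)))
llr = session_loglik(frames, *spk) - session_loglik(frames, *ubm)
```

The cost is one pass over all T frames per model, which is the "linear in the length of the audio" behaviour the analysis slide refers to.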


GMM Based Algorithm - Analysis

1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency

2. GMM scoring is inefficient – linear in the length of the audio

3. GMM scoring does not support indexing


Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4

Definitions

X: training session for target speaker

Y: test session

Q: GMM trained for X

P: GMM trained for Y

Goal

Compute Pr(Y |Q) using GMMs P and Q only

Motivation

1. Efficient speaker recognition and indexing

2. More accurate modeling


Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4

$$\frac{1}{T}\log \Pr(Y \mid Q) = \frac{1}{T}\sum_{t=1}^{T}\log \Pr(y_t \mid Q) \approx \int \Pr(x \mid P)\,\log \Pr(x \mid Q)\,dx \equiv H(P,Q) \qquad (1)$$

Here H(P,Q) denotes the negative cross entropy between P and Q.

Approximating the cross entropy between two GMMs

1. Matching based lower bound [Aronowitz 2004]

2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]

3. Other options in [Hershey 2007]
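A minimal sketch of the matching based lower bound for diagonal-covariance GMMs follows. It assumes the standard closed form for the integral of one Gaussian times the log of another, and bounds the log of the mixture Q by its best-matching weighted component; the function names and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import numpy as np

def gaussian_neg_cross_entropy(mu_p, var_p, mu_q, var_q):
    """Exact integral of N(x; mu_p, var_p) * log N(x; mu_q, var_q) dx (diagonal covariances)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var_q) + ((mu_p - mu_q) ** 2 + var_p) / var_q)

def matching_lower_bound(p, q):
    """Lower bound on the integral of P(x) log Q(x) dx.

    p, q: tuples (weights (G,), means (G, D), variances (G, D)).
    """
    w_p, mu_p, var_p = p
    w_q, mu_q, var_q = q
    total = 0.0
    for g in range(len(w_p)):
        # Match component g of P to the best weighted component of Q
        best = max(
            np.log(w_q[j]) + gaussian_neg_cross_entropy(mu_p[g], var_p[g], mu_q[j], var_q[j])
            for j in range(len(w_q))
        )
        total += w_p[g] * best
    return float(total)
```

The bound is tight when the components of Q are well separated relative to those of P, since then a single component of Q dominates under each component of P.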


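Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4

Matching based approximation

$$H(P,Q) \approx \sum_{g=1}^{G} w_g^{P} \max_{j}\left[\log w_j^{Q} - \frac{1}{2}\sum_{d=1}^{D}\log\left(\sigma_{j,d}^{Q}\right)^{2} - \frac{1}{2}\sum_{d=1}^{D}\frac{\left(\mu_{g,d}^{P}-\mu_{j,d}^{Q}\right)^{2}}{\left(\sigma_{j,d}^{Q}\right)^{2}} - \frac{1}{2}\sum_{d=1}^{D}\frac{\left(\sigma_{g,d}^{P}\right)^{2}}{\left(\sigma_{j,d}^{Q}\right)^{2}}\right] + C \qquad (2)$$

Assuming weights and covariance matrices are speaker independent (+ some approximations):

$$H(P,Q) \approx -\frac{1}{2}\sum_{g=1}^{G} w_g \sum_{d=1}^{D}\frac{\left(\mu_{g,d}^{P}-\mu_{g,d}^{Q}\right)^{2}}{\sigma_{g,d}^{2}} + C \qquad (3)$$

Mapping T is induced:

$$T: \mathrm{GMM} \to \mathbb{R}^{G \cdot D}; \qquad T(\mathrm{GMM}) = \hat{\mu}, \quad \hat{\mu}_{g \cdot D + d} = \frac{\sqrt{w_g}\,\mu_{g,d}^{\mathrm{GMM}}}{\sigma_{g,d}} \qquad (4)$$

With this mapping, (3) reads $H(P,Q) \approx -\frac{1}{2}\lVert T(P) - T(Q)\rVert^{2} + C$.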
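The induced mapping can be sketched as follows, assuming shared (UBM) weights and diagonal variances and a zero-based component index so that entry g*D+d holds the scaled mean of Gaussian g in dimension d. The function names are hypothetical, and the constant C is dropped.

```python
import numpy as np

def supervector(means, ubm_weights, ubm_variances):
    """T(GMM): stack sqrt(w_g) * mu_{g,d} / sigma_{g,d} into one vector of length G*D.

    means, ubm_variances: (G, D); ubm_weights: (G,).
    """
    scaled = np.sqrt(ubm_weights)[:, None] * means / np.sqrt(ubm_variances)
    return scaled.reshape(-1)   # row-major: index g*D + d

def approx_score(means_p, means_q, ubm_weights, ubm_variances):
    """Approximated score up to the additive constant: -0.5 * ||T(P) - T(Q)||^2."""
    diff = (supervector(means_p, ubm_weights, ubm_variances)
            - supervector(means_q, ubm_weights, ubm_variances))
    return -0.5 * float(diff @ diff)
```

Because the score is a squared Euclidean distance between fixed-length vectors, it no longer depends on the number of frames in either session, and standard vector indexing techniques become applicable.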


Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4

Results

Figure and Table taken from: H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.


Other Mapping Techniques

1. Anchor modeling projection [Sturim 2001]
• efficient but inaccurate

2. MLLR transforms [Stolcke 2005]
• accurate but inefficient

3. Kernel-PCA-based mapping [Aronowitz 2007c]
• Given a set of objects and a kernel function (a dot product between each pair of objects), finds a mapping of the objects into R^n which preserves the kernel function.
• accurate & efficient


Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction

The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• channel, noise
• language
• stress, emotion, aging

The frame independence assumption does not hold in these cases!

$$\Pr(y_1,\ldots,y_T \mid S) = \prod_{t=1}^{T}\Pr(y_t \mid S) \qquad (1)$$

Instead, we can use a more relaxed assumption:

$$\Pr(y_1,\ldots,y_T \mid S, f) = \prod_{t=1}^{T}\Pr(y_t \mid S, f) \qquad (2)$$

which leads to:

$$\Pr(y_1,\ldots,y_T \mid S) = \int \Pr(y_1,\ldots,y_T \mid S, f)\,\Pr(f \mid S)\,df = \int \Pr(f \mid S)\prod_{t=1}^{T}\Pr(y_t \mid S, f)\,df \qquad (3)$$
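To see how integrating over f couples the frames, here is a toy one-dimensional sketch. It assumes f is a session offset with f ~ N(0, tau2) and y_t | S, f ~ N(m_S + f, 1); these distributions and all names are illustrative assumptions, not the model used in the talk.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def loglik_independent(y, speaker_mean, tau2):
    """Frames treated as independent, each with marginal variance 1 + tau2."""
    return float(np.sum(log_normal(y, speaker_mean, 1.0 + tau2)))

def loglik_session(y, speaker_mean, tau2, grid=20001, span=12.0):
    """Integrate Pr(f|S) * prod_t Pr(y_t|S,f) over f on a uniform grid."""
    half = span * np.sqrt(tau2)
    f = np.linspace(-half, half, grid)
    log_integrand = log_normal(f, 0.0, tau2) + np.sum(
        log_normal(y[:, None], speaker_mean + f[None, :], 1.0), axis=0)
    m = log_integrand.max()
    step = f[1] - f[0]
    return float(m + np.log(np.sum(np.exp(log_integrand - m)) * step))

def loglik_session_closed_form(y, speaker_mean, tau2):
    """Same quantity via the joint Gaussian N(m_S, I + tau2 * 1 1^T)."""
    T = len(y)
    cov = np.eye(T) + tau2 * np.ones((T, T))
    r = y - speaker_mean
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * (T * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(cov, r)))

y = np.array([1.0, 1.2, 0.8, 1.1])   # frames sharing a common offset
ll_indep = loglik_independent(y, 0.0, 1.0)
ll_sess = loglik_session(y, 0.0, 1.0)
```

In this toy model the frames are conditionally independent given f but correlated once f is integrated out, so a session whose frames share a common offset scores higher under the session model than under the independence assumption.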


Old vs. New Generative Models

Old Model: Speaker → a GMM → frame sequence (frames generated independently)

New Model: Speaker → a PDF over GMM space → Session GMM (a GMM) → frame sequence (frames generated independently)


Session-GMM Space

[Figure: session-GMM space showing speaker #1, speaker #2 and speaker #3, with the GMM for session A of speaker #1 and the GMM for session B of speaker #1 marked]


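Modeling in Session-GMM space 1/2

Recall mapping T induced by the GMM approximation analysis:

$$T: \mathrm{GMM} \to \mathbb{R}^{G \cdot D}; \qquad T(\mathrm{GMM}) = \hat{\mu}, \quad \hat{\mu}_{g \cdot D + d} = \frac{\sqrt{w_g}\,\mu_{g,d}^{\mathrm{GMM}}}{\sigma_{g,d}}$$

• $\hat{\mu}$ is called a supervector
• A speaker is modeled by a multivariate normal distribution in supervector space:

$$\Pr(\hat{\mu} \mid S) \sim N(\mu_s, \tilde{\Sigma}) \qquad (3)$$

• A typical dimension of $\tilde{\Sigma}$ (a $G \cdot D \times G \cdot D$ matrix) is 50,000*50,000
• $\tilde{\Sigma}$ is estimated robustly using PCA + regularization: the covariance is assumed to be a low rank matrix with an additional non-zero (noise) diagonal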


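Modeling in Session-GMM Space 2/2: Estimating covariance matrix

[Figure: in supervector space, sessions 1 and 2 of speaker #1, speaker #2 and speaker #3 are scattered with covariance $\tilde{\Sigma}$; in delta supervector space, the differences between the two sessions of each speaker have covariance $2\tilde{\Sigma}$]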
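One way to realize a "low rank plus noise diagonal" estimate from delta supervectors is sketched below. The specific regularization (spreading the leftover variance evenly over the remaining dimensions) is an assumption made for illustration; the slides do not specify it, and the function name is hypothetical.

```python
import numpy as np

def estimate_intra_speaker_cov(session_pairs, rank):
    """Estimate Sigma ~= basis @ diag(eigvals) @ basis.T + noise_var * I.

    session_pairs: (N, 2, M) array holding two supervectors of the same
    speaker per row. Requires rank < M.
    """
    # Differences of two sessions of one speaker have covariance 2 * Sigma
    deltas = session_pairs[:, 0, :] - session_pairs[:, 1, :]       # (N, M)
    n, dim = deltas.shape
    # PCA via SVD of the scaled deltas avoids forming the M x M matrix
    _, sing, vt = np.linalg.svd(deltas / np.sqrt(2.0 * n), full_matrices=False)
    var = sing ** 2                         # eigenvalues of the Sigma estimate
    basis = vt[:rank].T                     # (M, rank) principal directions
    # Assumption: leftover variance is shared equally by the other dimensions
    noise_var = float(var[rank:].sum() / (dim - rank))
    eigvals = np.maximum(var[:rank] - noise_var, 0.0)
    return basis, eigvals, noise_var
```

Working with the N x M delta matrix rather than the M x M covariance matters at supervector scale, where M is on the order of 50,000 and N is far smaller.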


Experimental Setup

Datasets

• $\tilde{\Sigma}$ is estimated from the NIST-2006-SRE corpus
• Evaluation is done on the NIST-2004-SRE corpus

System description

• ETSI MFCC (13-cep + 13-delta-cep)
• Energy based voice activity detector
• Feature warping
• 2048 Gaussians
• Target models are adapted from GI-UBM
• ZT-norm score normalization


Results

38% reduction in EER


Other Modeling Techniques

• NAP+SVMs [Campbell 2006]

• Factor Analysis [Kenny 2005]

• Kernel-PCA [Aronowitz 2007c]

Kernel-PCA based algorithm

• Model each supervector as s + u, where
  s ∈ S : Common speaker subspace
  u ∈ U : Speaker unique subspace
• S is spanned by a set of development supervectors (700 speakers)
• U is the orthogonal complement of S in supervector space
• Intra-speaker variability is modeled separately in S and in U
• U was found to be more discriminative than S
• EER was reduced by 44% compared to baseline GMM


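Kernel-PCA Based Modeling

[Figure: sessions x and y in session space are mapped by the kernel-induced function f to f(x) and f(y) in feature space; K-PCA on the anchor sessions gives Tx and Ty in the common speaker subspace (R^n), with components ux and uy in the speaker unique subspace]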


Trainable Speaker Diarization [Aronowitz 2007d]

Goals

• Detect speaker changes – “speaker segmentation”

• Cluster speaker segments – “speaker clustering”

Motivation for new method

Current algorithms do not exploit available training data! (besides tuning thresholds, etc.)

Method

Explicitly model inter-segment intra-speaker variability from labeled training data, and use it to define the metric used by change-detection / clustering algorithms.


Speaker recognition on pairs of 3s segments

Dev data

• BNAD05 (5hr) – Arabic, broadcast news

Eval data

• BNAT05 – Arabic, broadcast news (207 target models, 6756 test segments)

System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                        8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling   7.4


Speaker Diarization System & Experiments

Speaker change detection

• 2 adjacent sliding windows (3s each)

• Speaker verification scoring + normalization

Speaker clustering

• Speaker verification scoring + normalization

• Bottom-up clustering

Speaker Error Rate (SER) on BNAT05

• Anchor modeling (baseline): 12.9%

• Kernel-PCA based method: 7.9%
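The change-detection step with two adjacent sliding windows can be sketched as follows. The distance between window means is only a stand-in for the speaker verification score with normalization used in the system, and the function names, window length in frames, and threshold are hypothetical.

```python
import numpy as np

def change_scores(features, win):
    """Distance between the means of the two adjacent windows meeting at frame t."""
    scores = np.zeros(len(features))
    for t in range(win, len(features) - win + 1):
        left = features[t - win:t].mean(axis=0)
        right = features[t:t + win].mean(axis=0)
        scores[t] = np.linalg.norm(left - right)   # stand-in for a verification score
    return scores

def detect_changes(features, win, threshold):
    """Frames whose score exceeds the threshold and is the maximum within +/- win frames."""
    s = change_scores(features, win)
    changes = []
    for t in range(win, len(features) - win + 1):
        lo, hi = max(0, t - win), min(len(s), t + win + 1)
        if s[t] > threshold and s[t] == s[lo:hi].max():
            changes.append(t)
    return changes

# Toy signal: the "speaker" changes at frame 100
features = np.concatenate([np.zeros((100, 1)), 5.0 * np.ones((100, 1))])
changes = detect_changes(features, win=20, threshold=2.5)   # -> [100]
```

Replacing the distance with a trained metric is where the intra-speaker variability model would enter; the resulting segments would then be passed to bottom-up clustering using the same kind of score.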


Summary 1/2

• A method for mapping speech segments into a GMM supervector space was described

• Intra-speaker inter-session variability is modeled in GMM supervector space

Speaker recognition

• EER was reduced by 38% on the NIST-2004 SRE

• A corresponding kernel-PCA based approach reduces EER by 44%

Speaker diarization

• SER for speaker diarization was reduced by 39%.
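The 39% figure is a relative reduction and can be checked against the SER values reported for BNAT05 (12.9% for the anchor modeling baseline, 7.9% for the kernel-PCA based method). The helper name below is hypothetical.

```python
def relative_reduction(baseline, new):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - new) / baseline

# SER on BNAT05: anchor modeling baseline vs. kernel-PCA based method
ser_reduction = relative_reduction(12.9, 7.9)   # about 38.8, i.e. 39% after rounding
```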


Summary 2/2: Algorithms based on the proposed framework

• Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]

• Speaker diarization (“who spoke when”) [Aronowitz 2007d]

• VAD (voice activity detection) [Aronowitz 2007a]

• Language identification [Noor & Aronowitz 2006]

• Gender identification [Bocklet 2008]

• Age detection [Bocklet 2008]

• Channel/bandwidth classification [Aronowitz 2007d]


Bibliography 1/2

[1] D. A. Reynolds et al., “Speaker identification and verification using Gaussian mixture speaker models”, Speech Communication, 17, 91-108, 1995.

[2] D. E. Sturim et al., “Speaker indexing in large audio databases using anchor models”, in Proc. ICASSP, 2001.

[3] H. Aronowitz, D. Burshtein, A. Amir, “Speaker indexing in audio archives using test utterance Gaussian mixture modeling”, in Proc. ICSLP, 2004.

[4] H. Aronowitz, D. Burshtein, A. Amir, “A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification”, in Proc. ICASSP, 2005.

[5] P. Kenny et al., “Factor Analysis Simplified”, in Proc. ICASSP, 2005.

[6] H. Aronowitz, D. Irony, D. Burshtein, “Modeling Intra-Speaker Variability for Speaker Recognition”, in Proc. Interspeech, 2005.

[7] J. Goldberger and H. Aronowitz, “A distance measure between GMMs based on the unscented transform and its application to speaker recognition”, in Proc. Interspeech, 2005.

[8] H. Aronowitz, D. Burshtein, “Efficient Speaker Identification and Retrieval”, in Proc. Interspeech, 2005.


Bibliography 2/2

[9] A. Stolcke et al., “MLLR Transforms as Features in Speaker Recognition”, in Proc. Interspeech, 2005.

[10] E. Noor, H. Aronowitz, “Efficient Language Identification using Anchor Models and Support Vector Machines”, in Proc. ISCA Odyssey Workshop, 2006.

[11] W. M. Campbell et al., “SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation”, in Proc. ICASSP, 2006.

[12] H. Aronowitz, “Segmental modeling for audio segmentation”, in Proc. ICASSP, 2007.

[13] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models”, in Proc. ICASSP, 2007.

[14] H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.

[15] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.

[16] H. Aronowitz, “Trainable Speaker Diarization”, in Proc. Interspeech, 2007.

[17] T. Bocklet et al., “Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines”, in Proc. ICASSP, 2008.


Presentation is available online at: http://aronowitzh.googlepages.com/

Thanks!


Backup slides


Kernel-PCA Based Mapping 2/5

Goals:
- Map sessions into feature space
- Model in feature space

[Figure: the kernel trick; sessions x and y and the anchor sessions in session space are mapped by f() to f(x) and f(y) in the dot-product feature space]


Kernel-PCA Based Mapping 3/5

Given
- kernel K
- n anchor sessions $A_1,\ldots,A_n$

Find an orthonormal basis for $\mathrm{span}\{f(A_1),\ldots,f(A_n)\}$

Method

1) Compute eigenvectors of the centralized kernel-matrix $k_{i,j} = K(A_i,A_j)$.

2) Normalize eigenvectors by square-roots of corresponding eigenvalues → $\{v_i\}$

3) $\{f_i\}$ for $f_i = \big(f(A_1),\ldots,f(A_n)\big)\,v_i$ is the requested basis


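Kernel-PCA Based Mapping 4/5

Common speaker subspace - $C = \mathrm{span}\{f(A_1),\ldots,f(A_n)\}$

Speaker unique subspace - $U = F / \mathrm{span}\{f(A_1),\ldots,f(A_n)\}$

Given sessions x, y, $f(x)$, $f(y)$ may be uniquely represented as:

$$f(x) = c_x + u_x \ \text{and}\ f(y) = c_y + u_y \ \text{with}\ c_x, c_y \in C \ \text{and}\ u_x, u_y \in U$$

$$T: x \mapsto \big(v_1,\ldots,v_n\big)^{\top}\big(K(x,A_1),\ldots,K(x,A_n)\big)^{\top}$$

is a mapping x→R^n with the property:

$$\lVert T(x) - T(y)\rVert^{2} = \lVert c_x - c_y\rVert^{2}$$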
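The basis construction and the mapping T can be sketched with NumPy as below. Centering a new session's kernel vector in the same way as the anchor kernel matrix, and discarding near-zero eigenvalues, are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np

def fit_kernel_pca(kernel_matrix, tol=1e-10):
    """From the n x n anchor kernel matrix, return (V, col_mean, total_mean).

    Columns of V are eigenvectors of the centered kernel matrix divided by
    the square roots of their eigenvalues.
    """
    col_mean = kernel_matrix.mean(axis=0)
    total_mean = kernel_matrix.mean()
    centered = kernel_matrix - col_mean[None, :] - col_mean[:, None] + total_mean
    eigvals, eigvecs = np.linalg.eigh(centered)
    keep = eigvals > tol * eigvals.max()        # drop numerically null directions
    V = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return V, col_mean, total_mean

def project(k_x, V, col_mean, total_mean):
    """T(x) = V^T k_x, where k_x = (K(x, A_1), ..., K(x, A_n)) is centered first."""
    k_centered = k_x - k_x.mean() - col_mean + total_mean
    return V.T @ k_centered

# Toy check with a linear kernel, where f is the identity map
rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 3))
V, col_mean, total_mean = fit_kernel_pca(anchors @ anchors.T)
x, y = rng.normal(size=3), rng.normal(size=3)
tx = project(anchors @ x, V, col_mean, total_mean)
ty = project(anchors @ y, V, col_mean, total_mean)
```

In this toy case the centered anchors span all of R^3, so the speaker unique components vanish and the distance between tx and ty equals the distance between x and y.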


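Kernel-PCA Based Mapping 5/5

[Figure: sessions x and y in session space are mapped to f(x) and f(y) in feature space; K-PCA on the anchor sessions gives Tx and Ty in the common speaker subspace (R^n), with components ux and uy in the speaker unique subspace]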


Modeling in Segment-GMM Supervector Space

[Figure: the frame sequences of segment #1, segment #2, ..., segment #n are each mapped to a point in segment-GMM supervector space, which contains speech, silence and music regions]


Segmental Modeling for Audio Segmentation

Goal

• Segment audio accurately and robustly into speech / silence / music segments.

Novel idea

• Acoustic modeling is usually done on a frame-basis.

• Segmentation/classification is usually done on a segment-basis (using smoothing).

Why not explicitly model whole segments?

Note: speaker, noise, music-context, channel (etc.) are constant during a segment.


Speech / Silence Segmentation – Results 1/2

[Figure: SPEECH / SILENCE SEGMENTATION; x-axis: speech miss probability, y-axis: silence miss probability (log scale, 10^-2 to 10^-1); curves: IBM EVAL06, IBM EVAL06 no-pad, GMM baseline, Segmental]

System            EER     FA @ FR=0.5%    FR @ FA=1%
EVAL06            FA=24.2% @ FR=0.25%
GMM baseline      2.9%    7.9%            29.6%
Segmental         1.7%    5.1%            2.7%
Error reduction   41%     35%             91%


Speech / Music Segmentation – Results 2/2

[Figure: SPEECH / MUSIC SEGMENTATION; x-axis: speech miss probability, y-axis: music miss probability (log scale, 10^-3 to 10^-1); curves: IBM EVAL06, IBM EVAL06 no-pad, GMM baseline, Segmental]

System            EER     FA @ FR=0.5%    FR @ FA=1%
EVAL06            FA=69% @ FR=0.25%
GMM baseline      1.43%   3.4%            3.2%
Segmental         1.27%   2.0%            1.9%
Error reduction   11%     41%             41%


LID in Session Space

[Figure: session space with English, Arabic and French regions; markers distinguish training sessions from the test session]


LID in Session Space - Algorithm

1. Front end: shifted delta cepstrum (SDC).

2. Represent every train/test session by a GMM super-vector.

3. Train a linear SVM to classify GMM super-vectors.

Results

• EER=4.1% on the NIST-03 Eval (30sec sessions).


Anchor Modeling Projection

Given: anchor models λ1,…,λn and session X = x1,…,xF

Projection:

$$X \mapsto \big(\hat{s}_1(X),\ldots,\hat{s}_n(X)\big)$$

$$\hat{s}_i(X) = \frac{1}{F}\log\frac{\Pr(X \mid \lambda_i)}{\Pr(X \mid \lambda_{UBM})} = \text{average normalized log-likelihood}$$

• Speaker indexing [Sturim et al., 2001]

• Intersession variability modeling in projected space [Collet et al., 2005]

• Speaker clustering [Reynolds et al., 2004]

• Speaker segmentation [Collet et al., 2006]

• Language identification [Noor and Aronowitz, 2006]
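A minimal sketch of the anchor projection follows. For brevity each anchor model and the UBM is a single diagonal Gaussian given as a `(mean, variance)` pair, which is a simplifying assumption standing in for full GMMs; the function names are hypothetical.

```python
import numpy as np

def avg_loglik(frames, mean, var):
    """(1/F) * log Pr(X | model) for one diagonal Gaussian (stand-in for a GMM)."""
    ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (frames - mean) ** 2 / var, axis=1)
    return float(ll.mean())

def anchor_projection(frames, anchors, ubm):
    """Average log-likelihood ratio of session X against each anchor model.

    frames: (F, D); anchors: list of (mean, var); ubm: (mean, var).
    """
    base = avg_loglik(frames, *ubm)
    return np.array([avg_loglik(frames, *a) - base for a in anchors])

# Toy usage: two anchors in a 2-D feature space
frames = np.zeros((4, 2))
ubm = (np.zeros(2), np.ones(2))
anchors = [(np.zeros(2), np.ones(2)), (np.array([1.0, 0.0]), np.ones(2))]
s_hat = anchor_projection(frames, anchors, ubm)   # one coordinate per anchor
```

The session is thereby represented by an n-dimensional vector, one coordinate per anchor model, regardless of its length F.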


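Intra-Class Variability Modeling: Introduction

The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• Noise
• Channel
• Language
• Changing speaker characteristics – stress, emotion, aging

The frame independence assumption does not hold in these cases!

$$\Pr(y_1,\ldots,y_T \mid S) = \prod_{t=1}^{T}\Pr(y_t \mid S) \qquad (1)$$

Instead, we get:

$$\Pr(y_1,\ldots,y_T \mid S) = \int \Pr(y_1,\ldots,y_T \mid S, f)\,\Pr(f \mid S)\,df = \int \Pr(f \mid S)\prod_{t=1}^{T}\Pr(y_t \mid S, f)\,df \qquad (2)$$

with the frame term written as $\Pr(y_t \mid G_{S,f})$ and the session term as $\Pr(G_{S,f} \mid S)$.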
