
Discriminative Feature Optimization for Speech Recognition

Jan 30, 2016

Page 1: Discriminative Feature Optimization for Speech Recognition

1

Discriminative Feature Optimization for Speech Recognition

Bing Zhang

College of Computer & Information Science Northeastern University

Page 2: Discriminative Feature Optimization for Speech Recognition

2

Outline

Introduction

Problem to attack

Methodology
– Region-dependent feature transform
– Discriminative optimization of the feature transform

Implementation

System description & results

Conclusions

Page 3: Discriminative Feature Optimization for Speech Recognition

3

Introduction

Speech recognition
– Goal: transcribe speech into text
– Performance measurement: word error rate (WER)
– Typical approach:
• Training: statistically model the acoustic and linguistic knowledge
• Recognition: search for the most probable word sequence using the models

Speech feature extraction
– Reason: raw signals cannot be robustly modeled due to their high dimensionality, so compact features have to be extracted
– Two stages of feature extraction:
• speech analysis → cepstral coefficients
• speech feature transformation
– In this thesis: a better feature transformation approach is developed to reduce the WER of the speech recognition system

Page 4: Discriminative Feature Optimization for Speech Recognition

4

Introduction (cont.)

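[Figure: a typical speech recognition system. Blocks: Feature Extraction, Search Engine, Acoustic Model, Language Model. The speech signal enters Feature Extraction, which produces the features; the Search Engine uses the features together with the Acoustic Model and the Language Model to output the word sequence.]

$W^* = \arg\max_W \Pr(X \mid W, \lambda_{AM}) \, \Pr(W \mid \lambda_{LM})$

W: word sequence; X: features; λ_AM: acoustic model; λ_LM: language model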

Page 5: Discriminative Feature Optimization for Speech Recognition

5

Language Model

N-grams
– Model the conditional probability of any word given the N-1 words in its history
– The product of N-gram probabilities can be used to approximate the probability of a sequence of words:
• $P(w_1, w_2, \ldots, w_k) \approx P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \ldots, w_{k-2}) \, P(w_k \mid w_{k-(N-1)}, \ldots, w_{k-1})$
– Special cases:
• Unigram: P(wi)
• Bigram: P(wi | wi-1)
• Trigram: P(wi | wi-2, wi-1)
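As an illustration of the bigram special case, the sketch below estimates bigram probabilities by relative frequency and multiplies them to score a sentence. The toy corpus, the `<s>` start symbol, and the function names are assumptions made for this example; there is no smoothing, so an unseen bigram gives probability zero.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Count histories and bigrams; "<s>" marks the sentence start."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts


def sentence_logprob(words, history_counts, bigram_counts):
    """Sum of log P(w_i | w_{i-1}) using relative-frequency estimates."""
    tokens = ["<s>"] + words
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        count = bigram_counts[(prev, cur)]
        if count == 0:
            return float("-inf")  # unseen bigram, no smoothing in this sketch
        logp += math.log(count / history_counts[prev])
    return logp


# Hypothetical two-sentence corpus.
corpus = [["this", "is", "a", "test"], ["this", "is", "the", "test"]]
history_counts, bigram_counts = train_bigram(corpus)
logp = sentence_logprob(["this", "is", "a", "test"], history_counts, bigram_counts)
unseen = sentence_logprob(["a", "test"], history_counts, bigram_counts)
```

Here only P(a | is) = 1/2 differs from one, so the sentence probability is 0.5. A practical language model would add smoothing or back-off for unseen N-grams.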

Page 6: Discriminative Feature Optimization for Speech Recognition

6

HMM-based Acoustic Model

Repository of unit HMMs (hidden Markov models)
– Each HMM is a probabilistic finite state machine with outputs at each hidden state
• Transition probabilities
• Observation probabilities (modeled by a mixture of Gaussians for each state)
– Each HMM represents a basic unit of speech, e.g., phoneme, crossword/non-crossword multiphones

HMM state-clusters: specify which HMM states can share which parameters

Pronunciation dictionary: phonetic spelling of the words

Page 7: Discriminative Feature Optimization for Speech Recognition

7

Example of an HMM

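[Figure: a four-state HMM (Start, states 1 to 4, End) with self-loop transitions a11, a22, a33, a44, forward transitions a12, a23, a34, and skip transitions a13, a24, shown above the observations o1 … o6.]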

Page 8: Discriminative Feature Optimization for Speech Recognition

8

Example of an HMM

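[Figure: two possible state paths through the HMM for the observations o1 … o6.
Path 1 uses transitions a11, a12, a23, a33, a34, with observation probabilities b1(o1) b1(o2) b2(o3) b3(o4) b3(o5) b4(o6).
Path 2 uses transitions a12, a22, a24, a44, with observation probabilities b1(o1) b2(o2) b2(o3) b2(o4) b4(o5) b4(o6).]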
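The score of one state path is the product of the transition probabilities along the path and the observation probabilities of the visited states. The sketch below computes it in the log domain for the two paths in the figure. All probability values are made up for illustration, and the Start and End transitions are ignored.

```python
import math


def path_log_likelihood(states, log_trans, log_obs):
    """Joint log-probability of a state path and its observations.

    states:    state index for each frame
    log_trans: dict mapping (i, j) to log a_ij
    log_obs:   list (one entry per frame) of dicts mapping state s to log b_s(o_t)
    """
    total = log_obs[0][states[0]]
    for t in range(1, len(states)):
        total += log_trans[(states[t - 1], states[t])]
        total += log_obs[t][states[t]]
    return total


# Hypothetical transition probabilities a_ij.
trans = {(1, 1): 0.6, (1, 2): 0.3, (1, 3): 0.1,
         (2, 2): 0.5, (2, 3): 0.3, (2, 4): 0.2,
         (3, 3): 0.7, (3, 4): 0.3, (4, 4): 0.5}
log_trans = {arc: math.log(p) for arc, p in trans.items()}

# Hypothetical observation probabilities: b_s(o_t) = 0.2 for every state and frame.
log_obs = [{s: math.log(0.2) for s in (1, 2, 3, 4)} for _ in range(6)]

path1 = [1, 1, 2, 3, 3, 4]  # emits b1(o1) b1(o2) b2(o3) b3(o4) b3(o5) b4(o6)
path2 = [1, 2, 2, 2, 4, 4]  # emits b1(o1) b2(o2) b2(o3) b2(o4) b4(o5) b4(o6)
lp1 = path_log_likelihood(path1, log_trans, log_obs)
lp2 = path_log_likelihood(path2, log_trans, log_obs)
```

With these toy numbers path 1 scores higher than path 2. Training and recognition sum or maximize over all such paths rather than scoring them one by one.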


Page 10: Discriminative Feature Optimization for Speech Recognition

10

Acoustic Training

Maximum likelihood (ML) training
– Objective: maximize the conditional likelihood of the observed features given the model
– Algorithm: Expectation-maximization (EM)

Discriminative training
– Objective: train the model to distinguish the correct word sequence from other hypotheses
– Criterion
• Minimum phoneme error (MPE)
– Representation of hypotheses: lattices
– Algorithm: Extended EM

[Figure: example word lattice of competing hypotheses. Arc labels: SIL, this, is, a, the, test, quest, guest, sentence, sense.]

Page 11: Discriminative Feature Optimization for Speech Recognition

11

Feature Extraction

Speech analysis
– Deals with the problem of extracting distinguishing characteristics (e.g., formant locations) of speech from digital signals
– Examples: MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction)
– Resulting features: cepstral coefficients

Speech feature transformation
– Applied on top of the cepstral coefficients
– Transforms the cepstral features to better fit the model
• helps the HMM to model the trajectory of the cepstral features
• fits the diagonal covariance assumption of the Gaussian components

Page 12: Discriminative Feature Optimization for Speech Recognition

12

Commonly Used Feature Transforms

LDA (linear discriminant analysis)
– Transforms the features to maximize the distance between different classes while keeping each class as compact as possible
– Assumes that all classes have equal covariance

HLDA (heteroscedastic linear discriminant analysis)
– Removes the equal covariance assumption of LDA
– Finds the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space

Others
– HDA (heteroscedastic discriminant analysis)
– MLLT (maximum likelihood linear transform)

Page 13: Discriminative Feature Optimization for Speech Recognition

13

Drawbacks of Traditional Feature Transforms

Inaccurate assumptions about the acoustic model
– LDA assumes equal class covariance
– HDA & LDA ignore the diagonal covariance assumption

Linear transform
– A linear transform has limited power for feature extraction
– Using more powerful transforms can be risky when the criterion does not correlate with the WER

The criteria do not correlate with the WER
– Performance degrades on high-dimensional input features
• Experimental results in the thesis
– Performance degrades on highly correlated input features
• Example on the next slide

Page 14: Discriminative Feature Optimization for Speech Recognition

14

Example

The data has a linear dependency between two dimensions such that Z = 2X

[Figure: the sample data plotted in X-Y-Z space and in the X-Z plane.]

If projected to 1-D
– HLDA will map all samples to one single point
– LDA will fail to find an answer at all, because the covariance matrix of each class is singular
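The singularity can be checked numerically. The sketch below draws hypothetical samples with Z = 2X (the sample size and the Gaussian distributions of X and Y are assumptions for illustration) and shows that the 3×3 covariance matrix has rank 2, so it cannot be inverted as LDA requires.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)

# Columns X, Y, Z with the exact linear dependency Z = 2X.
samples = np.column_stack([x, y, 2.0 * x])

cov = np.cov(samples, rowvar=False)  # 3 x 3 sample covariance
rank = np.linalg.matrix_rank(cov)    # 2: one direction carries no variance
smallest_eigenvalue = np.linalg.eigvalsh(cov)[0]
```

The smallest eigenvalue is zero up to rounding error, which is what makes the within-class covariance singular in this example.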

Page 15: Discriminative Feature Optimization for Speech Recognition

15

A Better Approach

Region-dependent transform
– Nonlinear
– Computationally inexpensive to train

Discriminative training of the feature transform
– Criterion correlates well with the WER

Detailed acoustic model in feature training

Page 16: Discriminative Feature Optimization for Speech Recognition

16

Region Dependent Transform (RDT)

[Figure: scatter plot of a two-dimensional acoustic space divided into regions r1, r2, …, rN, each associated with its own transform f1, f2, …, fN.]

RDT:
– Divides the acoustic space into multiple regions
• e.g., r1, r2, …, rN
– Applies a different transform depending on which region the input feature vector belongs to
• e.g., f1, f2, …, fN

To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results:

$x_t = F_{RDT}(o_t) = \sum_{i=1}^{N} \Pr(r_i \mid o_t) \, f_i(o_t), \quad \text{where} \quad \sum_{i=1}^{N} \Pr(r_i \mid o_t) = 1$
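A minimal NumPy sketch of this interpolation follows. It assumes the regions are the components of a diagonal-covariance Gaussian mixture and that each f_i is an affine map A_i o + b_i; the function names, array shapes, and random toy parameters are illustrative only.

```python
import numpy as np


def region_posteriors(o, means, variances, weights):
    """Pr(r_i | o) for a diagonal-covariance GMM.

    o: (n,), means and variances: (N, n), weights: (N,)
    """
    log_gauss = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (o - means) ** 2 / variances, axis=1)
    log_joint = np.log(weights) + log_gauss
    log_joint -= log_joint.max()  # avoid underflow before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()


def rdt(o, means, variances, weights, A, b):
    """F_RDT(o) = sum_i Pr(r_i | o) (A_i o + b_i); A: (N, p, n), b: (N, p)."""
    post = region_posteriors(o, means, variances, weights)
    projected = A @ o + b    # (N, p): one transformed vector per region
    return post @ projected  # (p,): posterior-weighted interpolation


# Toy setup: N = 3 regions, n = 4 input dimensions, p = 2 output dimensions.
rng = np.random.default_rng(0)
N, n, p = 3, 4, 2
means = rng.normal(size=(N, n))
variances = rng.uniform(0.5, 2.0, size=(N, n))
weights = np.full(N, 1.0 / N)
A = rng.normal(size=(N, p, n))
b = rng.normal(size=(N, p))
o = rng.normal(size=n)

post = region_posteriors(o, means, variances, weights)
x = rdt(o, means, variances, weights, A, b)
```

Because the posteriors sum to one, giving every region the same A and b reduces the result to a single global affine transform.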

Page 17: Discriminative Feature Optimization for Speech Recognition

17

More Details of RDT

Input features: long-span features
– A long-span feature vector is formed by concatenating the cepstral features from consecutive frames, centered at the current frame
– Advantage: contains information about the acoustic context of the current frame

Division of the regions: global Gaussian mixture model (GMM)
– Trained via unsupervised clustering
– Each Gaussian component in the GMM corresponds to a region

Region-specific transforms
– In general, they can be any projections of long-span feature vectors
– In this thesis, linear projections are studied
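The frame concatenation behind long-span features can be sketched as follows. The function name and the edge padding at utterance boundaries are assumptions for illustration; the 15-frame window and 15 coefficients per frame (14 PLP cepstra plus energy) follow the experimental setup given later in the deck.

```python
import numpy as np


def splice_frames(cepstra, context=7):
    """Concatenate each frame with `context` frames on either side.

    cepstra: (T, d) array of per-frame cepstral features.
    Returns a (T, (2 * context + 1) * d) array of long-span features.
    Boundary frames are padded by repeating the first or last frame
    (an assumption; the source does not say how edges are handled).
    """
    T, _ = cepstra.shape
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])


# 20 hypothetical frames of 15 coefficients each.
cepstra = np.arange(20 * 15, dtype=float).reshape(20, 15)
long_span = splice_frames(cepstra, context=7)  # 15-frame window
```

Each output row holds 15 × 15 = 225 values, with the current frame in the middle block.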

Page 18: Discriminative Feature Optimization for Speech Recognition

18

Special Cases of RDT

RDT: generic projection, $f_i: \mathbb{R}^n \to \mathbb{R}^p$

RDLT: linear projection, $f_i(o_t) = A_i o_t + b_i$

Special cases: MPE-HLDA, SPLICE, fMPE#, mean-offset fMPE#
– Only one region: $i \in [1, 1]$
– Only offset: $f_i(o_t) = P o_t + b_i$ (P is not region-dependent)
– Rotation matrix plus offset: $f_i(o_t) = T_i P o_t + b_i$

Note (#): fMPE also includes a context-expansion layer, which does not fit this categorization (see thesis for details).

Page 19: Discriminative Feature Optimization for Speech Recognition

19

Projections vs. Offsets in RDT

The projection and the offset in RDT:

$f_i(o_t) = A_i o_t + b_i$, where $A_i$ is the projection and $b_i$ is the offset

Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be less than the number of regions.

Transform   # Uniq. proj.   # Uniq. offset   WER (%)
LDA+MLLT    -               -                25.9
RDT         1               0                24.9
RDT         0               1000             24.6
RDT         1               1000             24.0
RDT         1000            0                22.3
RDT         1000            1000             22.3

Page 20: Discriminative Feature Optimization for Speech Recognition

20

Optimization Criterion of RDT

Minimum Phoneme Error (MPE) criterion
– Gives significant gains when used to train the HMM
– Correlates well with WER
– Can be rewritten as a function of the feature transform:

$H_{MPE}(O, \lambda, F_{RDT}) = \sum_{r=1}^{R} \sum_{k=1}^{K_r} \Pr(W_{rk} \mid F_{RDT}(O_r), \lambda) \, \alpha(W_{rk})$

O, Or: original feature vectors; λ: the HMM; FRDT: the feature transform;

α(Wrk): the accuracy score of hypothesized word sequence Wrk

[Figure: plot relating the MPE score to WER (axis labels: MPE Score, WER).]
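To make the criterion concrete, the sketch below evaluates the double sum for a small hypothesis list per utterance. It assumes each hypothesis comes with a combined log score from which the posterior Pr(W_rk | F_RDT(O_r), λ) is obtained by normalizing over that utterance's hypotheses; a real system would take the posteriors from lattices. The function name and all numbers are illustrative.

```python
import numpy as np


def mpe_score(utterances):
    """Sum over utterances of the posterior-weighted hypothesis accuracy.

    utterances: list of (log_scores, accuracies) pairs, one pair per
    utterance r, each holding one entry per hypothesis W_rk.
    """
    total = 0.0
    for log_scores, accuracies in utterances:
        log_scores = np.asarray(log_scores, dtype=float)
        post = np.exp(log_scores - log_scores.max())
        post /= post.sum()  # posterior of each hypothesis
        total += float(post @ np.asarray(accuracies, dtype=float))
    return total


# Two hypothetical utterances with two hypotheses each.
utterances = [
    ([0.0, 0.0], [4.0, 2.0]),                 # posteriors 0.5 / 0.5
    ([-1.0, -1.0 - np.log(3.0)], [5.0, 1.0]), # posteriors 0.75 / 0.25
]
score = mpe_score(utterances)  # 0.5*4 + 0.5*2 + 0.75*5 + 0.25*1 = 7.0
```

The score rises when the transform shifts posterior mass toward hypotheses with higher accuracy, which is what ties the criterion to the error rate.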

Page 21: Discriminative Feature Optimization for Speech Recognition

21

HMM Updating Methods

In MPE, the HMM depends on the transformed features, so we have to define how it is updated
– In choosing the HMM updating method, the concern is to make the trained transform more generic, i.e., reusable across different training setups, including:
• both ML and MPE training
• different types of HMMs
– This goal can be achieved if the feature transform focuses on separating the data
– To ensure that, the HMM should describe the data itself rather than anything else

Page 22: Discriminative Feature Optimization for Speech Recognition

22

HMM Updating Methods (cont.)

If the HMM is updated discriminatively, e.g., under MPE
– Some Gaussians in the HMM will model decision boundaries, lying away from the mass of the data
– The feature transform will be diverted from separating the real data
– The resulting transform is less generic
– This method is OK if there is only one HMM to train

If the HMM is updated under ML
– The Gaussians will stay on the data
– The feature transform will also focus on the data
– The resulting transform is more generic
– This method is preferred if there are different HMMs to train

We assume ML updating of the HMM in this thesis

Page 23: Discriminative Feature Optimization for Speech Recognition

23

Example

[Figure: a grid of panels comparing a discriminative model and an ML model (columns) before and after the transform (rows). Annotation: "Since the model is already discriminative, nothing needs to be done here."]

Page 24: Discriminative Feature Optimization for Speech Recognition

24

Training the Feature Transform

The transform is trained using a numerical optimization algorithm

Derivative of MPE with respect to the transform
– Two terms in the derivative
• MPE depends on the transformed features directly → direct derivative
• MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative
– Two passes of data processing
• The first pass computes the direct derivative using lattices
• The second pass computes the indirect derivative using reference transcripts
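The two terms can be written out with the chain rule. In the sketch below, $x_t = F_{RDT}(o_t)$ denotes the transformed feature at frame $t$, $\theta$ stands for any parameter of the transform, and $\lambda$ is the HMM estimated from the transformed features. The symbol $\theta$ and this particular grouping are assumptions made for illustration, not notation taken from the thesis.

```latex
\frac{d H_{MPE}}{d \theta}
  = \sum_{t} \left(
      \underbrace{\left. \frac{\partial H_{MPE}}{\partial x_t} \right|_{\lambda}}_{\text{direct}}
      \; + \;
      \underbrace{\frac{\partial H_{MPE}}{\partial \lambda} \,
                  \frac{\partial \lambda}{\partial x_t}}_{\text{indirect}}
    \right)
    \frac{\partial x_t}{\partial \theta}
```

The direct term holds the HMM fixed, while the indirect term accounts for the HMM parameters changing as the features change.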

Page 25: Discriminative Feature Optimization for Speech Recognition

25

Training Procedure

Iterative update of RDT using numerical optimization

[Figure: training loop. Processing blocks: Apply Transform, Train/Update HMM, Compute MPE Derivative, Update RDT. Data: Original features, RDT, Projected features, HMM, Derivative, Reference transcripts, Lattices.]

Page 26: Discriminative Feature Optimization for Speech Recognition

26

Implementation

Feature transform network
– A directed acyclic network of primitive components
– Design goals:
• reuse primitive components (e.g., linear projection, frame concatenation)
• reuse the algorithm that applies the transform or computes the derivative
• easy to extend to other transforms
• efficient use of CPU time and memory
– Impact:
• enables numerical optimization of any differentiable components, including but not limited to RDT
• simplifies the BBN system by providing a unified representation of various transforms
• adds flexibility to the front-end processing in the BBN system

[Figure: example network with nodes Cepstra, Concatenation, Projection, Gauss. Mixture, and RDT.]
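The idea of reusing one apply/derivative algorithm across primitive components can be sketched as below. This is not the BBN implementation: the class names are hypothetical, only a linear primitive is shown, and the network is simplified to a chain rather than a general directed acyclic graph.

```python
import numpy as np


class Linear:
    """Primitive component: y = A x + b."""

    def __init__(self, A, b):
        self.A, self.b = A, b

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return self.A @ x + self.b

    def backward(self, grad_y):
        # Gradients of the objective with respect to this component's parameters.
        self.grad_A = np.outer(grad_y, self.x)
        self.grad_b = grad_y
        return self.A.T @ grad_y  # gradient with respect to the input


class Chain:
    """Applies components in order; back-propagates in reverse order."""

    def __init__(self, *components):
        self.components = components

    def forward(self, x):
        for component in self.components:
            x = component.forward(x)
        return x

    def backward(self, grad):
        for component in reversed(self.components):
            grad = component.backward(grad)
        return grad


rng = np.random.default_rng(0)
first = Linear(rng.normal(size=(3, 4)), rng.normal(size=3))
second = Linear(rng.normal(size=(2, 3)), rng.normal(size=2))
network = Chain(first, second)

x = rng.normal(size=4)
y = network.forward(x)
grad_x = network.backward(np.ones(2))  # objective: sum of the outputs
```

Any component that provides `forward` and `backward` can be dropped into the chain, which is the property that lets a generic optimizer train it.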

Page 27: Discriminative Feature Optimization for Speech Recognition

27

RDT and the State-of-the-art System

The state-of-the-art system at BBN
– Two sub-systems
• Speaker-independent (SI) system
• Speaker-adaptive (SA) system
– Two phases of training
• ML (initializes MPE training)
• MPE
– Three-pass decoding
• Three tied-mixture acoustic models

How RDT interacts with the system
– Trained once, used in three types of acoustic models
– Integrated with speaker adaptation

Page 28: Discriminative Feature Optimization for Speech Recognition

28

RDT in Speaker-independent (SI) Training

[Figure: two flow charts, labeled "SI training baseline" and "SI training with RDT". Blocks shown: LDA+MLLT, Initial Transform, ML Training, Lattice Generation, MPE Training, Bootstrapping, RDT Training. Outputs shown: ML-SI HMM, Lattices, MPE-SI HMM, RDT & HMM.]

Page 29: Discriminative Feature Optimization for Speech Recognition

29

Experimental Setup

Data
– Training: English Conversational Telephone Speech (CTS), 2300 hours SWB+Fisher
– Testing: Eval03+Dev04, 3 hours SWB-II, 6 hours Fisher

Analysis
– 14 Perceptual Linear Prediction (PLP) cepstral coefficients and normalized energy
– Vocal Tract Length Normalization (VTLN)

RDT
– 15-frame long-span features projected to 60 dimensions
– initialized from LDA+MLLT
– 1000 regions, one linear projection per region
– crossword state-cluster tied model (SCTM), 7K clusters
– number of Gaussians per state-cluster in the HMM varies across experiments

Page 30: Discriminative Feature Optimization for Speech Recognition

30

SI Results (ML)

Transform     ML model WER (%)
              12-GPS   44-GPS   120-GPS
LDA+MLLT      25.9     23.7     22.5
12-GPS RDT    22.3     22.1     21.9
44-GPS RDT    -        21.6     20.8#

Description
– Two RDTs were trained using HMMs with 12 Gaussians per state-cluster (GPS) and 44 GPS, respectively
– For decoding, several ML crossword SCTM models of different sizes were trained using either LDA+MLLT or RDT
– Only the lattice-rescoring pass was run in decoding, for simplicity
– (#): After the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., 9.3% relatively better than the LDA+MLLT result

Page 31: Discriminative Feature Optimization for Speech Recognition

31

SI Results (MPE)

Transform     MPE model WER (%)
              12-GPS   44-GPS   120-GPS
LDA+MLLT      22.1     21.1     20.4
12-GPS RDT    21.2     20.8     20.4
44-GPS RDT    -        20.3     19.6#

Description
– Same as the ML experiments, except that the final models were trained under MPE
– (#): After the other two models (STM, SCTM-NX) were trained, the WER was further reduced to 19.2%, i.e., 5.8% relatively better than the LDA+MLLT result
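The relative improvements quoted for the SI systems follow from the 120-GPS LDA+MLLT baselines (22.5% for ML, 20.4% for MPE) and the final RDT results (20.4% and 19.2%). The helper below is only a worked check of that arithmetic; the function name is an assumption.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


ml_gain = relative_reduction(22.5, 20.4)   # about 9.3
mpe_gain = relative_reduction(20.4, 19.2)  # about 5.9 from the rounded table values
```

From the rounded table entries the MPE figure comes out near 5.9% rather than the quoted 5.8%; the small gap is presumably due to the quoted value being computed from unrounded WERs.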

Page 32: Discriminative Feature Optimization for Speech Recognition

32

Speaker Adaptation

Speaker adaptation (figure)
– Assumption: the speaker-dependent models are linearly transformed from an SI model
– Variations
• MLLR: assumes that only the Gaussian means are transformed
• CMLLR: both means & covariances are transformed; equivalent to applying the inverse transform to the features while keeping the model fixed

Speaker-Adaptive Training (SAT)
– The SI model is not optimal for adaptation
– SAT tries to estimate a better model that, when transformed, gives the best likelihood of the data
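The CMLLR equivalence can be verified numerically for a single Gaussian: transforming the mean and covariance by (A, b) gives the same log-likelihood as applying the inverse transform to the feature and subtracting log|det A|. The sketch below uses random toy values; the variable names are illustrative.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)


rng = np.random.default_rng(0)
n = 3
mean = rng.normal(size=n)
cov = np.diag(rng.uniform(0.5, 2.0, size=n))   # SI Gaussian
A = rng.normal(size=(n, n)) + 3.0 * np.eye(n)  # speaker transform (well conditioned)
b = rng.normal(size=n)
x = rng.normal(size=n)                         # observed feature vector

# Model space: transform the mean and the covariance.
model_space = log_gaussian(x, A @ mean + b, A @ cov @ A.T)

# Feature space: inverse-transform the feature, keep the model fixed,
# and account for the Jacobian of the transform.
x_inverse = np.linalg.solve(A, x - b)
feature_space = log_gaussian(x_inverse, mean, cov) - np.linalg.slogdet(A)[1]
```

The two values agree, which is why CMLLR can be applied as a per-speaker feature transform without touching the Gaussians.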

[Figure: an SI model mapped by the speaker-specific transforms A(1), A(2), A(3), …, A(N) to the speaker models S(1), S(2), S(3), …, S(N).]

Page 33: Discriminative Feature Optimization for Speech Recognition

33

RDT in Speaker-adaptive Training (SAT)

Straightforward approach

Use SI-RDT transparently
– Simple
– But RDT is not optimized for SAT

[Figure: flow chart of the straightforward approach. Blocks shown: Train SI RDT, CMLLR Estimation, ML SAT, MPE Training. Outputs shown: SI RDT & HMM, SD Transforms, ML-SAT HMM, MPE-SAT HMM.]

Page 34: Discriminative Feature Optimization for Speech Recognition

34

RDT in Speaker-adaptive Training (SAT)

Iterative approach (SA-RDT)

Alternately update RDT and the speaker-dependent (SD) transforms
– Back-propagation is used to compute the derivative, since the SD transforms are applied on top of RDT
– RDT is optimized for SAT

[Figure: flow chart of the iterative approach. Blocks shown: Train SI RDT, CMLLR Estimation, ML SAT, Update RDT, MPE Training. Outputs shown: SI RDT & HMM, SD Transforms, ML-SAT HMM, SA RDT & HMM, MPE-SAT HMM.]

Page 35: Discriminative Feature Optimization for Speech Recognition

35

Adapted Results

Transform   SAT-ML WER (%)   SAT-MPE WER (%)
LDA+MLLT    20.2             18.5
SI-RDT      18.8             17.6
SA-RDT      18.0             17.2

Description
– Same training & testing data, state-cluster and LM as the unadapted experiments
– 10.9% relative WER reduction for the ML system
– 7.0% relative WER reduction for the MPE system

Page 36: Discriminative Feature Optimization for Speech Recognition

36

Alternative Procedure for SA-RDT

Simplified SA-RDT

Similar to the original SA-RDT

But the speaker-dependent transforms are estimated using the baseline model & features

[Figure: flow chart of the simplified procedure. Blocks shown: CMLLR Estimation, ML SAT, Update RDT, MPE Training. Inputs and outputs shown: SI LDA+MLLT & HMM, SD Transforms, ML-SAT HMM, SA RDT & HMM, MPE-SAT HMM.]

Page 37: Discriminative Feature Optimization for Speech Recognition

37

Adapted Results

Transform   SAT-ML WER (%)   SAT-MPE WER (%)
LDA+MLLT    21.5             20.6
SA-RDT1     20.8             19.7
SA-RDT2     20.5             19.2

Description
– 500 hours of training data
– Another set of SD transforms was applied before LDA/RDT
– SA-RDT1 used the simplified procedure
– SA-RDT2 used the original procedure
– The simplified procedure gave 2/3 of the gain while training the RDT only once

Page 38: Discriminative Feature Optimization for Speech Recognition

38

Conclusions

Original work
– Region-dependent transform
– Improved discriminative feature training that leads to a more generic feature transform
– Improved SAT procedure using RDT

Impact
– RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE, and the core of fMPE and mean-offset fMPE
– The method gives a significant WER reduction: 7% relative for the SAT-MPE English CTS system
– The method is potentially helpful for exploring novel acoustic features
• We do not have to worry about negative effects when adding new features to the input of the feature transform, because training decides whether and how to use the new features based on a criterion that is correlated with WER

Page 39: Discriminative Feature Optimization for Speech Recognition

39

Publications

B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme heteroscedastic linear discriminant analysis. In Proceedings of EARS RT-04 Workshop, 2004.

B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition, In Proceedings of ICASSP, 2005.

B. Zhang, S. Matsoukas and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006.
– Nominated for the Student Paper Award
– Awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society

B. Zhang, S. Matsoukas and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.