
Learning Long-Term Temporal Features in LVCSR Using Neural Networks

Barry Chen, Qifeng Zhu, Nelson Morgan, International Computer Science Institute (ICSI), Berkeley, CA, USA, MLMI 2004

Presenter: 張志豪, Date: 2005/02/24


Reference

• 1998 Hynek Hermansky, “TRAPS - Classifiers of Temporal Patterns”, ICSLP

• 1999 Hynek Hermansky, “Data-Derived Nonlinear Mapping For Feature Extraction in HMM”, ICASSP

• 2003 Barry Chen, “Learning Discriminative Temporal Patterns in Speech: Development of Novel TRAPS-Like Classifiers”, Eurospeech

• 2003 Hemant Misra, “New Entropy Based Combination Rules in HMM/ANN Multi-Stream ASR”, ICASSP


Outline

• Introduction
  – TRAPS
  – HATS

• MLP Architectures
  – One-Stage Architectures
  – Two-Stage Architectures

• Experiments
  – Frame accuracy
  – Word accuracy
  – Combining long-term and short-term


Introduction

• Hynek Hermansky’s group pioneered a method to capture long-term (500-1000 ms) information for phonetic classification using multi-layer perceptrons (MLPs). They developed an MLP architecture called TRAPS, which stands for “TempoRAl PatternS”.

• TRAPS perform about as well as more conventional ASR systems using short-term features, and improve word error rates when used in combination with these short-term features.

• This paper works on improving the TRAPS architecture in the context of TIMIT phoneme recognition. It proposes Hidden Activation TRAPS (HATS), which differ from TRAPS in that HATS use the hidden activations of the critical-band MLPs, instead of their outputs, as inputs to the “merger” MLP.


Log-Critical Band Energies (LCBE) (1/2)

Conventional Feature Extraction


Log-Critical Band Energies (LCBE) (2/2)

TRAPS/HATS Feature Extraction


MLP Architectures

• One-Stage Approach (unconstrained)
  – 15 Bands x 51 Frames

• Two-Stage Approach (constrained)
  – Linear Approaches
    • PCA40
    • LDA40
  – Non-Linear Approaches
    • TRAPS
    • HATS


One Stage Approach (1/3)

• The paper uses LCBEs calculated every 10 ms on 8 kHz sampled speech, giving a total of 15 Bark-scale-spaced LCBEs. These are mean- and variance-normalized per utterance.

• 51 frames of all 15 bands of LCBEs are used as inputs to an MLP. These inputs are built by stacking the 25 frames before and the 25 frames after the current frame together with the current frame, and the target phoneme comes from the current frame.

• The network is trained with output targets that are “1.0” for the class associated with the current frame, and “0” for all others. The MLPs are trained on 46 phoneme targets, and consist of a single hidden layer with sigmoidal nonlinearity and an output layer with softmax nonlinearity.

• Baseline system: “15 Bands x 51 Frames” (unconstrained)
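As a rough illustration of this input construction (not the paper's code; `stack_context` is a hypothetical helper), the 51-frame stacking can be sketched in NumPy:

```python
import numpy as np

def stack_context(lcbe, left=25, right=25):
    """Stack `left` past and `right` future frames around each current frame.

    lcbe: (T, B) array of log-critical-band energies (B = 15 bands).
    Returns a (T, (left + 1 + right) * B) matrix; utterance edges are
    zero-padded so every frame gets a full window.
    """
    T, B = lcbe.shape
    padded = np.pad(lcbe, ((left, right), (0, 0)))  # zero-pad the time axis
    return np.stack(
        [padded[t:t + left + 1 + right].ravel() for t in range(T)]
    )

# 100 frames of 15 bands -> each input vector has 51 * 15 = 765 dimensions
x = stack_context(np.random.default_rng(0).standard_normal((100, 15)))
```

The target for each stacked vector would be the phone label of its center frame, as the slide describes.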


One Stage Approach (2/3)

• Softmax nonlinearity
  – If the outputs of a network are to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs. Let the net input to each output unit be q_i, i = 1, ..., c, where c is the number of categories. Then the softmax output p_i is:

    p_i = exp(q_i) / Σ_{j=1}^{c} exp(q_j)

• Sigmoidal nonlinearity

    f(x) = 1 / (1 + e^(-a(x - c)))

  [Figure: sigmoid membership function (sigmf, P = [2 4]) plotted over x in [0, 10], output in [0, 1]]
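A minimal NumPy sketch of the two nonlinearities above (the function names are illustrative; a = 2 and c = 4 mirror the sigmf P = [2 4] setting from the figure):

```python
import numpy as np

def softmax(q):
    """p_i = exp(q_i) / sum_j exp(q_j), shifted by max(q) for numerical stability."""
    e = np.exp(q - np.max(q))
    return e / e.sum()

def sigmoid(x, a=2.0, c=4.0):
    """f(x) = 1 / (1 + exp(-a * (x - c)))."""
    return 1.0 / (1.0 + np.exp(-a * (x - c)))

p = softmax(np.array([1.0, 2.0, 3.0]))
# p is positive and sums to one, so it can be read as a posterior distribution
```

The shift by max(q) does not change the result but avoids overflow, which is standard practice when softmax outputs are used as posteriors.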


One Stage Approach (3/3)


Two Stage Approach

• They developed an MLP architecture called TRAPS, which stands for “TempoRAl PatternS”. The TRAPS system consists of two stages of MLPs:

  1. Critical-band MLPs learn phone probabilities posterior on the input, which is a set of consecutive frames of LCBEs (an LCBE trajectory).

  2. A “merger” MLP merges the outputs of each of these individual critical-band MLPs, resulting in overall phone posterior probabilities.

• Correlations among individual frames of LCBEs from different frequency bands are not directly modeled; instead, correlations among long-term LCBE trajectories from different frequency bands are modeled.
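A toy forward-pass sketch of this two-stage idea, with randomly initialized weights purely for shape-checking (all names are illustrative, though 15 bands, 51 frames, and 46 phone targets follow the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, w1, w2):
    """One hidden layer: sigmoid hidden units, softmax output units."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1)))        # hidden activations
    z = h @ w2
    e = np.exp(z - z.max())
    return h, e / e.sum()                      # (hidden, posteriors)

n_bands, n_frames, n_hidden, n_phones = 15, 51, 20, 46

# Stage 1: one small MLP per critical band, fed that band's 51-frame trajectory.
band_nets = [(rng.standard_normal((n_frames, n_hidden)),
              rng.standard_normal((n_hidden, n_phones)))
             for _ in range(n_bands)]

def stage1_features(lcbe_window, use_hidden):
    """lcbe_window: (51, 15). TRAPS forwards each band's posteriors to the
    merger MLP; HATS (use_hidden=True) forwards the hidden activations instead."""
    feats = []
    for b, (w1, w2) in enumerate(band_nets):
        h, p = mlp_forward(lcbe_window[:, b], w1, w2)
        feats.append(h if use_hidden else p)
    return np.concatenate(feats)

hats_in = stage1_features(rng.standard_normal((51, 15)), use_hidden=True)
```

The merger MLP (stage 2) would then map this concatenated vector to the overall phone posteriors; the only structural difference between TRAPS and HATS is which per-band vector gets concatenated.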


Linear Approaches (1/2)

• Feature
  – The paper calculates a PCA transform over the successive 51 frames of each of the 15 individual LCBE bands, resulting in a 51 x 51 transform matrix for each of the 15 bands.

  – This transform is then used to orthogonalize the temporal trajectory in each band, retaining only the top 40 features per band.

  – Finally, these transformed features are used as input to an MLP.
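The per-band PCA step might be sketched as follows (a hypothetical `pca_transform` helper, not the paper's implementation):

```python
import numpy as np

def pca_transform(trajectories, k=40):
    """Fit PCA on one band's 51-frame trajectories and keep the top k components.

    trajectories: (N, 51) matrix, one row per training frame for this band.
    Returns the (51, k) projection matrix and the (N, k) projected data.
    """
    centered = trajectories - trajectories.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = vt[:k].T                      # (51, k) top-k principal directions
    return proj, centered @ proj

rng = np.random.default_rng(0)
proj, feats = pca_transform(rng.standard_normal((1000, 51)), k=40)
```

In the paper's setup this would be repeated 15 times, once per band, giving 15 separate 51 x 51 transforms of which only the top 40 rows are kept.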


Linear Approaches (2/2)


Non-Linear Approaches (1/2)


Non-Linear Approaches (2/2)


Experimental Setup

• Training : ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular.

• Testing: 2001 Hub-5 Evaluation Set (Eval2001)
  – A large-vocabulary conversational telephone speech test set

  – 2,255,609 frames and 62,890 words

• Back-end recognizer: SRI’s Decipher System. 1st pass decoding using a bigram language model and within-word triphone acoustic models.


Frame Accuracy (1/2)

• Classification is deemed correct when the highest output of the MLP corresponds to the correct phoneme label.

• The baseline for comparison is a conventional intermediate-temporal-context MLP that uses 9 frames of per-side normalized (mean, variance, vocal tract length) PLP plus deltas and double deltas as inputs (PLP 9 Frames).

• Results
  – With the exception of the TRAPS system, all of the two-stage systems do better than this baseline.

  – HATS Before Sigmoid and TRAPS Before Softmax perform comparably at 65.80% and 65.85% respectively, while the PCA and LDA approaches perform similarly at 65.50% and 65.52% respectively.


Frame Accuracy (2/2)


Word Error Rates (1/2)

• System (46D)
  – The experiment takes the log of the outputs from the MLPs and then decorrelates the features via PCA.

  – Mean and variance normalization is then applied to these transformed outputs.
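A sketch of this log + PCA + normalization pipeline (a hypothetical helper, not the paper's code; keeping all 46 dimensions as the 46D label suggests):

```python
import numpy as np

def tandem_features(posteriors, k=46, eps=1e-12):
    """Log, PCA-decorrelate, then mean/variance-normalize MLP outputs.

    posteriors: (N, C) matrix of per-frame MLP phone posteriors.
    """
    logp = np.log(posteriors + eps)          # log of the MLP outputs
    centered = logp - logp.mean(axis=0)
    # PCA via SVD: rotate onto the top-k principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    feats = centered @ vt[:k].T
    # Mean and variance normalization of the transformed outputs.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
```

With k equal to the number of phone classes, PCA here is a pure decorrelating rotation rather than a dimensionality reduction.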

• Results
  – HATS always ranks first when compared to all other long-temporal systems, achieving a 7.29% relative improvement over the baseline.

  – TRAPS does not provide an improvement over the baseline, but all of the other approaches do. The final softmax nonlinearity in the critical-band MLPs is the only difference between TRAPS and TRAPS Before Softmax, so including this nonlinearity during recognition causes the performance degradation. It is likely that the softmax’s output normalization obscures useful information that the second-stage MLP needs.


Word Error Rates (2/2)


Combine Long-Term with Short-Term (1/3)

• SRI’s EARS Rich Transcription 2003 front-end features (short-term, 39D): the baseline HLDA(PLP+3d) feature
  1. 12th-order PLP plus the first three orders of deltas,
  2. mean, variance, and vocal tract length normalized,
  3. transformed by heteroskedastic linear discriminant analysis (HLDA), keeping the top 39 features.


Combine Long-Term with Short-Term (2/3)

• Methods (64D)
  – Append the top 25 dimensions after PCA on each of the temporal features to the baseline HLDA(PLP+3d) features.

  – PLP 9 Frames

  – Combine the HATS and PLP 9 Frames systems using an inverse-entropy weighting method, take the log followed by PCA down to 25 dimensions, and append the result to the HLDA(PLP+3d) features, yielding “Inv Entropy Combo HATS+PLP 9 Frames”.
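The inverse-entropy combination can be sketched as below. This follows the general idea of weighting each stream by the inverse entropy of its posteriors (Misra's exact rule may differ in detail; the function and values are illustrative):

```python
import numpy as np

def inverse_entropy_combine(posterior_streams, eps=1e-12):
    """Weight each stream's posteriors by the inverse of its entropy.

    Confident (low-entropy) streams get larger weights; the combined
    vector is renormalized to sum to one.
    """
    entropies = np.array([-(p * np.log(p + eps)).sum()
                          for p in posterior_streams])
    weights = 1.0 / (entropies + eps)
    weights = weights / weights.sum()
    combo = sum(w * p for w, p in zip(weights, posterior_streams))
    return combo / combo.sum()

# A confident HATS frame and a flatter PLP frame over 3 toy classes.
hats = np.array([0.7, 0.2, 0.1])
plp = np.array([0.4, 0.3, 0.3])
combined = inverse_entropy_combine([hats, plp])
```

Because the HATS posterior has lower entropy, the combined distribution is pulled toward it rather than toward a plain average of the two streams.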

• Results
  – HATS improves WER by 3.23% (relative).

  – PLP 9 Frames performs the same as HATS.

  – Combining PLP 9 Frames with HATS improves WER by 8.60% (relative).


Combine Long-Term with Short-Term (3/3)


Conclusions

• Including the critical-band MLPs’ softmax during recognition causes performance degradation in TRAPS.

• Inverse-entropy weighting is a good research direction.

• Combining long-term with short-term information yields an improvement.