Page 1: Voicing Features


Voicing Features

Horacio Franco, Martin GraciarenaAndreas Stolcke, Dimitra Vergyri, Jing Zheng

STAR Lab. SRI International

Page 2: Voicing Features


Phonetically Motivated Features

• Problem:

– Cepstral coefficients fail to capture many discriminative cues.

– Front-end optimized for traditional Mel cepstral features.

– Front-end parameters are a compromise solution for all phones.

Page 3: Voicing Features


Phonetically Motivated Features

• Proposal:

– Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends.

– Optimize each specific front-end to improve discrimination.

– Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding.

– General framework for multiple phonetic features. First approach: voicing features.

Page 4: Voicing Features

Voicing Features

• Voicing feature algorithms:

1. Normalized peak autocorrelation (PA). For time frame X:

   PA = max_i { Rxx(i) } / Rxx(0),   where Rxx(i) = E[ X(t) X(t+i) ]

   The maximum is computed over lags in the pitch region 80 Hz to 450 Hz.

2. Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES).

   If LSPEC = |DFT(X)|^2 and CEPS = IDFT( log(LSPEC) ),

   and the entropy H of a spectral vector Y with normalized power

   P(Y(f)) = |Y(f)|^2 / Σ_f |Y(f)|^2

   is

   H(Y) = − Σ_f P(Y(f)) log( P(Y(f)) ),

   then EC = H(CEPS) and ES = H(LSPEC).

   The entropy is computed in the pitch region 80 Hz to 450 Hz.
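The peak-autocorrelation and spectral-entropy measures on this slide can be sketched in a few lines of numpy. This is a hypothetical illustration, not SRI's implementation: the frame length, the restriction of the entropy to the 80–450 Hz band of the linear spectrum, and the 1e-12 log floor are all assumptions.

```python
import numpy as np

def peak_autocorr(frame, fs, fmin=80.0, fmax=450.0):
    """Normalized peak autocorrelation PA = max_i Rxx(i) / Rxx(0),
    with the max taken over lags in the pitch region [fmin, fmax]."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range for 450..80 Hz
    return r[lo:hi + 1].max() / r[0]

def spectral_entropy(frame, fs, fmin=80.0, fmax=450.0):
    """Entropy of the normalized linear power spectrum (ES), restricted
    to the pitch region; voiced frames concentrate energy at harmonics,
    giving low entropy."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    band = power[(freqs >= fmin) & (freqs <= fmax)]
    p = band / band.sum()
    return -np.sum(p * np.log(p + 1e-12))
```

On a synthetic pure tone in the pitch band these give PA near 1 and entropy near 0, while white noise gives a low PA and high entropy, which is exactly the voiced/unvoiced contrast the features are meant to capture.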

Page 5: Voicing Features

Voicing Features

3. Correlation with template and DP alignment [Arcienega, ICSLP'02].

   The Discrete Logarithm Fourier Transform (DLFT) evaluates the spectrum of the speech signal x(n) at N logarithmically spaced frequencies f_i in the band [f1, f2]:

   DLFT(x)(f_i) = (1/N) Σ_n x(n) w(n) e^{−j w_i n},

   where w(n) is an analysis window, w_i is the angular frequency corresponding to f_i, and

   ln f_i = ln f1 + i · dlnf,   dlnf = ( ln f2 − ln f1 ) / N.

   If IT is an impulse train, the template is T = |DLFT(IT)|^2 and the signal spectrum is Y = |DLFT(X)|^2.

   The correlation of frame j with the template at shift i is

   Ryt(i, j) = E[ Y(f, j) T(f, i) ],

   and the DP-optimal correlation is

   CT = max_i { Ryt_DP(i, j) / Ryt(0, j) },

   with the maximum computed in the pitch region 80 Hz to 450 Hz.
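The DLFT amounts to a direct DFT evaluated on a logarithmic frequency grid; a minimal numpy sketch follows. The grid size, Hamming window, and 1/N normalization here are illustrative assumptions, not values from the slide.

```python
import numpy as np

def dlft(x, fs, f1=80.0, f2=450.0, nbins=64):
    """Evaluate the Fourier transform of x on nbins logarithmically
    spaced frequencies in [f1, f2] (the pitch search band)."""
    n = np.arange(len(x))
    w = np.hamming(len(x))                        # analysis window w(n)
    freqs = np.exp(np.linspace(np.log(f1), np.log(f2), nbins))
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / fs)
    return freqs, basis @ (w * x) / len(x)

# Y = |DLFT(X)|^2 peaks near the fundamental of a voiced frame; a
# template |DLFT(impulse train)|^2 shifted along this log-frequency
# axis can then be correlated against Y, as in the slide.
```

On the log axis a change of pitch becomes a shift, which is what makes the template correlation with DP alignment natural in this domain.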

Page 6: Voicing Features

Voicing Features

• Preliminary exploration of voicing features:

– Best feature combination: peak autocorrelation + entropy of cepstrum.

– The autocorrelation and entropy features behave complementarily for high and low pitch:

   Low pitch: time periods are well separated, so the correlation is well defined.

   High pitch: harmonics are well separated, so the cepstrum is well defined.

Page 7: Voicing Features

Voicing Features

• Graph of voicing features:

[Figure: voicing feature trajectories over the phone sequence "w er k ay n d ax f s: aw th ax v dh ey ax r"]

Page 8: Voicing Features

Voicing Features

• Integration of voicing features:

1 – Juxtaposing voicing features:

• Append the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD).

• Voicing feature front end: use the same frame rate as the MFCCs and optimize the temporal window duration.
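Juxtaposition here is simply per-frame concatenation of the two streams; a sketch with assumed dimensions (a 39-dim MFCC+D+DD vector and the 2 voicing features, random values standing in for real frames):

```python
import numpy as np

T = 100                                  # number of frames (illustrative)
mfcc_ddd = np.random.randn(T, 39)        # MFCC+D+DD, 13 * 3 dims (assumed)
voicing = np.random.randn(T, 2)          # [PA, EC] per frame

# The voicing front end runs at the same frame rate as the MFCCs, so
# the two streams align frame-by-frame and can be concatenated:
feat = np.hstack([mfcc_ddd, voicing])    # shape (T, 41)
```

Because both front ends share the 10 ms frame rate, no resampling or interpolation is needed; only the voicing front end's analysis window length is tuned independently, as on the next slide.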

Page 9: Voicing Features

Voicing Features

• Train on the small Switchboard database (64 hours); test on dev2001. WER reported for both sexes.

• Features: MFCC+D+DD, 25.6 msec frames every 10 msec.

• VTL and speaker mean and variance normalization. Genone acoustic model: non-cross-word, MLE trained, gender dependent. Bigram LM.

Window Length Optimization            WER

Baseline                              41.4%

Baseline + 2 voicing (25.6 msec)      41.2%

Baseline + 2 voicing (75 msec)        40.7%

Baseline + 2 voicing (87.5 msec)      40.5%

Baseline + 2 voicing (100 msec)       40.4%

Baseline + 2 voicing (112.5 msec)     41.2%

Page 10: Voicing Features

Voicing Features

2 – Voiced/Unvoiced posterior features:

• Use the posterior voicing probability as a feature, computed from a 2-state HMM. The juxtaposed feature dimension is 40.

• Similar setup as before; males-only results.

• Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.

Recognition Systems              WER

Baseline                         39.2 %

Baseline + voicing posterior     39.7 %
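The per-frame voicing posterior from a two-state (V/UV) HMM is the standard forward–backward state posterior; a generic log-domain sketch follows. The transition matrix, initial probabilities, and emission log-likelihoods in the test are made-up illustrations, not SRI's trained values.

```python
import numpy as np

def state_posteriors(loglik, logA, logpi):
    """Forward-backward posteriors P(state_t = s | all frames) for an
    HMM with log transition matrix logA and log initial probs logpi.
    loglik[t, s] is the emission log-likelihood of frame t in state s."""
    T, S = loglik.shape
    alpha = np.empty((T, S))
    beta = np.empty((T, S))
    alpha[0] = logpi + loglik[0]
    for t in range(1, T):                   # forward pass
        alpha[t] = loglik[t] + np.logaddexp.reduce(
            alpha[t - 1][:, None] + logA, axis=0)
    beta[-1] = 0.0
    for t in range(T - 2, -1, -1):          # backward pass
        beta[t] = np.logaddexp.reduce(
            logA + (loglik[t + 1] + beta[t + 1])[None, :], axis=1)
    post = alpha + beta                     # unnormalized log posteriors
    post -= np.logaddexp.reduce(post, axis=1, keepdims=True)
    return np.exp(post)                     # post[t, 0] = P(voiced | obs)
```

With sticky transitions and confident emissions, the posterior saturates near 0 or 1 on most frames, which is consistent with the slide's observation that it behaves much like a binary feature and misses soft V/UV transitions.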

Page 11: Voicing Features

Voicing Features

3 – Window of voicing features + HLDA:

• Juxtapose the MFCC features and a window of voicing features around the current frame.

• Apply dimensionality reduction with HLDA. The final feature has 39 dimensions.

• Same setup as before, with MFCC+D+DD+3rd diffs. Both sexes.

• The HLDA baseline is 1.5% absolute better than before; voicing improves it by a further 1% absolute.

Recognition Systems                       WER %

Baseline + HLDA                           39.9

Baseline + 1 frame, 2 voicing + HLDA      39.5

Baseline + 5 frames, 2 voicing + HLDA     38.9

Baseline + 9 frames, 2 voicing + HLDA     39.5

Page 12: Voicing Features

Voicing Features

4 – Delta of voicing features + HLDA:

• Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature.

• Same setup as before, MFCC+D+DD+3rd diffs. Males only.

• The likely reason is that variability in the voicing features produces noisy deltas.

• The HLDA weighting of the "window of voicing features" is similar to an average.

----------------------------------------------------------------------------------

The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.

Recognition Systems                           WER

Baseline + HLDA                               37.5 %

Baseline + voicing + delta voicing + HLDA     37.6 %
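The deltas in question are the standard regression differences over neighboring frames; a sketch follows (the half-width w=2 and edge-replication padding are common choices, assumed here rather than taken from the slide):

```python
import numpy as np

def deltas(feat, w=2):
    """Regression deltas d_t = sum_k k (x_{t+k} - x_{t-k}) / (2 sum_k k^2),
    computed over a window of half-width w, with edge frames replicated."""
    T = len(feat)
    pad = np.pad(feat, ((w, w), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    return sum(k * (pad[w + k:w + k + T] - pad[w - k:w - k + T])
               for k in range(1, w + 1)) / denom

# Delta-deltas are deltas of the deltas: dd = deltas(deltas(v)).
```

Frame-to-frame jitter in the voicing measures is amplified by this differencing, which is consistent with the slide's explanation for why the delta variant did not help.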

Page 13: Voicing Features

Voicing Features

• Voicing features in the SRI CTS Eval Sept '03 system:

• Adaptation of MMIE cross-word models with/without voicing features.

• Used the best configuration of voicing features.

• Train on full SWBD+CTRANS data; test on EVAL'02.

• Features: MFCC+D+DD+3rd diffs.+HLDA.

• Adaptation: 9 full-matrix MLLR transforms.

• Adaptation hypotheses from: MLE non-cross-word model, PLP front end with voicing features.

Recognition Systems            WER

Baseline EVAL                  25.6 %

Baseline EVAL + voicing        25.1 %

Page 14: Voicing Features

Voicing Features

• Hypothesis examples:

REF:          OH REALLY WHAT WHAT KIND OF PAPER
HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER
HYP VOICING:  OH REALLY WHAT WHAT KIND OF PAPER

REF:          YOU KNOW HE S JUST SO UNHAPPY
HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY
HYP VOICING:  YOU KNOW HE S JUST SO I WANT HAPPY

Page 15: Voicing Features

Voicing Features

• Error analysis:

– In one experiment, 54% of speakers got a WER reduction (some up to 4% absolute); the remaining 46% showed a small WER increase.

– A more detailed study of speaker-dependent performance is still needed.

• Implementation:

– Implemented a voicing feature engine in the DECIPHER system.

– Fast computation: one FFT and two IFFTs per frame for both voicing features.

Page 16: Voicing Features

Voicing Features

• Conclusions:

– Explored how to represent and integrate the voicing features for best performance.

– Achieved a 1% absolute (~2% relative) gain in the first pass (using the small training set), and a >0.5% absolute (2% relative) gain in higher rescoring passes (using the full training set) of the DECIPHER LVCSR system.

• Future work:

– Further explore feature combination/selection.

– Develop more reliable voicing features, since the features do not always reflect actual voicing activity.

– Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).

Page 17: Voicing Features
