Context and Learning in Multilingual Tone and Pitch Accent Recognition
Gina-Anne Levow, University of Chicago
May 18, 2007
Transcript
Page 1: Context and Learning in Multilingual Tone and Pitch Accent Recognition

Context and Learning in Multilingual Tone and Pitch Accent Recognition

Gina-Anne Levow, University of Chicago

May 18, 2007

Page 2

Roadmap
• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in recognition
• Asides: More tones and features
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion

Page 3

Challenges: Context
• Tone and pitch accent recognition
  – Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, and discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
      – In extreme cases, a fall becomes a rise
  – Tone is relative
    • To speaker range
      – High for a male voice may be low for a female voice
    • To phrase range and other tones
      – E.g. downstep

Page 4

Challenges: Training Demands
• Tone and pitch accent recognition
  – Exploit data-intensive machine learning
    • SVMs (Thubthong 01, Levow 05, SLX05)
    • Boosted and bagged decision trees (X. Sun, 02)
    • HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …)
  – Can achieve good results with huge sample sets
    • SLX05: ~10K lab syllable samples -> > 90% accuracy
  – Training data expensive to acquire
    • Time – pitch accent labeling takes tens of times real-time
    • Money – requires skilled labelers
    • Limits investigation across domains, styles, etc.
  – Human language acquisition doesn't use labels

Page 5

Strategy: Overall
• Common model across languages
  – Common machine learning classifiers
  – Acoustic-prosodic model
    • No word label, POS, or lexical stress info
    • No explicit tone label sequence model
  – English, Mandarin Chinese, isiZulu
    • (also Cantonese)

Page 6

Strategy: Context
• Exploit contextual information
  – Features from adjacent syllables
    • Height, shape: direct, relative
  – Compensate for phrase contour
  – Analyze impact of
    • Context position, context encoding, context type
    • > 12.5% reduction in error over no context

Page 7

Data Collections: I
• English: (Ostendorf et al, 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
    • Unaccented, High, Downstepped High, Low
  – (Sun 02, Ross & Ostendorf 95)

Page 8

Data Collections: II
• Mandarin:
  – TDT2 Voice of America Mandarin Broadcast News
  – Automatically force-aligned to anchor scripts
    • Automatically segmented, pinyin pronunciation lexicon
    • Manually constructed pinyin-ARPABET mapping
    • CU Sonic – language porting
  – Tones: High, Mid-rising, Low, High falling, Neutral

Page 9

Data Collections: III
• isiZulu: (Govender et al., 2005)
  – Sentence text collected from the Web
    • Selected based on grapheme bigram variation
  – Read by a male native speaker
  – Manually aligned, syllabified
  – Tone labels assigned by a second native speaker
    • Based only on utterance text
  – Tone labels: High, Low

Page 10

Local Feature Extraction
• Uniform representation for tone and pitch accent
  – Motivated by the Pitch Target Approximation Model
    • Tone/pitch accent target exponentially approached
      – Linear target: height, slope (Xu et al, 99)
• Base features:
  – Pitch, intensity: max, mean, min, range
    • (Praat, speaker-normalized)
  – Pitch at 5 points across voiced region
  – Duration
  – Initial, final in phrase
• Slope:
  – Linear fit to last half of pitch contour

Page 11

Context Features
• Local context:
  – Extended features
    • Pitch max, mean, and adjacent points of preceding and following syllables
  – Difference features
    • Differences in
      – Pitch max, mean, mid, slope
      – Intensity max, mean
    • Between preceding/following and current syllable
• Phrasal context:
  – Compute collection-average phrase slope
  – Compute scalar pitch values, adjusted for slope
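A minimal sketch of the two context encodings, assuming each syllable is a dict of precomputed base features (the field names are hypothetical):

```python
import numpy as np

def difference_features(prev, cur, nxt):
    """Difference encoding: differences in pitch/intensity statistics
    between the current syllable and its neighbors (field names assumed)."""
    keys = ["pitch_max", "pitch_mean", "pitch_mid", "pitch_slope",
            "int_max", "int_mean"]
    return np.array([cur[k] - other[k] for other in (prev, nxt) for k in keys])

def compensate_phrase_slope(f0, times, avg_slope):
    """Phrasal context: subtract the collection-average phrase slope
    so pitch values are measured against the expected declination."""
    return np.asarray(f0, dtype=float) - avg_slope * np.asarray(times, dtype=float)
```

The extended encoding would instead concatenate the neighbors' raw pitch statistics onto the current syllable's vector.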

Page 12

Classification Experiments
• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
  – SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training/test splits
• Experiments: effects of
  – Context position: preceding, following, none, both
  – Context encoding: Extended/Difference
  – Context type: local, phrasal
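The experimental setup can be sketched with scikit-learn in place of SVMlight/LibSVM; the synthetic data here is only a stand-in for the per-syllable prosodic feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for per-syllable prosodic feature vectors with 4 tone classes
X, y = make_classification(n_samples=500, n_features=12, n_informative=8,
                           n_classes=4, random_state=0)
# 4:1 training/test split, as in the experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# Linear-kernel SVM; the multiclass task decomposes into pairwise
# (one-vs-one) binary classifiers
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```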

Page 13

Results: Local Context

Context         Mandarin Tone   English Pitch Accent   isiZulu Tone
Full            74.5%           81.3%                  75.9%
Extend PrePost  74%             80.7%                  73.8%
Extend Pre      74%             79.9%                  73.6%
Extend Post     70.5%           76.7%                  72.3%
Diffs PrePost   75.5%           80.7%                  75.8%
Diffs Pre       76.5%           79.5%                  75.5%
Diffs Post      69%             77.3%                  72.8%
Both Pre        76.5%           79.7%                  75.5%
Both Post       71.5%           77.6%                  72.5%
No context      68.5%           75.9%                  72.2%


Page 16

Discussion: Local Context
• Any context information improves over none
  – Preceding context consistently improves over no context or following-only context
• English/isiZulu: generally, more context features are better
• Mandarin: following context can degrade performance
  – Little difference between encodings (Extend vs Diffs)
• Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

Page 17

Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Non-linear slope compensation may improve results further

Page 18

Context: Summary
• Employ a common acoustic representation
  – Tone (Mandarin, isiZulu), pitch accent (English)
• SVM classifiers, linear kernel: 76%, 76%, 81%
• Local context effects:
  – Up to > 20% relative reduction in error
  – Preceding context makes the greatest contribution
    • Carryover vs anticipatory coarticulation
• Phrasal context effects:
  – Compensation for phrasal contour improves recognition

Page 19

Aside: More Tones
• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction & representation
  – 6 tones: high level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    • Linear kernel: 64%; Gaussian kernel: 68%
  – Tones 3 and 6: 50% – mutually indistinguishable (50% pairwise)
    • Human levels: no context: 50%; context: 68%
• Augment with syllable phone sequence
  – 86% accuracy: for 90% of syllables with tone 3 or 6, one tone dominates

Page 20

Aside: Voice Quality & Energy
• With Dinoj Surendran
• Assess local voice quality and energy features for tone
  – Not typically associated with tone in Mandarin/isiZulu
• Considered:
  – VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
• Useful: band energy significantly improves recognition
  – Mandarin: neutral tone
    • Supports identification of unstressed syllables
    • Spectral balance predicts stress in Dutch
  – isiZulu: band energy outperforms pitch
    • In conjunction with pitch -> ~78%

Page 21

Roadmap
• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in recognition
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion

Page 22

Strategy: Training
• Challenge:
  – Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Substantially outperform baselines
    • Can approach supervised levels

Page 23

Data Collections & Processing
• English: (as before)
  – Boston University Radio News Corpus, f2b
    • Binary: Unaccented vs accented
    • 4-way: Unaccented, High, Downstepped High, Low
• Mandarin:
  – Lab speech data: (Xu, 1999)
    • 5-syllable utterances: vary tone, focus position
      – In-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
  – 4-way: High, Mid-rising, Low, High falling
• isiZulu: (as before)
  – Read web sentences
    • 2-way: High vs low

Page 24

Semi-supervised Learning
• Approach:
  – Employ a small amount of labeled data
  – Exploit information from additional – presumably more available – unlabeled data
    • Few prior examples; several weakly supervised: (Wong et al, '05)
• Classifier:
  – Laplacian SVM (Sindhwani, Belkin & Niyogi '05)
  – Semi-supervised variant of SVM
    • Exploits unlabeled examples
  – RBF kernel, typically 6 nearest neighbors, transductive

Page 25

Experiments
• Pitch accent recognition:
  – Binary classification: Unaccented/Accented
  – 1000 instances, proportionally sampled
    • Labeled training: 200 unaccented, 100 accented
  – 80% accuracy (cf. 84% for an SVM with 15x the labeled data)
• Mandarin tone recognition:
  – 4-way classification: n(n-1)/2 binary classifiers
  – 400 instances: balanced; 160 labeled
  – Clean lab speech, in-focus: 94%
    • cf. 99% w/SVM with 1000s of training samples; 85% w/SVM with 160 training samples
  – Broadcast news: 70%
    • cf. < 50% w/SVM with 160 training samples
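Laplacian SVM has no stock scikit-learn implementation, so as a stand-in the same labeled/unlabeled setup can be sketched with the related graph-based LabelSpreading learner (synthetic data, illustrative only; the knn kernel with 6 neighbors echoes the 6-nearest-neighbor graph above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Stand-in data: 400 instances, 4 classes, as in the Mandarin experiment
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
rng = np.random.RandomState(0)
# Keep roughly 160 of 400 labels; -1 marks unlabeled instances
y_partial = np.where(rng.rand(400) < 0.4, y, -1)
# Graph-based semi-supervised learner; like Laplacian SVM, it propagates
# label information over nearby unlabeled points (transductive)
model = LabelSpreading(kernel="knn", n_neighbors=6)
model.fit(X, y_partial)
accuracy = (model.transduction_ == y).mean()
```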

Page 26

Unsupervised Learning
• Question:
  – Can we identify the tone structure of a language from the acoustic space without training?
    • Analogous to language acquisition
• Significant recent research in unsupervised clustering
  – Established approaches: k-means
  – Spectral clustering (Shi & Malik '97; Fischer & Poland 2004): asymmetric k-lines
• Little research for tone
  – Self-organizing maps (Gauthier et al, 2005)
    • Tones identified in lab speech using f0 velocities
  – Cluster-based bootstrapping (Narayanan et al, 2006)
  – Prominence clustering (Tamburini '05)

Page 27

Clustering
• Pitch accent clustering:
  – 4-way distinction: 1000 samples, proportional
  – 2-16 clusters constructed
    • Assign most frequent class label to each cluster
• Classifier:
  – Asymmetric k-lines:
    • Context-dependent kernel radii, non-spherical clusters
• > 78% accuracy:
  – 2 clusters: asymmetric k-lines best
• Context effects:
  – Vectors with preceding context vs no context: comparable
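The cluster-to-label mapping above (most frequent class per cluster) can be sketched with k-means; asymmetric k-lines has no stock implementation, and the blob data here is only a stand-in for the acoustic feature space:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data for the acoustic feature space (4 underlying classes)
X, y = make_blobs(n_samples=400, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# Assign the most frequent true class label to each cluster, then score
mapped = np.empty_like(km.labels_)
for c in range(km.n_clusters):
    members = km.labels_ == c
    mapped[members] = np.bincount(y[members]).argmax()
accuracy = (mapped == y).mean()
```

With more clusters than classes, the same majority-label mapping applies unchanged; several clusters simply map to one class.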

Page 28

Contrasting Clustering
• Contrasts:
  – Clustering:
    • 3 spectral approaches: spectral decomposition of the affinity matrix
      – Asymmetric k-lines (Fischer & Poland 2004)
      – Symmetric k-lines (Fischer & Poland 2004)
      – Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
    • K-means: standard Euclidean distance
  – # of clusters: 2-16
• Best results: > 78%
  – 2 clusters: asymmetric k-lines; > 2 clusters: k-means
  – Larger # of clusters: all approaches similar

Page 29

Contrasting Learners

Page 30

Tone Clustering: I
• Mandarin four tones
• 400 samples: balanced
• 2-phase clustering: 2-5 clusters each
• Asymmetric k-lines, k-means clustering
  – Clean read speech:
    • In-focus syllables: 87% (cf. 99% supervised)
    • In-focus and pre-focus: 77% (cf. 93% supervised)
  – Broadcast news: 57% (cf. 74% supervised)
  – K-means requires more clusters to reach the k-lines level

Page 31

Tone Structure
• First phase of clustering splits high/rising from low/falling tones by slope
• Second phase splits by pitch height

Page 32

Tone Clustering: II
• isiZulu High/Low tones
• 3225 samples: no labels
• Proportional: ~62% low, 38% high
• K-means clustering: 2 clusters
  – Read speech, web-based sentences
    • 70% accuracy (vs 76% fully supervised)

Page 33

Conclusions
• Common prosodic framework for tone and pitch accent recognition
  – Contextual modeling enhances recognition
    • Local context and broad phrase contour
    • Carryover coarticulation has a larger effect for Mandarin
  – Exploiting unlabeled examples for recognition
    • Semi- and unsupervised approaches
      – Best cases approach supervised levels with less training data
      – Exploit the acoustic structure of the tone and accent space

Page 34

Current and Future Work
• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech & tone learning
• Support for computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation & modeling
• Feature assessment
  – Band energy and intensity in tone recognition

Page 35

Thanks
• Dinoj Surendran, Siwei Wang, Yi Xu
• Natasha Govender and Etienne Barnard
• V. Sindhwani, M. Belkin & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
• This work supported by NSF Grant #0414919
• http://people.cs.uchicago.edu/~levow/tai