Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs Andrew Rosenberg Queens College / CUNY Interspeech 2013 August 26, 2013.

Prosody
- Prosody: pitch, intensity, rhythm, silence.
- Prosody carries information about a speaker's intent and identity.
- Here: prosodic recognition of speaking style, nativeness, and speaker.

Approach
- Unsupervised clustering of acoustic/prosodic features.
- Sequence modeling of cluster identities.

K-Means
- K-means is a simple distance-based clustering algorithm.
- Iterative and non-deterministic (sensitive to initialization).
- K must be specified. We evaluate K between 2 and 100.
- The optimal value is chosen by cross-validation for each task.
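
The clustering step can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm, not the implementation used in the talk; the function names, the random initialization from the data points, and the toy 2-D points are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of float tuples) into k groups.

    Returns (labels, centroids). The result depends on `seed`, which
    reflects the sensitivity to initialization noted on the slide.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k random data points
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2, seed=1)
```

In the setting described here, each point would be a per-syllable feature vector, and K would be swept from 2 to 100 with the best value picked by cross-validation.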

Dirichlet Process GMMs
- Non-parametric infinite mixture model.
- Needs a prior for the Dirichlet process and a prior over N, a zero-mean Gaussian.
- Hyperparameters still need to be set: the concentration parameter and the base distribution G_0.
- Stick-breaking and Chinese Restaurant metaphors.
- Variational inference (Blei and Jordan 2005).
- Rich get richer.
- Figure: plate notation from M. Jordan's 2005 NIPS tutorial.
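
The "rich get richer" behavior can be illustrated by simulating the Chinese Restaurant metaphor directly. The sketch below only samples cluster sizes from the process; it is not the variational inference procedure of Blei and Jordan, and the function name, the parameter name `alpha`, and the values used are assumptions for illustration.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table sizes from a Chinese Restaurant Process.

    With n customers already seated, the next customer joins existing
    table k with probability count_k / (n + alpha) and opens a new table
    with probability alpha / (n + alpha).
    """
    rng = random.Random(seed)
    table_counts = []
    for n in range(n_customers):
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, count in enumerate(table_counts):
            cumulative += count
            if r < cumulative:
                table_counts[k] += 1  # larger tables attract more customers
                break
        else:
            table_counts.append(1)  # r fell in the alpha-sized slice: new table
    return table_counts

# Hypothetical settings, chosen only to show the skewed size distribution.
sizes = chinese_restaurant_process(n_customers=1000, alpha=1.0, seed=0)
largest_share = max(sizes) / sum(sizes)
```

Because the probability of joining a table grows with its size, a few clusters typically end up holding most of the data, which is the effect the slide refers to.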

DPGMM: Rich get Richer
- Artificially omit the largest cluster.
- Figure annotation: = 0.25

Prosodic Event Distribution
- ToBI prosodic labels: pitch accents, phrase accent/boundary tones.
- Figure panels: Accent Type Distribution; Phrase Ending Distribution.

Sequence Modeling
- SRILM 3-gram model.
- Backoff and Good-Turing (GT) smoothing.
- Clusters learned over all material.
- Sequence models trained over the training sets.
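
To make the sequence-modeling step concrete, here is a small trigram model over cluster-identity sequences with a perplexity function. It is a stand-in, not SRILM: it uses add-one smoothing instead of backoff with Good-Turing discounting, and the function names, the boundary symbols, and the toy sequences are assumptions.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed sentence-boundary symbols

def train_trigram(sequences):
    """Count trigrams and their two-symbol contexts over training sequences."""
    trigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for seq in sequences:
        tokens = [BOS, BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for i in range(2, len(tokens)):
            trigrams[tuple(tokens[i - 2:i + 1])] += 1
            contexts[tuple(tokens[i - 2:i])] += 1
    return trigrams, contexts, vocab

def perplexity(model, seq):
    """Per-token perplexity of `seq` under add-one smoothed trigram estimates."""
    trigrams, contexts, vocab = model
    tokens = [BOS, BOS] + list(seq) + [EOS]
    log_prob = 0.0
    for i in range(2, len(tokens)):
        numerator = trigrams[tuple(tokens[i - 2:i + 1])] + 1
        denominator = contexts[tuple(tokens[i - 2:i])] + len(vocab)
        log_prob += math.log(numerator / denominator)
    return math.exp(-log_prob / (len(tokens) - 2))

# Toy cluster-identity sequences (hypothetical): an alternating pattern.
model = train_trigram([["a", "b", "a", "b", "a", "b"]])
ppl_similar = perplexity(model, ["a", "b", "a", "b"])
ppl_different = perplexity(model, ["b", "b", "a", "a"])
```

A sequence that follows the training pattern receives a lower perplexity than one that does not, which is the property the classification and outlier-detection experiments rely on.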

Experiments
- Tasks: speaking style, nativeness, and speaker recognition.
- Evaluation: 500 samples of between 10 and 100 syllables (~2-20 seconds).
- Representations: ToBI, K-Means, DPGMM, DPGMM (removing the largest cluster).
- 5-fold cross-validation to learn hyperparameters.
- Classification: train one SRILM model per class; classify by lowest perplexity.
- Outlier detection: train a single model; the classifier learns a perplexity threshold.
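
The two decision rules can be sketched on precomputed perplexities. The slide does not say how the threshold is learned, so the accuracy-maximizing search below is an assumption, as are the function names and all perplexity values, which are invented purely for illustration.

```python
def classify_by_perplexity(perplexities):
    """Return the class whose model gives the lowest perplexity.

    `perplexities` maps a class label to the perplexity of one test
    sequence under that class's sequence model.
    """
    return min(perplexities, key=perplexities.get)

def learn_threshold(scored):
    """Pick a perplexity threshold from labeled (perplexity, is_inlier) pairs.

    A sample is predicted to be an inlier when its perplexity is at or
    below the threshold. Assumed criterion: maximize training accuracy,
    trying each observed perplexity as a candidate.
    """
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted({p for p, _ in scored}):
        correct = sum((p <= candidate) == inlier for p, inlier in scored)
        accuracy = correct / len(scored)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold

# Hypothetical perplexities of one sample under three per-style models.
predicted = classify_by_perplexity({"READ": 12.0, "SPON": 9.5, "BN": 15.1})

# Hypothetical perplexities under a single model, with inlier labels.
threshold = learn_threshold([(5.0, True), (6.0, True), (9.0, False), (10.0, False)])
```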

Data
- Boston Directions Corpus (BDC): READ and SPONTANEOUS speech; 4 speakers (used for speaker classification).
- Boston University Radio News Corpus (BURNC): BROADCAST NEWS; 6 speakers.
- Columbia Games Corpus: SPONTANEOUS DIALOG; 13 speakers.
- Native Mandarin Chinese speakers reading BURNC stories; 4 speakers.
- All corpora are ToBI labeled.

Features
- Villing (2004) pseudosyllabification.
- Syllables with mean intensity below 10dB are considered silent.
- 7 features:
  1. Mean range-normalized intensity
  2. Mean range-normalized delta intensity
  3. Mean z-score-normalized log f0
  4. Mean z-score-normalized delta log f0
  5. Syllable duration
  6. Duration of previous silence (if any)
  7. Duration of following silence (if any)
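
One possible way to assemble the seven-dimensional vector for a syllable is sketched below. Several details are assumptions not stated on the slide: normalization statistics are computed over the whole recording, deltas are first differences taken before normalization, every frame is voiced (f0 > 0), and the function names and toy frame values are hypothetical.

```python
import math
import statistics

def range_normalize(values):
    """Map values linearly onto [0, 1] using their own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def deltas(values):
    """First differences; the first frame gets 0.0 so lengths match."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]

def syllable_vector(intensity, f0, start, end, duration, prev_sil, next_sil):
    """Build the 7 features for the syllable spanning frames [start, end)."""
    log_f0 = [math.log(f) for f in f0]  # assumes all frames are voiced
    tracks = [
        range_normalize(intensity),
        range_normalize(deltas(intensity)),
        z_normalize(log_f0),
        z_normalize(deltas(log_f0)),
    ]
    means = [statistics.fmean(track[start:end]) for track in tracks]
    return means + [duration, prev_sil, next_sil]

# Hypothetical frame-level values for a short recording.
toy_intensity = [60.0, 62.0, 70.0, 68.0, 55.0, 50.0]
toy_f0 = [100.0, 110.0, 120.0, 115.0, 105.0, 95.0]
toy_vector = syllable_vector(toy_intensity, toy_f0, start=1, end=4,
                             duration=0.15, prev_sil=0.0, next_sil=0.2)
```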

Consistency with ToBI labels
- V-Measure between ToBI accent types and clusters, and between ToBI intonational phrase-ending tones and clusters.
- K-means: solid line. DPGMM: gray line for reference (doesn't vary by more than 0.001).
- Figure panels: Accenting; Phrasing.
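
For reference, V-Measure is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls in a single cluster). The sketch below computes it from two parallel label lists; the function name and the toy labels are assumptions, and this is not the evaluation code used for the talk.

```python
import math
from collections import Counter

def v_measure(classes, clusters):
    """V-Measure between reference classes and cluster assignments."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint = Counter(zip(classes, clusters))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_class = entropy(class_counts)
    h_cluster = entropy(cluster_counts)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        m / n * math.log(m / cluster_counts[k]) for (c, k), m in joint.items())
    h_cluster_given_class = -sum(
        m / n * math.log(m / class_counts[c]) for (c, k), m in joint.items())

    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy example (hypothetical): clusters that match the accent types exactly,
# and clusters that are independent of them.
score_matched = v_measure(["H*", "H*", "L*", "L*"], [0, 0, 1, 1])
score_independent = v_measure(["H*", "H*", "L*", "L*"], [0, 1, 0, 1])
```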

Speaking Style Recognition
- 4 styles: READ, SPON, BN, DIALOG.
- Single speaker for evaluation.
- Figure panels: Classification; Outlier Detection (Dialog).

Nativeness Recognition
- Native (BURNC) vs. non-native.
- Single speaker for evaluation.
- Figure panels: Classification; Outlier Detection (Native).

Speaker Recognition
- Classification: 4 BDC speakers; 6 tasks for training, 3 for testing.
- Outlier detection: 6 BURNC speakers; detect f2b vs. others.

Conclusions
- K-means works well to represent prosodic information.
- DPGMM does not work as well out of the box. Despite being non-parametric, hyperparameter setting is still critically important.

Future Work
- Larger acoustic/prosodic feature set (requires pre-processing).
- Evaluating the universality of prosodic representations.
- Integration of K-means and DPGMM: use one to seed the other.

Thank you
[email protected]
http://speech.cs.qc.cuny.edu
