Slide 1
Slide 2
Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs
Andrew Rosenberg
Queens College / CUNY
Interspeech 2013
August 26, 2013
Slide 3
Prosody
Pitch, Intensity, Rhythm, Silence
Prosody carries information about a speaker's intent and identity.
Here: prosodic recognition of
  Speaking Style
  Nativeness
  Speaker
Slide 4
Approach
Unsupervised clustering of acoustic/prosodic features.
Sequence modeling of cluster identities.
Slide 5
K-Means
K-means is a simple distance-based clustering algorithm.
Iterative, non-deterministic (sensitive to initialization).
Must specify K. We evaluate K between 2 and 100.
Optimal value chosen by cross-validation for each task.
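The assign-and-update loop behind K-means can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the function name `kmeans`, the squared Euclidean distance, and the random-sample initialization are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-means sketch. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Random initialization: different seeds can give different clusterings.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels
```

Choosing K would then mean running this for each candidate value (2 to 100 in the talk) and keeping the one that scores best on held-out folds for the task at hand.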
Slide 6
Dirichlet Process GMMs
Non-parametric infinite mixture model
Need a prior for the Dirichlet process and a prior over N (a zero-mean Gaussian)
Still need to set the hyperparameters: the concentration parameter and the base distribution G_0
Stick-breaking & Chinese Restaurant metaphors
Variational inference (Blei and Jordan 2005)
Rich get richer
Plate notation from M. Jordan 2005 NIPS tutorial
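The Chinese Restaurant metaphor and its "rich get richer" behavior can be simulated directly. The sketch below is only an illustration of the seating rule, not the variational inference used in the talk; the function name and the parameter name `alpha` (standing for the concentration parameter) are assumptions.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time; return the list of table sizes."""
    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        # With n customers already seated, the next one joins table k with
        # probability size_k / (n + alpha) and opens a new table with
        # probability alpha / (n + alpha). Large tables attract more customers.
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[k] += 1
                break
        else:
            table_sizes.append(1)
    return table_sizes
```

Running it with a small `alpha` typically yields a few very large tables and a tail of tiny ones, which is the kind of skew that makes one cluster dominate.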
Slide 7
DPGMM
Rich get richer
Artificially omit the largest cluster
= 0.25
Slide 8
Prosodic Event Distribution
ToBI Prosodic Labels: Pitch Accents, Phrase Accent/Boundary Tones
Figure panels: Accent Type Distribution; Phrase Ending Distribution
Slide 9
Sequence Modeling
SRILM 3-gram model
Backoff & Good-Turing (GT) smoothing
Clusters learned over all material
Sequence models trained over train sets
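To make the idea of a 3-gram model over cluster identities concrete, here is a small stand-alone sketch. It deliberately swaps in add-one smoothing instead of SRILM's backoff with Good-Turing discounting, so its numbers will not match SRILM; the class name `TrigramModel` and the padding symbols are assumptions for the example.

```python
import math
from collections import Counter

class TrigramModel:
    """Trigram model over cluster IDs with add-one smoothing (not SRILM's GT backoff)."""

    def __init__(self, sequences, n_clusters):
        self.n_outcomes = n_clusters + 1  # cluster IDs plus the end marker
        self.trigrams = Counter()
        self.contexts = Counter()
        for seq in sequences:
            padded = self._pad(seq)
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1

    @staticmethod
    def _pad(seq):
        return ["<s>", "<s>"] + list(seq) + ["</s>"]

    def log_prob(self, context, token):
        num = self.trigrams[context + (token,)] + 1
        den = self.contexts[context] + self.n_outcomes
        return math.log(num / den)

    def perplexity(self, seq):
        padded = self._pad(seq)
        total = sum(
            self.log_prob((padded[i - 2], padded[i - 1]), padded[i])
            for i in range(2, len(padded))
        )
        # Perplexity is exp of the negative mean log-probability per token.
        return math.exp(-total / (len(padded) - 2))
```

A sequence that resembles the training material gets a lower perplexity than one that does not, which is the quantity the later classification and outlier-detection experiments rely on.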
Slide 10
Experiments
Speaking Style, Nativeness, Speaker Recognition
Evaluation
  500 samples between 10-100 syllables (~2-20 seconds)
  ToBI, K-Means, DPGMM, DPGMM (removing the largest cluster)
  5-fold cross-validation to learn hyperparameters
Classification
  Train one SRILM model per class. Classify by lowest perplexity.
Outlier Detection
  Train a single model. Classifier learns a perplexity threshold.
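The two decision rules can be sketched as follows, taking perplexities as given inputs. How the threshold classifier was actually trained is not stated on the slide, so the exhaustive search in `fit_threshold` is an assumption, as are both function names.

```python
def classify(perplexities):
    """Return the class whose model assigns the sample the lowest perplexity.

    perplexities: dict mapping class name -> perplexity under that class's model.
    """
    return min(perplexities, key=perplexities.get)

def fit_threshold(target_ppl, other_ppl):
    """Pick the perplexity threshold with the best training accuracy.

    Samples at or below the threshold are treated as belonging to the
    target class; samples above it are treated as outliers.
    Returns (threshold, accuracy).
    """
    total = len(target_ppl) + len(other_ppl)
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(target_ppl) | set(other_ppl)):
        correct = sum(p <= t for p in target_ppl) + sum(p > t for p in other_ppl)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy
```

For example, `classify({"READ": 12.0, "SPON": 9.5, "BN": 20.1})` picks `"SPON"`, the class with the lowest perplexity.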
Slide 11
Data
Boston Directions Corpus: READ, SPONTANEOUS; 4 speakers (used for Speaker Classification)
Boston University Radio News Corpus: BROADCAST NEWS; 6 speakers
Columbia Games Corpus: SPONTANEOUS DIALOG; 13 speakers
Native Mandarin Chinese speakers reading BURNC stories: 4 speakers
All ToBI labeled
Slide 12
Features
Villing (2004) pseudosyllabification
Syllables with mean intensity below 10 dB are considered silent
7 Features
  Mean range-normalized intensity
  Mean range-normalized delta intensity
  Mean z-score-normalized log f0
  Mean z-score-normalized delta log f0
  Syllable duration
  Duration of previous silence (if any)
  Duration of following silence (if any)
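The normalizations named in the feature list can be sketched as below. The slide does not say over which span the statistics are computed or how deltas are taken, so normalizing over all values passed in and using a first difference are assumptions, as are the function names.

```python
import math

def range_normalize(values):
    """Map values linearly onto [0, 1] using their min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def delta(values):
    """First difference between consecutive frames; the first frame gets 0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]

def log_f0(f0_hz):
    """Natural log of voiced f0 values (all inputs must be positive)."""
    return [math.log(f) for f in f0_hz]
```

A syllable-level feature such as "mean z-score-normalized log f0" would then be the average of `z_score(log_f0(...))` over the frames that fall inside the syllable.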
Slide 13
Consistency with ToBI labels
V-Measure between
  ToBI Accent Types and clusters
  ToBI Intonational Phrase-ending Tones and clusters
K-means: solid line
DPGMM: gray line for reference (doesn't vary by more than 0.001)
Figure panels: Accenting; Phrasing
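V-measure is the harmonic mean of homogeneity (each cluster contains mostly one label) and completeness (each label falls mostly in one cluster), both defined through conditional entropy. A minimal stand-alone sketch, with the function name and the `beta` weighting parameter chosen for this example:

```python
import math
from collections import Counter

def v_measure(classes, clusters, beta=1.0):
    """V-measure between reference labels and cluster assignments."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint = Counter(zip(classes, clusters))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_c = entropy(class_counts)
    h_k = entropy(cluster_counts)
    # Conditional entropies H(C|K) and H(K|C) from the joint counts.
    h_c_given_k = -sum(v / n * math.log(v / cluster_counts[k])
                       for (c, k), v in joint.items())
    h_k_given_c = -sum(v / n * math.log(v / class_counts[c])
                       for (c, k), v in joint.items())

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
```

Clusters that line up exactly with the reference labels score 1, while putting every syllable in a single cluster scores 0 whenever the reference has more than one label.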
Slide 14
Speaking Style Recognition
4 styles: READ, SPON, BN, DIALOG
Single speaker for evaluation.
Figure panels: Classification; Outlier Detection - Dialog
Slide 15
Nativeness Recognition
Native (BURNC) vs. Non-Native
Single speaker for evaluation.
Figure panels: Classification; Outlier Detection - Native
Slide 16
Speaker Recognition
Classification: 4 BDC speakers; 6 tasks for training, 3 for testing
Outlier Detection: 6 BURNC speakers; detect f2b vs. others
Slide 17
Conclusions
K-means works well to represent prosodic information.
DPGMM does not work so well out of the box.
  Despite being non-parametric, hyperparameter setting is still critically important.
Future Work
  Larger acoustic/prosodic feature set (requires pre-processing).
  Evaluating the universality of prosodic representations.
  Integration of K-means and DPGMM: use one to seed the other.