IBM Labs in Haifa © 2007 IBM Corporation
SSW-6, Bonn, August 23rd, 2007
Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System
Slava Shechtman
IBM Haifa Research Laboratory
Outline
CART intonation modeling
Maximum Likelihood Dynamic intonation model
  Dynamic observations
  Maximum-likelihood solution
Microprosody preservation technique
Implementation and preliminary results
Future research directions
CART prosody modeling
[Diagram: a speech corpus with pitch data, together with language data (semantic data, syntactic data, syllable location, phonetic context), is used to grow a pitch tree and a duration tree.]
Basic CART intonation model
Rough, but simple and automatic
Extract semantic, syntactic and phonetic features from the TTS front-end (per syllable)
  POS, word stress, syllable stress
  Sentence type, phrase type
  Syllable location
  Phonetic context
3 log-pitch observations per syllable (in a sonorant part of the syllable)
Mean pitch values are associated with tree leaves to represent the target intonation (implicit i.i.d. assumption)
[Figure: decision tree with question nodes Q1, Q2, Q3.]
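The leaf statistics of the basic model can be sketched as follows. This is a minimal illustration, assuming the tree has already assigned every training syllable to a leaf; the leaf identifiers, the function name and the sample values are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def leaf_mean_log_pitch(samples):
    """Average the 3 log-pitch observations of all syllables in each leaf.

    `samples` is a list of (leaf_id, [p_start, p_mid, p_end]) pairs. The leaf
    ids stand in for the output of an already-grown CART tree (an assumption).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for leaf_id, obs in samples:
        for k in range(3):
            sums[leaf_id][k] += obs[k]
        counts[leaf_id] += 1
    return {leaf: [s / counts[leaf] for s in sums[leaf]] for leaf in sums}

# Hypothetical training syllables: (leaf id, three log-pitch values).
training = [
    ("leaf_a", [5.0, 5.2, 5.1]),
    ("leaf_a", [5.2, 5.4, 5.3]),
    ("leaf_b", [4.8, 4.7, 4.6]),
]
leaf_means = leaf_mean_log_pitch(training)
```

At synthesis time the mean vector of the leaf reached by a syllable would serve as its target log-pitch, which is where the implicit i.i.d. assumption enters.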
Basic application of CART intonation model
Use mean log-pitch values to estimate target pitch for concatenated segments
Use the distance from the target pitch as an additive cost term in the overall segment selection cost
Optionally, use the above target pitch curve for speech synthesis (after smoothing and/or combination with the actual pitch from the selected segments)
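A minimal sketch of the additive pitch cost, assuming an absolute log-pitch distance and a single weight. The distance measure, the weight `pitch_weight` and the candidate data are illustrative assumptions; the slides only state that the distance enters the selection cost additively.

```python
def pitch_target_cost(segment_log_pitch, target_log_pitch):
    """Absolute log-pitch distance between a candidate segment and the target."""
    return abs(segment_log_pitch - target_log_pitch)

def selection_cost(other_costs, segment_log_pitch, target_log_pitch, pitch_weight=1.0):
    """Add the weighted pitch-target cost to the remaining selection costs."""
    return other_costs + pitch_weight * pitch_target_cost(
        segment_log_pitch, target_log_pitch
    )

# Hypothetical candidates for one target position.
candidates = [
    {"name": "seg_1", "log_pitch": 5.30, "other_costs": 0.20},
    {"name": "seg_2", "log_pitch": 5.05, "other_costs": 0.25},
]
target = 5.10
best = min(
    candidates,
    key=lambda c: selection_cost(c["other_costs"], c["log_pitch"], target),
)
```

Here `seg_2` wins despite its higher non-pitch cost, because it lies much closer to the target pitch.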
Maximum Likelihood Dynamic intonation model
Model cross-syllable dynamic observations as well as intra-syllable observations
Maximum Likelihood solution, based on the HMM synthesis approach (Tokuda et al.)
  Convenient framework for combining both instantaneous and differential observations in order to obtain the most likely smooth parameter contour for a given clustering
May be applied over the regular CART trees
[Figure: sequence labeled S1, S2, S3.]
Dynamic features for CART intonation modeling
Extend the static observation vector $\mathbf{P}_1(n)$ of the n-th syllable
  Add four time-normalized differences of static observations
  Guarantee a non-zero time interval between the observation instances
New observation vector: $\mathbf{P}_2(n) = \mathbf{W}(n)\,\mathbf{P}_1(n)$
[Figure: pairs of observation points for difference calculation, drawn on a time axis t over the static observations $\mathbf{P}_1(n-1)$, $\mathbf{P}_1(n)$, $\mathbf{P}_1(n+1)$, with the instants $T_{start}(n)$, $T_{mid}(n)$, $T_{end}(n)$.]
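A minimal sketch of a 7-dimensional observation built from 3 static values and 4 time-normalized differences. Which point pairs are differenced is not recoverable from the slide text, so the adjacent-pair choice below (including one point from each neighbouring syllable) is an assumption, as are the function names, the `min_dt` floor and the sample values.

```python
def time_normalized_difference(p_a, p_b, t_a, t_b, scale=1.0, min_dt=0.01):
    """Scaled slope between two log-pitch observations.

    The interval is floored at `min_dt` seconds so the division is always
    defined, mirroring the non-zero time interval requirement.
    """
    dt = max(t_b - t_a, min_dt)
    return scale * (p_b - p_a) / dt

def extended_observation(prev_end, cur, next_start, scale=1.0):
    """Return 3 static values followed by 4 time-normalized differences.

    `cur` is [(t_start, p_start), (t_mid, p_mid), (t_end, p_end)];
    `prev_end` and `next_start` are (t, p) points of the neighbouring
    syllables. The pairing of adjacent points is an illustrative assumption.
    """
    points = [prev_end] + list(cur) + [next_start]
    static = [p for _, p in cur]
    dynamic = [
        time_normalized_difference(pa, pb, ta, tb, scale)
        for (ta, pa), (tb, pb) in zip(points[:-1], points[1:])
    ]
    return static + dynamic

# Hypothetical syllable with its two neighbours (time in seconds, log-pitch).
vec = extended_observation(
    prev_end=(0.10, 5.0),
    cur=[(0.20, 5.1), (0.30, 5.2), (0.40, 5.1)],
    next_start=(0.50, 5.0),
)
```

The `scale` argument plays the role of a scaling factor applied to the dynamic part of the observation.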
Maximum Likelihood Dynamic intonation model
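Assume a cluster sequence Q is predetermined by CART
Each cluster is modeled by a single 7-dim Gaussian: $\mathbf{P}_2(n) \sim \mathcal{N}(\boldsymbol{\mu}_n, \mathbf{U}_n)$
Concatenated observations: $\mathbf{O} = [\ldots, \mathbf{P}_2^T(n-1), \mathbf{P}_2^T(n), \mathbf{P}_2^T(n+1), \ldots]^T$
Concatenated static observations: $\mathbf{C} = [\ldots, \mathbf{P}_1^T(n-1), \mathbf{P}_1^T(n), \mathbf{P}_1^T(n+1), \ldots]^T$
Sparse (block diagonal) linear transformation: $\mathbf{O} = \mathbf{W}\mathbf{C}$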
Maximum Likelihood Dynamic intonation model
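The log-likelihood of the sequence O is given by
$\log P(\mathbf{O} \mid Q) = -\frac{1}{2}\mathbf{O}^T \mathbf{U}^{-1} \mathbf{O} + \mathbf{O}^T \mathbf{U}^{-1} \mathbf{M} + K,$
where
$\mathbf{U}^{-1} = \mathrm{diag}[\mathbf{U}_1^{-1}, \ldots, \mathbf{U}_n^{-1}, \ldots, \mathbf{U}_N^{-1}],$
$\mathbf{M} = [\boldsymbol{\mu}_1^T, \ldots, \boldsymbol{\mu}_n^T, \ldots, \boldsymbol{\mu}_N^T]^T.$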
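The O-dependent part of the log-likelihood can be evaluated with a few lines of code. The sketch below assumes diagonal covariance blocks for brevity (the slides use full 7-dimensional Gaussians) and omits the constant K; the function name and the numbers are hypothetical.

```python
def log_likelihood_without_constant(O, M, U_diag):
    """Evaluate -1/2 O^T U^-1 O + O^T U^-1 M for a diagonal U.

    O, M and U_diag are flat lists of equal length: the concatenated
    observations, the concatenated cluster means and the variances.
    Full covariance blocks are simplified to diagonals in this sketch.
    """
    quad = sum(o * o / u for o, u in zip(O, U_diag))
    cross = sum(o * m / u for o, m, u in zip(O, M, U_diag))
    return -0.5 * quad + cross

# Hypothetical concatenated means and variances (two static, one dynamic entry).
M = [5.0, 5.2, 0.5]
U_diag = [0.01, 0.01, 0.04]

at_mean = log_likelihood_without_constant(M, M, U_diag)
shifted = log_likelihood_without_constant([5.1, 5.2, 0.5], M, U_diag)
```

As expected for a Gaussian, the value at O = M exceeds the value at any shifted observation. The constraint O = WC is what keeps the optimum from simply reproducing the means.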
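Maximum Likelihood Dynamic intonation model
Likelihood maximization with respect to the static observations C: with O = WC, setting the derivative of the log-likelihood with respect to C to zero gives
$\mathbf{W}^T \mathbf{U}^{-1} \mathbf{W} \mathbf{C} = \mathbf{W}^T \mathbf{U}^{-1} \mathbf{M}$
An efficient time-recursive solution exists (Tokuda et al., 1996)
Jointly determines the full utterance pitch curve
The solution depends both on the individual CART cluster models and on their sequence in the synthesized sentence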
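A minimal sketch of solving the normal equations for a toy case. It assumes one static value per frame, one plain difference between consecutive frames and a diagonal U, which makes W^T U^-1 W tridiagonal. A standard tridiagonal (Thomas) solver is swapped in here; it is not the time-recursive algorithm cited on the slide, and all names and values are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: forward elimination, then back substitution."""
    n = len(diag)
    d = diag[:]
    r = rhs[:]
    for i in range(1, n):
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def ml_contour(static_mean, static_var, delta_mean, delta_var):
    """Solve W^T U^-1 W C = W^T U^-1 M for a toy W.

    Toy W: row i copies c[i]; row N+i is c[i+1] - c[i].
    static_* have N entries, delta_* have N-1 entries; U is diagonal.
    """
    n = len(static_mean)
    diag = [1.0 / v for v in static_var]
    rhs = [m / v for m, v in zip(static_mean, static_var)]
    off = [0.0] * (n - 1)
    for i in range(n - 1):
        w = 1.0 / delta_var[i]
        diag[i] += w          # contribution of the -1 entry, squared
        diag[i + 1] += w      # contribution of the +1 entry, squared
        off[i] -= w           # cross term between c[i] and c[i+1]
        rhs[i] -= w * delta_mean[i]
        rhs[i + 1] += w * delta_mean[i]
    return solve_tridiagonal(off, diag, off, rhs)

# Hypothetical static means with an abrupt step and zero-mean differences.
static_mean = [5.0, 5.0, 5.4, 5.4]
contour = ml_contour(static_mean, [0.01] * 4, [0.0] * 3, [0.01] * 3)
```

The step of 0.4 between the second and third static means is spread over the whole contour, because every value is determined jointly with its neighbours.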
Maximum Likelihood Dynamic intonation model
[Figure: pitch contours of the mean solution and the ML dynamic solution; horizontal axis ticks from 0 to 1.8, vertical axis ticks from 130 to 230.]
Smooths abrupt changes existing in the mean solution
Controlled by the scaling factor inside the dynamic observations
Allows usage of larger CART trees for fine clustering
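The influence of the scaling factor can be illustrated on a toy step contour. The sketch assumes that the factor multiplies each difference, that all variances are 1 and that the dynamic targets are zero, so the normal equations reduce to (I + scale^2 D^T D) c = static mean. These simplifications, the names and the values are assumptions made for illustration only.

```python
def solve_dense(A, b):
    """Gaussian elimination without pivoting (adequate for this SPD toy system)."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

def smoothed_contour(static_mean, scale):
    """ML contour when each dynamic observation is scale * (c[i+1] - c[i])."""
    n = len(static_mean)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - 1):
        A[i][i] += scale ** 2
        A[i + 1][i + 1] += scale ** 2
        A[i][i + 1] -= scale ** 2
        A[i + 1][i] -= scale ** 2
    return solve_dense(A, static_mean)

def roughness(c):
    """Sum of squared steps between neighbouring contour values."""
    return sum((b - a) ** 2 for a, b in zip(c[:-1], c[1:]))

# Hypothetical mean solution with one abrupt change.
step = [150.0, 150.0, 150.0, 200.0, 200.0, 200.0]
for scale in (0.0, 1.0, 3.0):
    print(scale, round(roughness(smoothed_contour(step, scale)), 2))
```

With scale 0 the mean solution is returned unchanged; larger values give progressively smoother contours.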
Microprosody preservation
Improve rough pitch curve resolution
Keep the original fine pitch structure inside each contiguous portion of speech to increase naturalness, while staying aligned with the target intonation curve
Compensate for imperfections of the CART model and feature extraction
[Figure: pitch contour plot; vertical axis F0 in log Hz.]
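The slides do not spell out the alignment rule, so the following is only one simple possibility: shift the original log-F0 of a contiguous portion by a constant offset so that its mean matches the target curve. The constant-offset rule, the function name and the sample values are assumptions, not the author's documented technique.

```python
def align_contiguous_portion(original_log_f0, target_log_f0):
    """Shift the original contour so its mean matches the target's mean.

    Both inputs are log-F0 samples over one contiguous portion of speech.
    A constant offset leaves every sample-to-sample difference (the fine
    structure) of the original contour unchanged.
    """
    offset = (sum(target_log_f0) / len(target_log_f0)
              - sum(original_log_f0) / len(original_log_f0))
    return [p + offset for p in original_log_f0]

original = [5.00, 5.03, 4.98, 5.02]   # hypothetical recorded log-F0
target = [5.20, 5.21, 5.22, 5.23]     # hypothetical model target
aligned = align_contiguous_portion(original, target)
```

The aligned contour follows the target level while keeping the small local movements of the recording.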
Mean solution vs. ML dynamic intonation model
Mean solution : ML dynamic solution
Pref.   No pref.   Static, smoothed (A)         Dynamic ML (B)
                   All pref.   Strong pref.     All pref.   Strong pref.
%       22.5       34.3        3.2              43.2        9.2
Incorporation within CTTS system
Applied to an embedded version of the IBM CTTS system with a sub-phoneme basic concatenation unit (typically one third of a phoneme)
(A): CART mean solution as a target pitch, smoothed original pitch curve as a synthesis pitch.
(B): dynamic ML CART solution as a target pitch, use the microprosody preservation technique to combine original and target pitches
TTS experts + native speakers subjective results
Pref.   No pref.   (A)                                     (B)
                   Strong or weak pref.   Strong pref.     Strong or weak pref.   Strong pref.
%       37.9       27.7                   3.2              34.4                   6.7
Summary and further research directions
A dynamic ML CART intonation model was proposed and shown to perform better than the baseline CART intonation model.
It was successfully combined with the original pitch curve using the microprosody preservation technique.
Further research
  Alternative dynamic features
  Statistical microprosody modeling for very-small-footprint voices
  Adaptive microprosody incorporation