SSW6 Bonn Aug. 2007
Communicative Speech Synthesis with XIMERA: a First Step
Shinsuke Sakai 1,2, Jinfu Ni 1,2, Ranniery Maia 1,2, Keiichi Tokuda 1,3, Minoru Tsuzaki 1,4, Tomoki Toda 1,5, Hisashi Kawai 2,6, Satoshi Nakamura 1,2
1 NiCT, Japan
2 ATR-SLC, Japan
3 Nagoya Institute of Technology, Japan
4 Kyoto City University of Arts, Japan
5 Nara Institute of Science and Technology, Japan
6 KDDI Research and Development Labs, Japan
waveforms.
• Ex. -3: sounds like bad news, -2: pretty sure that it sounds like bad news, -1: rather sounds like bad news, 0: no distinction, …
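A minimal sketch of how ratings on a signed scale like this one could be turned into a style recognition rate and a mean opinion score. The slides do not say how the reported percentages were computed, so everything here is an assumption: that negative ratings mean "bad news" and positive ratings mean "good news", that a rating of 0 counts as not recognized, and the function names `style_recognition_rate` and `mean_opinion_score` are hypothetical. The ratings in the example are made up and are not data from the study.

```python
from typing import Sequence


def style_recognition_rate(ratings: Sequence[int], intended: str) -> float:
    """Share of ratings that fall on the intended side of the scale.

    Assumption: negative = "bad news", positive = "good news",
    0 = "no distinction" and is counted as not recognized.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if intended == "bad":
        hits = sum(1 for r in ratings if r < 0)
    elif intended == "good":
        hits = sum(1 for r in ratings if r > 0)
    else:
        raise ValueError("intended must be 'good' or 'bad'")
    return hits / len(ratings)


def mean_opinion_score(scores: Sequence[float]) -> float:
    """Plain arithmetic mean of naturalness scores (e.g. on a 1 to 5 scale)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


# Hypothetical listener ratings for stimuli intended as "bad news".
example_ratings = [-3, -2, -1, 0, -2, 1, -1, -3]
rate = style_recognition_rate(example_ratings, "bad")
print(f"recognized as bad news: {rate:.1%}")

# Hypothetical naturalness scores for the same stimuli.
print(f"MOS: {mean_opinion_score([3, 4, 3, 2, 3, 3, 4, 2]):.2f}")
```

Counting a 0 rating as a miss is the conservative choice; excluding neutral ratings from the denominator would give a higher rate for the same data.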
Experiments: results
Observations
(1) Intended style perception was well achieved while maintaining good naturalness.
    “Good news” recognized at 66.7%, MOS 3.6 (G-G2)
    “Bad news” recognized at 98.4%, MOS 2.9 (B-B2)
(2) “Good news” styles sounded more natural to listeners.
    “Good news” is more similar to neutral (..?)
(3) Clearer style perception for “bad news”.
[Figure: results chart over conditions 1 to 8, with markers pointing to observations (1), (2), and (3).]
Experiments: results (cont’d)
Other observations
(1) The target model alone is not enough. A unit DB for the specific style makes a difference.
(2) Adding neutral data does not improve naturalness (it causes a slight degradation instead).
(3) Speech with good/bad news styles sounded more natural if developed with the same amount of data.
[Figure: results chart over conditions 1 to 8, with markers pointing to observations (1), (2), and (3).]
Experiments: F0-related observations
• Natural F0 for:
  – “bad news” speech:
    • F0 mean is low.
    • Dynamic range is narrower.
  – “good news” speech:
    • F0 mean is a little higher.
    • Dynamic range is a little wider.
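The two quantities compared above, F0 mean and dynamic range, can be sketched as follows. This is an illustration under stated assumptions, not the analysis used in the talk: the slides do not define "dynamic range", so it is taken here as the span between the lowest and highest voiced F0 value, expressed in semitones; unvoiced frames are assumed to be marked with 0; the function name `f0_summary` and the contour values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def f0_summary(f0_hz: Sequence[float]) -> Tuple[float, float]:
    """Return (mean F0 in Hz, F0 range in semitones) over voiced frames.

    Assumption: frames with F0 <= 0 are unvoiced and are skipped.
    Assumption: "dynamic range" means the max/min span in semitones.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in contour")
    mean_hz = sum(voiced) / len(voiced)
    range_semitones = 12.0 * math.log2(max(voiced) / min(voiced))
    return mean_hz, range_semitones


# Hypothetical contours (Hz per frame); 0 marks an unvoiced frame.
contour_a = [0, 180, 200, 220, 0, 190, 170]
contour_b = [0, 200, 240, 280, 0, 230, 190]

for name, contour in [("a", contour_a), ("b", contour_b)]:
    mean_hz, range_st = f0_summary(contour)
    print(f"contour {name}: mean {mean_hz:.1f} Hz, range {range_st:.1f} semitones")
```

A max/min span is sensitive to pitch-tracking outliers; a percentile-based span would be a more robust alternative for real recordings.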
Conclusion
• Initial attempt at communicative speech synthesis with “good news” and “bad news” styles, using 3 hours of speech from each style-specific corpus.
• Intended style perception was well achieved while maintaining good naturalness. “Good news” was recognized at 66.7% with MOS 3.6 (G-G2). “Bad news” was recognized at 98.4% with MOS 2.9 (B-B2).
• Not only target models but also unit databases for the specific styles were effective in synthesizing speech in the intended styles.
• Plan to investigate the contributions of spectral, F0, and duration features separately, instead of the models themselves.
Appendices
(appendix) Test sentences
・ Prepared a set of sentences that can each be interpreted as neutral, good news, or bad news.