Page 1
eNTERFACE’05, Friday Aug. 5th
SPEECH SYNTHESIS: UP FROM STATE-OF-THE-ART CORPUS-BASED APPROACHES?
Provided to you equation-free by:
Thierry Dutoit ([email protected] @fpms.ac.be)
TCTS Lab, Faculté Polytechnique de Mons, Belgium
Page 2
TTS = NLP + DSP
[Figure: block diagram of a TEXT-TO-SPEECH SYNTHESIZER. TEXT enters NATURAL LANGUAGE PROCESSING (phonetization; intonation/duration generation), which outputs a narrow phonetic transcription (phones, intonation/duration); DIGITAL SIGNAL PROCESSING (speech synthesis) turns it into SPEECH. Panels (a), (b), (c).]
Text: To be or not to be, that is the question.
Narrow phonetic transcription:
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…
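The transcription on this slide can be read programmatically. The sketch below assumes it follows the MBROLA-style layout (one phone per line: symbol, duration in ms, then optional pairs of position in % of the duration and F0 in Hz); the line breaks in `EXAMPLE` and the names `Phone` and `parse_pho_line` are assumptions made for illustration, not part of the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phone:
    symbol: str                          # phone label, "_" is silence
    duration_ms: int                     # phone duration in milliseconds
    pitch_points: List[Tuple[int, int]]  # (position in %, F0 in Hz), assumed units


def parse_pho_line(line: str) -> Phone:
    """Parse one 'symbol duration [position pitch]...' line."""
    fields = line.split()
    symbol, duration = fields[0], int(fields[1])
    values = [int(v) for v in fields[2:]]
    if len(values) % 2 != 0:
        raise ValueError(f"pitch values must come in pairs: {line!r}")
    # Group the remaining numbers two by two: (position, pitch).
    points = list(zip(values[0::2], values[1::2]))
    return Phone(symbol, duration, points)


# Beginning of "To be ..." as shown on the slide (line breaks reconstructed).
EXAMPLE = """\
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
"""

phones = [parse_pho_line(l) for l in EXAMPLE.splitlines() if l.strip()]
total_ms = sum(p.duration_ms for p in phones)
print([p.symbol for p in phones], total_ms)
```

Each parsed phone carries exactly what the DSP block needs from the NLP block: which sound, for how long, and at which pitch.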
Page 3
Challenges
• Accurate automatic phonetization (≠ dictionary look-up)
• Prosody generation (i.e., intonation and phoneme durations) must be “coherent”; it is easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
– Coarticulation! (~Harris, 53)
– Segmental quality should be maintained after pitch and duration modification
• Engineering
– Low design and maintenance cost
– Low computational and memory cost
– Easy adaptation to other languages
Intelligible – Natural – Cost effective
Page 5
Contents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based synthesis?
Page 6
Von Kempelen’s talking machine (1791)
[Figure: Von Kempelen’s talking machine, with labeled parts: mouth, nostrils, main bellows, small bellows, 'S' pipe, 'Sh' pipe, 'S' lever, 'Sh' lever. (J.S. Liénard, LIMSI)]
Page 7
Homer Dudley’s Voder (Bell Labs, 1936)
[Figure: Voder schematic. Labels: noise source, oscillator, resonance control, amplifier; Voder console keyboard with keys 1 to 10, "quiet" key, stop keys t-d, p-b, k-g; energy switch, wrist bar; pitch-control pedal; UV / V.]
Page 8
John Holmes’ formant synthesizer (1964)
Rule-based Synthesis
Haskins Labs (1968), DecTalk (1983), InfoVox (1983-95), LIMSI’s Polyglot (92)
Intelligibility Naturalness Mem/CPU New Voice (<100 kB)
Page 9
Contents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based synthesis?
Page 10
Diphone concatenation (1977)
Page 11
Diphone concatenation (1968)
[Figure: diphone synthesis of "dog". Phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour; diphones _d, do, og, g_ from the diphone database; smooth joints; prosody modification. Waveform plot: samples 0 to 8000, amplitude -1 to 1 (×10^4).]
AT&T: LPC (1980)
France Telecom: PSOLA (1990)
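The "dog" example above can be sketched in a few lines: each diphone pairs two adjacent phonemes. The nominal durations below assume that a diphone runs from the middle of one phoneme to the middle of the next; this midpoint convention and the function names are assumptions made for illustration, not something the slide states.

```python
from typing import List, Sequence


def to_diphones(phonemes: Sequence[str]) -> List[str]:
    """Pair each phoneme with its successor: _ d o g _ gives _d do og g_."""
    return [a + b for a, b in zip(phonemes, phonemes[1:])]


def diphone_durations(durations_ms: Sequence[float]) -> List[float]:
    """Nominal diphone length, ASSUMING cuts at phoneme midpoints."""
    return [a / 2 + b / 2 for a, b in zip(durations_ms, durations_ms[1:])]


phonemes = ["_", "d", "o", "g", "_"]
durations_ms = [50, 80, 160, 70, 50]  # values shown on the slide

units = to_diphones(phonemes)
unit_ms = diphone_durations(durations_ms)
print(list(zip(units, unit_ms)))
```

A stored diphone rarely has exactly the requested length or pitch, which is why the prosody modification step (LPC, PSOLA) follows the database look-up.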
Page 12
The MBROLA project (95)
http://tcts.fpms.ac.be/synthesis/
Page 13
Diphone concatenation (1977)
[Figure: diphone synthesis of "dog". Phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour; diphones _d, do, og, g_ from the diphone database; smooth joints; prosody modification. Waveform plot: samples 0 to 8000, amplitude -1 to 1 (×10^4).]
Intelligibility Naturalness ~ Mem/CPU New Voice ~ (5 MB: “High DENSITY TTS”)
Page 14
Contents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based synthesis?
Page 15
Corpus-based synthesis
[Figure: diphone-based synthesis of "dog". Phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour; diphones _d, do, og, g_ from the diphone database; smooth joints; prosody modification. Waveform plot: samples 0 to 8000, amplitude -1 to 1 (×10^4).]
Diphone-based synthesis
Page 16
Corpus-based synthesis
[Figure: same synthesis chain for "dog", with the diphone database replaced by a VERY LARGE CORPUS. Phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour; units _d, do, og, g_; smooth joints; prosody modification. Waveform plot: samples 0 to 8000, amplitude -1 to 1 (×10^4).]
Unit selection - corpus-based synthesis
(ATR, 1996) (Univ. Edinburgh, 1997) (AT&T, 1998) (L&H, 1999)
(Loquendo, 2001) (Babel Technologies, 2003)
Page 17
Corpus-based synthesis
How to get the best sequence of units for a given utterance? Viterbi search
Use a model, but give last word to the data
Or: Choose the best, modify the least
[Figure: Target j above candidate units Unit i-1, Unit i, Unit i+1; target cost tc(j,i) between Target j and Unit i; concatenation costs cc(i-1,i) and cc(i,i+1) between neighbouring units.]
Intelligibility Naturalness Mem/CPU ~ New Voice (1 GB: “High QUALITY TTS”)
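The Viterbi search named on this slide can be sketched as a dynamic program over candidate units, minimising the sum of target costs tc and concatenation costs cc. Everything concrete below is an assumption made for illustration: the `Unit` fields, the two toy cost functions, and the tiny candidate lists. Real systems use much richer costs and pruning.

```python
from typing import Callable, List, NamedTuple, Sequence


class Unit(NamedTuple):
    pos: int    # hypothetical position of the unit in the recorded corpus
    f0: float   # hypothetical average pitch of the unit, in Hz


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[Unit]],
    tc: Callable[[float, Unit], float],
    cc: Callable[[Unit, Unit], float],
) -> List[Unit]:
    """Return the candidate sequence with the lowest total tc + cc."""
    if not targets:
        return []
    # best[j][k]: cheapest cumulated cost of a path ending in candidates[j][k]
    best = [[tc(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for j in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[j]:
            prev = candidates[j - 1]
            k = min(range(len(prev)), key=lambda i: best[j - 1][i] + cc(prev[i], u))
            row.append(best[j - 1][k] + cc(prev[k], u) + tc(targets[j], u))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Backtrack from the cheapest final unit.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for j in range(len(targets) - 1, -1, -1):
        path.append(candidates[j][k])
        k = back[j][k]
    return path[::-1]


# Toy costs (assumptions): pitch mismatch, and a penalty for non-adjacent units.
def target_cost(target_f0: float, u: Unit) -> float:
    return abs(target_f0 - u.f0) / 10


def concat_cost(a: Unit, b: Unit) -> float:
    return 0.0 if b.pos == a.pos + 1 else 1.0


targets = [100, 120, 110]
candidates = [
    [Unit(0, 100), Unit(10, 105)],
    [Unit(11, 118), Unit(5, 120)],
    [Unit(12, 112), Unit(6, 90)],
]
path = select_units(targets, candidates, target_cost, concat_cost)
print([u.pos for u in path])  # prints [10, 11, 12]
```

In this toy run the search prefers three units that were contiguous in the corpus (positions 10, 11, 12) over units that match the target pitch exactly but would need two joins, which is the "choose the best, modify the least" idea in miniature.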
Page 18
Contents
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• Diphone concatenation
• Corpus-based (Unit Selection) synthesis
• Is there a future after corpus-based synthesis?
Page 19
Speech Science?
This time is over
– planes do not flap their wings
– replace experts by corpora
cf. Jelinek’s «Each time I fire a linguist my recognition rate goes 1% higher»
1. Future milestones in speech processing will come from labs with strong commitment to solid, portable, and extensible code;
2. Speech scientists and software engineers will soon be the same people.
Spoken Language Engineering!
ICASSP-INTERSPEECH: “Speech” synthesis “Spoken Language” Synthesis
Page 21
However…
• Engineering is now in the hands of companies
– Reduce the footprint of TTS systems (a few Megs)
– Create new voices as fast as possible
• (Academic) TTS research?
– Speech coding? +-DEAD
– Voice conversion? YES
– Speaker adaptation? YES
– Expressive speech synthesis?
- Corpus-based: (ex: Loquendo)
- DSP-based: eNTERFACE #6 ☺ Who will win?
• At FPMs: Back to acoustic speech modeling Voice quality analysis
– Breathy, Creaky, Diplophonic, Tense, Relaxed, etc.
– Using acoustic features (spectral tilt, glottal formant estimation, open quotient of glottal waveform, etc.)
Page 22
Summertime …and the living is easy…