Talsyntes: Joakim Gustafson – Tal, musik och hörsel (Speech, Music and Hearing)

Speech synthesis (DT2112)
Joakim Gustafson, CTT, KTH
School of Computer Science and Communication
Many slides prepared by Olov Engwall (and others)

Text-To-Speech synthesis (TTS)
The automatic generation of synthesized sound or visual output from any phonetic string. Our focus in this course!

Different kinds of speech synthesis
• Recorded speech
  – Words or phrases (telephone banking)
  – Fixed vocabulary – maintenance problems…
• Concatenative speech synthesis
• Parametric synthesis
• Multimodal synthesis

What a synthesiser is to convey
• The linguistic component: semantic information that is part of the speaker’s language (e.g. question intonation)
• The paralinguistic component: the speaker’s attitudinal or emotional states, sociolect and regional dialect.
• The extralinguistic component: the individuality, gender and age of a certain speaker. It can be judged independently of the language.
To adapt a speech synthesizer to a certain speaker, we need both the para- and extralinguistic components.
Desirable synthesis features from a dialogue perspective
• Real-time, incremental, interruptible
• Explicit control of prosodic parameters
  – Fundamental frequency
  – Intensity
  – Natural-sounding lengthening, hesitation, interruptions
• Generation of extra-linguistic sounds
  – Filled pauses
  – Creaks/gargles
  – Smacks/inhalations/exhalations to give the turn

The synthesis space
Intelligibility, naturalness, bit rate, vocabulary, units, complexity, processing needs, flexibility, speech knowledge, cost
Intonation: F0 contour
• Large pitch range (female)
• Authoritative (final fall)
• Emphasis for “Finance” (H*)
• Final has a rise – more information to come
• Word stress and sentence intonation
– each word has at least one syllable which is spoken with higher prominence
– in each phrase the stressed syllable can be accented depending on the semantics and syntax of the phrase
• Prosody relies on syntax, semantics and pragmatics: it is a personal reflection of the reader.
By concatenation: elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.

By rule: speech is produced by mathematical rules that describe the influence of phonemes on one another.
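The concatenation idea can be sketched in a few lines. This is a toy illustration, not any real system: the "database", unit names and waveform values are invented, and the join is a simple linear crossfade standing in for the real signal processing.

```python
# Minimal sketch of concatenative synthesis: unit waveforms (here, fake
# number lists) are fetched from a database and joined with a short linear
# crossfade to smooth each boundary. All names and values are illustrative.

def crossfade_concat(units, fade=4):
    """Concatenate waveform fragments, crossfading `fade` samples per join."""
    out = list(units[0])
    for unit in units[1:]:
        tail, head = out[-fade:], unit[:fade]
        # Linear crossfade: ramp the old unit down while ramping the new one up.
        for i in range(fade):
            w = (i + 1) / (fade + 1)
            out[-fade + i] = (1 - w) * tail[i] + w * head[i]
        out.extend(unit[fade:])
    return out

# Hypothetical database of two stored units
db = {"a-u": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
      "u-t": [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]}
signal = crossfade_concat([db["a-u"], db["u-t"]])
```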
Research trends in speech synthesis
From articulation to acoustics
(Figure: the articulatory-to-acoustic modelling chain – cross-sections, area function, tubes, vocal tract model, transfer function, 3D air flow calculations, waveform.)
Articulatory synthesis

Benefits:
• Speech is produced in the same way as in humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)

Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad
Formant synthesis

Benefits:
• Possible to change the voice to get different:
  – speakers
  – emotions
  – voice qualities
• Small footprint

Disadvantages:
• Hard to achieve naturalness in the voice source
• Some consonant sounds are hard to model with formants (bursts)
From rule-based to concatenative synthesis
• Rule-based synthesis sounds unnatural, while concatenative synthesis provides (piece-wise) high-quality speech.
• Certain sounds are hard to produce by rule but easy to concatenate:
  – Bursts and voiceless stops are too difficult
• Rule-based synthesis had the advantage of a small footprint, but storing the segment database is no longer an issue.
• Change of applications:
  – From reading machines for the blind to spoken dialogue systems
Synthesis by concatenation
Let’s get the terms straight
Concatenative synthesis. Definition: all kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units). There is only one candidate per setting.

Unit selection synthesis. Definition: all kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable.
Database preparation when building a concatenative synthesizer
• Choose the speech units (phone, diphone, sub-word unit, cluster-based unit selection)
• Compile and record utterances
• Segment the signal and extract speech units
• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch marks
  – e.g. “ch-l r021 412.035 463.009 518.23” = diphone, file, start time, middle time, end time
• Pitch-mark file: a list of each pitch-mark position in the file
• Extract parameters; create a parametric segment database (for data compaction and prosody matching)
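Reading an index entry like the “ch-l r021 …” example above can be sketched as follows. This is a guess at the field layout based only on the example shown (diphone name, file id, then three times, whose unit is assumed here to be milliseconds); real systems differ.

```python
# Sketch: parse one diphone-index entry of the form shown on the slide.
# Field layout assumed from the example: diphone, file id, start/middle/end.

from collections import namedtuple

DiphoneEntry = namedtuple("DiphoneEntry", "diphone file start middle end")

def parse_entry(line):
    name, file_id, start, middle, end = line.split()
    return DiphoneEntry(name, file_id, float(start), float(middle), float(end))

entry = parse_entry("ch-l r021 412.035 463.009 518.23")
# The cut point between the two half-phones sits at `entry.middle`,
# i.e. near the stable centre of a phone.
```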
Signal manipulations in concatenative synthesis
• Prosodic modifications
  – Possibility to modify F0
  – Possibility to lengthen or shorten segments
• Spectral modifications
  – Interpolation of spectrum at joints
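Lengthening in the PSOLA spirit can be illustrated with a toy example: given pitch marks, a voiced segment is stretched by repeating whole pitch periods, which changes duration without changing F0. Real PSOLA windows and overlap-adds the periods; this sketch (all values invented) simply duplicates them.

```python
# Toy duration lengthening: repeat every other pitch period of a "waveform".
# Real PSOLA applies windows and overlap-add; this only shows the idea.

def lengthen(samples, pitch_marks, repeat_every=2):
    """Duplicate every `repeat_every`-th pitch period to stretch duration."""
    out = []
    bounds = [0] + pitch_marks + [len(samples)]
    for k in range(len(bounds) - 1):
        period = samples[bounds[k]:bounds[k + 1]]
        out.extend(period)
        if k % repeat_every == 0:
            out.extend(period)  # repeat this pitch period
    return out

x = list(range(12))              # stand-in waveform, 3 periods of 4 samples
stretched = lengthen(x, [4, 8])  # pitch marks at samples 4 and 8
```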
Diphones
• Sequences of a particular sound/phone in all its environments of occurrence, or all/most two-phone sequences occurring in a language: _auto_ → _a, au, ut, to, o_
• Rationale: the center of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.
• Problems:
  – Signal processing is still necessary for modifying durations
  – Source data is still not natural
  – Units are just not large enough; can’t handle word-specific effects, etc.
Slide from Dan Jurafsky
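The _auto_ decomposition above can be expressed directly in code: pad the phone string with silence markers and take every adjacent pair.

```python
# Decompose a phone string into diphones, padding with silence "_"
# as in the slide's example: _auto_ -> _a, au, ut, to, o_.

def diphones(phones):
    """Split a phone string, padded with silence '_', into diphone pairs."""
    padded = "_" + phones + "_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

assert diphones("auto") == ["_a", "au", "ut", "to", "o_"]
```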
From diphone synthesis to Unit Selection Synthesis
• Natural data solves problems with diphones
  – Diphone databases are carefully designed, but:
    • The speaker makes errors
    • The speaker doesn’t speak the intended dialect
    • They require the database design to be right
  – If it’s automatic:
    • Labeled with what the speaker actually said
    • Coarticulation, schwas and flaps are natural
• “There’s no data like more data”
  – Lots of copies of each unit mean you can choose just the right one for the context
  – Larger units mean you can capture wider effects
Unit Selection Intuition
• Given a big database
• For each segment that we want to synthesize
  – Find the unit in the database that is the best to synthesize this target segment
• What does “best” mean?
  – Target cost: closest match to the target description, in terms of:
    • Phonetic context
    • Pitch, power, duration, phrase position
  – Concatenation cost: the difference between the end of diphone 1 and the start of diphone 2:
    • Matching formants and other spectral characteristics
    • Matching energy
    • Matching F0
Slide from Dan Jurafsky
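The search over target and concatenation costs is a dynamic-programming (Viterbi-style) minimisation. The sketch below shows the idea with made-up candidate names and cost functions; real systems compute the costs from acoustic features.

```python
# Unit selection as dynamic programming: pick one candidate unit per target
# position, minimising summed target costs plus concatenation (join) costs.
# Candidate ids and the two cost functions below are purely illustrative.

def select_units(candidates, target_cost, join_cost):
    """candidates[t] = list of unit ids for target position t."""
    # best[u] = (total cost, unit sequence) for paths ending in unit u
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            prev_cost, prev_path = min(
                ((best[p][0] + join_cost(p, u), best[p][1]) for p in best),
                key=lambda cp: cp[0],
            )
            new_best[u] = (prev_cost + target_cost(t, u), prev_path + [u])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])[1]

cands = [["a1", "a2"], ["b1", "b2"]]
def tcost(t, u):   # invented target costs: "*1" candidates fit the target best
    return 0.0 if u.endswith("1") else 1.0
def jcost(p, u):   # invented join costs: one pair joins perfectly
    return 0.0 if (p, u) == ("a2", "b1") else 0.5
seq = select_units(cands, tcost, jcost)
```

Note the trade-off the example encodes: a2→b1 joins for free, but a2 has a high target cost, so the cheapest path is still a1→b1.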
Assignment 1: Practical exercises with the calculation of target and concatenation cost.
The Mahalanobis distance is useful when multivariate normal distributions lead to non-spherically-symmetric distributions.
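For the common diagonal-covariance simplification, the Mahalanobis distance just scales each feature dimension by its variance, so dimensions with a large natural spread count less. The feature values below are illustrative.

```python
# Mahalanobis distance with a diagonal covariance: each squared difference
# is divided by that dimension's variance before summing.

import math

def mahalanobis_diag(x, y, variances):
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Two feature vectors; the second dimension varies 100x more than the first.
d = mahalanobis_diag([1.0, 10.0], [2.0, 20.0], variances=[1.0, 100.0])
# Plain Euclidean distance would be dominated by the second dimension;
# here both dimensions contribute equally.
```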
The units in Unit Selection
• Different types of units: e.g. diphones, phones, syllables, words, etc.
• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters
• Units nearest in this space to the targets will be chosen and will require only minor modification
• The corpus is segmented into phonetic units, indexed, and used as-is
• Selection is made on-line
• The trend is towards longer and longer units
• Large databases of recorded natural speech
• Minimal processing
• Annotation of database – what information is needed?
• Few cuts → maximally long units selected (but context and prosody must fit well)
• Target and concatenation costs
Slide from Dan Jurafsky
Features of Unit Selection Synthesis
• Unlike diphone synthesis, prosodic variation is a good thing
• Accurate annotation is crucial
• Pitch annotation needs to be very accurate
• Phone alignments can be done automatically, as described for diphones
Slide from Dan Jurafsky
Practical System Issues
• Size of a typical system (Rhetorical rVoice): ~300M
• Speed: for each diphone there are on average 1000 units to choose from, so:
  – 1000 target costs
  – 1000 × 1000 join costs
  – Each join cost, say, 30 × 30 floating-point calculations
  – 10–15 diphones per second
  – ≈ 10 billion floating-point calculations per second
• But commercial systems must run ~50x faster than real time
• Heavy pruning is essential: 1000 units → 25 units
Slide from Paul Taylor
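The arithmetic behind the "10 billion" figure is easy to check. All the inputs are the slide's own rough estimates; the diphone rate is taken as the midpoint of 10–15.

```python
# Back-of-envelope check of the join-cost arithmetic on the slide.

candidates = 1000         # units to choose from per diphone
join_ops = 30 * 30        # floating-point operations per join cost
diphones_per_sec = 12     # midpoint of the slide's 10-15

flops = candidates * candidates * join_ops * diphones_per_sec
# ~1.1e10 floating-point operations per second of synthesized speech.
# Pruning 1000 -> 25 candidates cuts the join-cost count by (1000/25)**2 = 1600x.
```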
Summary: Unit Selection
• Advantages:
  – Quality is far superior to diphones
  – Natural prosody selection sounds better
  – Non-linguistic features of the speaker’s voice are built in
• Disadvantages:
  – Fixed voice
  – Quality can be very bad in places
    • HCI problem: a mix of very good and very bad is quite annoying
  – Large footprint, and it is computationally expensive
  – Can’t synthesize everything you want:
    • The diphone technique can move emphasis
    • Unit selection gives a good (but possibly incorrect) result
Slide from Richard Sproat
From Unit Selection to HMM synthesis
• Problems with unit selection synthesis
  – Discontinuities: can’t modify the signal
  – Hit or miss: the database often doesn’t have exactly what you want
  – Fixed voice
• Solution: HMM (Hidden Markov Model) synthesis
  – Stable, smooth, and easy to create multiple voices
  – Sounds unnatural to researchers, but naïve subjects prefer it
• Example: Nina as a unit selection and an HMM synthesis voice
• An HMM is a machine with a limited number of possible states.
• The transition between two states is regulated by probabilities.
• Every transition results in an observation with a certain probability.
• The states are hidden, only the observations are visible.
(Figure: a left-to-right HMM with self-loop and forward transition probabilities P_ii, P_ij, P_jj, P_jk, P_kl, P_ll and observations O_i, O_j, O_k, O_l.)
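The bullets above can be made concrete with a toy left-to-right HMM like the one in the figure. The transition probabilities are invented, and for simplicity this sketch emits one fixed observation per state visit (the slide describes observations on transitions, with emission probabilities; the left-to-right walk is the same).

```python
# Toy left-to-right HMM: states i, j, k with self-loops and forward
# transitions, sampled until the terminal "end" state is reached.
# All probabilities are illustrative.

import random

TRANS = {
    "i": {"i": 0.5, "j": 0.5},
    "j": {"j": 0.6, "k": 0.4},
    "k": {"k": 0.7, "end": 0.3},
}
EMIT = {"i": "Oi", "j": "Oj", "k": "Ok"}

def sample(rng):
    """Walk the HMM from state i to 'end', collecting one observation per visit."""
    state, obs = "i", []
    while state != "end":
        obs.append(EMIT[state])
        r, acc = rng.random(), 0.0
        for nxt, p in TRANS[state].items():
            acc += p
            if r < acc:
                break
        state = nxt
    return obs

observations = sample(random.Random(0))
```

Because only self-loops and forward transitions exist, every sampled sequence starts with "Oi", ends with "Ok", and never moves backwards: exactly the structure HMM synthesis exploits for modelling phone durations.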
HMMs in synthesis
Relationship between unit selection and HMM synthesis
Slide from Tokuda
Relationship between unit selection and HMM synthesis 2
Slide from Tokuda
The training part
• The training is automatic. You need:
  – The text + recordings of about 1000 sentences
• The training on 1000 sentences takes 24 hours and generates a voice of less than 1 MB
• Separate HMMs for: spectrum, F0, duration
• Training in two steps:
  1. Context-independent models
  2. Use these models to create context-dependent models
• Clustering:
  – Storing all contexts requires much space
  – It may be difficult to find alternatives for missing models
  – Many models are very similar = redundancy
Examples of features in HMM synthesis training
• Segment features:
  – immediate context
  – position in syllable
• Syllable features:
  – stress and lexical accent type
  – position in word and phrase
• Word features:
  – number of syllables
  – position in phrase
  – morphological feature (compound or not)
  – part-of-speech tag (content or function word)
• Phrase features:
  – phrase length in terms of syllables and words
• Utterance features:
  – length in syllables, words and phrases
• Speaker:
  – dialect, speaking style, emotional state
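In practice, features like these are flattened into one "full-context" label string per phone before training. The sketch below builds a much-simplified label; the field layout and separators are invented for illustration (real full-context label formats carry far more fields).

```python
# Build a simplified full-context label for one phone from a few of the
# features listed above. The format here is hypothetical.

def context_label(prev, phone, nxt, pos_in_syll, stress, pos_in_word):
    return f"{prev}-{phone}+{nxt}/syl:{pos_in_syll}/str:{stress}/wrd:{pos_in_word}"

label = context_label("s", "i:", "l", pos_in_syll=1, stress=1, pos_in_word=2)
```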
Clustering
• Groups a large database into clusters
• Three trees: duration, F0 and spectrum
• Division is based on yes/no questions
  – Grouping acoustically similar phonemes
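One step of this yes/no-question division can be sketched as follows: contexts that answer a question the same way fall into the same branch, so models with similar contexts end up sharing parameters. The phoneme classes and the single question are illustrative.

```python
# One split of decision-tree clustering: partition phone contexts by a
# yes/no question. The vowel set and question are illustrative examples.

VOWELS = {"a", "e", "i", "o", "u"}

def split_by_question(contexts, question):
    """Partition contexts into (yes, no) branches for one question."""
    yes = [c for c in contexts if question(c)]
    no = [c for c in contexts if not question(c)]
    return yes, no

# Question: "is the left-context phone a vowel?"
contexts = ["a-t", "i-t", "s-t", "k-t"]
yes, no = split_by_question(contexts, lambda c: c.split("-")[0] in VOWELS)
```

A real tree would pick, at each node, the question that best separates the acoustics (e.g. by likelihood gain), then recurse on each branch.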