Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers
Géza Németh, Géza Kiss, Bálint Tóth
Laboratory of Speech Technology, Department of Telecommunications and Media Informatics
Budapest University of Technology and Economics, Budapest, Hungary
Applied Research
- Fully proprietary components and solutions: all parameters controlled, systems are tailor-made for the end-user, integration of original research results, unique products
- T-Mobile Hungary services: e-mail reader 1999-; name and address reader in reverse directory, 2003 (motto: "Why is the human operator speaking, not the machine?!"); Symbian SMS-reader 2002- (STL)
- Hungarian VoiceXML browser (2003, TSP+STL)
- Industrial information systems (STL, TSP)
- Unified Messaging (STL)
- Call Center (STL, TSP)
- Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices; STL, TSP)
- Disability (1986-; speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people; STL, TSP, LSA)
- Text structure elements already contained in SSML 1.0: paragraph, sentence
- Suggested further structuring: word, syllables
Overview
Text-to-phoneme conversion
Text structure
Prosody prescription
Summary
Text normalization
Prosody prediction
This structuring can be used
- to help text-to-phoneme conversion, prosody prediction and prescription, … by giving higher-level information, namely syllable structure and part-of-speech information (examples given later)
- to indicate words in languages that do not use spaces to separate words
Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis:
- Easier for a human editor to add
- Replacing synthesis processor may …
E.g. three kilos:
<e POS="cardinal" number="plural" gender="neuter" case="genitive"> 3 k. </e>
When pronunciation cannot be determined, you can
1. Add a lexicon element, BUT hard to add all …
2. Specify using phoneme, BUT hard for humans to write and read
3. Add a textual replacement using sub
4. Provide higher-level information; currently this is only say-as
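The four options listed above correspond to elements that already exist in SSML 1.0. The following is a minimal sketch that puts one of each into a document and lists them with Python's standard `xml.etree.ElementTree`. The lexicon URI, the `ph` string, the sample texts and the `interpret-as` value are illustrative assumptions, not taken from the slides; SSML 1.0 itself does not standardize `say-as` values.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative document: the URI, the ph string and the say-as value are assumptions.
DOCUMENT = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="hu-HU">
  <lexicon uri="http://www.example.com/lexicon.file"/>
  <phoneme alphabet="ipa" ph="ɛɡeːʃːeːɡ">egészség</phoneme>
  <sub alias="World Wide Web Consortium">W3C</sub>
  <say-as interpret-as="date">2003.05.12.</say-as>
</speak>
"""

def pronunciation_aids(ssml_text):
    """Return the local names of the SSML elements directly under <speak>."""
    root = ET.fromstring(ssml_text)
    prefix = "{" + SSML_NS + "}"
    return [child.tag[len(prefix):] for child in root if child.tag.startswith(prefix)]

print(pronunciation_aids(DOCUMENT))
```

Only the `phoneme` line requires the editor to write a phonetic transcription by hand, which is the readability problem noted in option 2.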
Other types of higher-level information (easier, more natural):
- Syllable structure
- Part-of-speech information
- Language of included foreign text
We are going to give you some examples.
Syllable structure
- Hungarian: highly agglutinative
  - pronunciation inference rules are used
  - rules can be tricked by some words
- E.g. "egészség" ("health")
  - Letter combinations might be "s+zs" [ʃ]+[ʒ]→[ʒː]
  - but they are in fact "sz+s" [s]+[ʃ]→[ʃː]
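To illustrate why syllable structure is enough to resolve the "egészség" ambiguity, here is a small sketch. It enumerates every split of a word into single letters and digraphs, then repeats the split syllable by syllable so that no digraph can span a syllable boundary. The function names, the "prefer digraphs inside a syllable" heuristic and the syllabification e-gész-ség passed in by hand are assumptions for illustration, not the authors' algorithm.

```python
# Hungarian digraphs (the trigraph "dzs" is left out to keep the sketch short).
DIGRAPHS = {"cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"}

def segmentations(text):
    """Return every way to split text into single letters and digraphs."""
    if not text:
        return [[]]
    results = [[text[0]] + rest for rest in segmentations(text[1:])]
    if text[:2] in DIGRAPHS:
        results += [[text[:2]] + rest for rest in segmentations(text[2:])]
    return results

def segment_by_syllable(syllables):
    """Segment each syllable on its own, preferring digraphs inside a syllable."""
    graphemes = []
    for syllable in syllables:
        # Fewest graphemes means digraphs are used wherever the syllable allows.
        graphemes += min(segmentations(syllable), key=len)
    return graphemes

# Without syllable information both "s+zs" and "sz+s" readings are possible.
ambiguous = [s for s in segmentations("egészség") if "zs" in s or "sz" in s]
print(ambiguous)

# With the syllable boundary between "gész" and "ség", only "sz+s" survives.
print(segment_by_syllable(["e", "gész", "ség"]))
```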
- Enough to know syllable structure.
- Instead of <phoneme alphabet="ipa" …

Language
- If both lang and ph are given, lang has priority.
- If the language is "x-unknown", LID (language identification) is used.
- We suggest that "x-unknown" can be used with xml:lang also.
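The two language rules above can be read as a small decision procedure. The sketch below is one possible reading. The function names are hypothetical, and `toy_lid` is a deliberately naive stand-in for a real language identifier, which the slides do not describe.

```python
def pronunciation_source(lang=None, ph=None):
    """Decide which attribute drives pronunciation; lang wins when both are given."""
    if lang is not None:
        return "lang"
    if ph is not None:
        return "ph"
    return "default"

def effective_language(lang, text, identify_language):
    """Resolve "x-unknown" by running language identification on the text."""
    if lang == "x-unknown":
        return identify_language(text)
    return lang

def toy_lid(text):
    """Hypothetical LID stub: accented Hungarian vowels imply "hu", otherwise "en"."""
    return "hu" if any(ch in "áéíóöőúüű" for ch in text.lower()) else "en"

print(pronunciation_source(lang="en", ph="daʊnləʊd"))
print(effective_language("x-unknown", "egészség", toy_lid))
print(effective_language("de", "Haus", toy_lid))
```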
Text normalization is effectively assisted by the say-as element.
- The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address.
- We additionally suggest as standard values: acronym/abbreviation, web, e-mail, phone, program-code, table, equation.
- We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children): speaking style
- Differences in prosody can be quantified
- Emotional speech is also in the focus of research
- Modern TTS systems are likely to be able to imitate these to some extent
Speaking style
- Suggested speaking-style attribute
- Can be used where the xml:lang attribute can, i.e. on voice, speak, p, s, w
- Synthesis processors can define their own set of supported speaking styles
- They should support "spelling", which can be viewed as a special reading style
- They may support e.g. "syllabification", "casual", "news reading", "story telling"
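As a rough sketch of how a processor might handle the suggested attribute, the code below parses a fragment and reports whether each requested style is in the processor's own set. The fragment, the supported set and the function name are hypothetical. The slides do not say what a processor should do with an unsupported style, so the sketch only flags it.

```python
import xml.etree.ElementTree as ET

# Styles of one hypothetical processor; "spelling" is the value the proposal
# says every processor should support.
SUPPORTED_STYLES = {"spelling", "news reading"}

# Illustrative fragment using the proposed (non-standard) speaking-style attribute.
FRAGMENT = """\
<p speaking-style="news reading">
  <s>The university is known as</s>
  <s speaking-style="spelling">BME</s>
  <s speaking-style="story telling">Once upon a time.</s>
</p>
"""

def requested_styles(fragment, supported):
    """Return (tag, style, is_supported) for every element carrying the attribute."""
    root = ET.fromstring(fragment)
    report = []
    for element in root.iter():
        style = element.get("speaking-style")
        if style is not None:
            report.append((element.tag, style, style in supported))
    return report

for entry in requested_styles(FRAGMENT, SUPPORTED_STYLES):
    print(entry)
```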
- Suggested emotion attribute
- Mentioned here, although prosody is only one of its aspects
- Complementary to speaking-style, therefore a separate attribute is suggested
- Can be used where the xml:lang attribute can, i.e. on voice, speak, p, s, w
- Possible values: "happiness", "sadness", …
Part-of-speech
- The part-of-speech (POS) of a word may affect emphasis and other aspects of prosody
- Not always possible to determine automatically
- More desirable to specify POS than to …
- "Igaz, hogy jól vagy?" ("Is it true that you are alright?"): "hogy" ("that") is a conjunction, reduced emphasis
- Analytic languages (e.g. English, Chinese)
  - Words are usually short
  - They convey only one portion of the meaning
  - Individual words can be stressed
- Synthetic languages (e.g. Hungarian, Korean)
  - Words are often long
  - Made up of several morphemes and have very complex meanings
  - Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)
Example 1: contrastive sentences
- English: "The book is not in the box, but on the box." The speaker can emphasize one word ("in", "on").
- Hungarian: "Nem a dobozon, hanem a dobozban van a könyv." The speaker sometimes has to emphasize one syllable ("-on", "-ban").
- Stress is expressed mainly by pitch; it may be aided by a short pause, slower rate, higher volume.
Example 2: pitch change on syllable
1. "Elmentek." ("They are gone.") Pitch is continuously falling.
2. "Elmentek?" ("Are they gone?") Pitch rises at the beginning of the second syllable and falls on the third syllable.
[Figure: pitch contours of sentences 1 and 2]
Suggestion for extensions to prosody:
- Stress and prosody can be described on a per-syllable basis
- Extension to prosody: time can be a syllable position
  - decimal fractions can also be used
  - negative values indicate the nth position from the end
  - the special symbol syl_end indicates end of …
Suggestion for optional extensions:
- Some synthesis processors may process pitch-contour (= contour), rate-contour, volume-contour
  - time positions: the same as in contour
  - rate / volume: described as in rate / volume
- emphasis and break extended with a position attribute; the value can be a syllable position. In this case break will not be an empty element.
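The syllable-position values suggested for prosody (positive, fractional and negative) can be turned into time points once syllable boundaries are known. The sketch below shows one possible interpretation. It assumes positions are 1-based, that 2 means the start of the second syllable and 2.5 its midpoint, and that -1 means the start of the last syllable. The boundary times are made up, and syl_end is left out because its definition is cut off in the transcript.

```python
import math

def syllable_offset(position, syllable_count):
    """Map an assumed 1-based syllable position to a 0-based offset in syllables."""
    if position > 0:
        offset = position - 1
    elif position < 0:
        # Negative values count from the end: -1 is the start of the last syllable.
        offset = syllable_count + position
    else:
        raise ValueError("position 0 is not defined in this sketch")
    if not 0 <= offset <= syllable_count:
        raise ValueError("position lies outside the word")
    return offset

def offset_to_time(offset, boundaries_ms):
    """Interpolate a time from syllable boundaries (one more entry than syllables)."""
    index = min(math.floor(offset), len(boundaries_ms) - 2)
    fraction = offset - index
    start, end = boundaries_ms[index], boundaries_ms[index + 1]
    return start + fraction * (end - start)

# "El-men-tek": three syllables with made-up boundary times in milliseconds.
BOUNDARIES_MS = [0, 100, 300, 600]
for position in (2, 2.5, -1):
    offset = syllable_offset(position, 3)
    print(position, offset_to_time(offset, BOUNDARIES_MS))
```

Expressing contour points this way keeps the markup independent of the actual syllable durations, which only the synthesis processor knows.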