Speech Synthesis v2 - Specialty Answering Service

Aug 23, 2018

© Specialty Answering Service. All rights reserved.

Contents

1  Introduction
1.1  Quality of a Speech Synthesizer
1.2  The TTS System
2  History
2.1  Electronic Devices
3  Synthesizer Technologies
3.1  Waveform/Spectral Coding
3.2  Concatenative Synthesis
3.2.1  Unit Selection Synthesis
3.2.2  Diphone Synthesis
3.2.3  Domain-Specific Synthesis
3.3  Formant Synthesis
3.4  Articulatory Synthesis
3.5  HMM-Based Synthesis
3.6  Sine Wave Synthesis
4  Challenges
4.1  Text Normalization Challenges
4.1.1  Homographs
4.1.2  Numbers and Abbreviations
4.2  Text-to-Phoneme Challenges
4.3  Evaluation Challenges
5  Speech Synthesis in Operating Systems
5.1  Atari
5.2  Apple
5.3  AmigaOS
5.4  Microsoft Windows
6  Speech Synthesis Markup Languages
7  Applications
7.1  Contact Centers
7.2  Assistive Technologies
7.3  Gaming and Entertainment
8  References


1 Introduction

The word 'synthesis' is defined by Webster's Dictionary as 'the putting together of parts or elements so as to form a whole'. Speech synthesis generally refers to the artificial generation of the human voice, either in the form of speech or in other forms such as a song. The computer system used for speech synthesis is known as a speech synthesizer. There are several types of speech synthesizers (both hardware based and software based) with different underlying technologies. For example, a TTS (text-to-speech) system converts normal language text into human speech, while other systems can convert phonetic transcriptions into speech. The basic principle behind speech synthesis is known as the source-filter theory of speech production – that is, the information about the voice source and the vocal tract filter can be separated from each other.

Today, speech synthesizers are a common feature in most operating systems. Speech synthesis applications have also made computing-related services more inclusive by allowing access to people with visual impairments or reading disabilities.

1.1 Quality of a Speech Synthesizer

The quality of a speech synthesizer is measured based on two primary factors – its similarity to normal human speech (naturalness) and its intelligibility (ease of understanding by the listener). Ideally, a speech synthesizer should be both natural and intelligible, and speech synthesis systems always attempt to maximize both characteristics.

1.2 The TTS System

A typical text-to-speech system has two parts – a front end and a back end. The front end is responsible for text normalization (the pre-processing part) and text-to-phoneme conversion. Text normalization, or tokenization, is the phase where numbers and abbreviations in the raw text are converted into written words. Text-to-phoneme conversion, or grapheme-to-phoneme conversion, is the process of assigning phonetic transcriptions to each word and dividing the text into prosodic units such as phrases, clauses, and sentences. The output of the front-end system is the symbolic linguistic representation of the text, composed of the phonetic transcriptions along with the prosody information. This output is then passed on to the back-end system, or synthesizer, which converts it into sound. Sophisticated synthesizers also have the option to compute the target prosody – the pitch contours and phoneme durations – which can then be applied to the output speech to make it sound more human.
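The front-end stages described above can be sketched in a few lines. This is a minimal illustration, not a real TTS front end: the lexicon, the number table, and the function names are all invented for the example, and real systems use far richer normalization and grapheme-to-phoneme models.

```python
import re

# Toy lexicon mapping words to phoneme strings (illustrative only).
LEXICON = {"hello": "HH AH L OW", "world": "W ER L D", "two": "T UW"}
NUMBERS = {"2": "two"}

def normalize(text):
    """Text normalization/tokenization: expand digits, lowercase words."""
    tokens = re.findall(r"[a-zA-Z]+|\d+", text)
    return [NUMBERS.get(t, t.lower()) for t in tokens]

def to_phonemes(words):
    """Grapheme-to-phoneme conversion via dictionary lookup."""
    return [LEXICON.get(w, "<unk>") for w in words]

def front_end(text):
    """Produce the symbolic linguistic representation passed to the back end."""
    words = normalize(text)
    return list(zip(words, to_phonemes(words)))

print(front_end("Hello world 2"))
# → [('hello', 'HH AH L OW'), ('world', 'W ER L D'), ('two', 'T UW')]
```

In a full system the tuples would also carry prosody information (pitch targets, durations) for the back end to realize as sound.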


2 History

Early speech synthesis efforts revolved around attempts to create machines that could generate human speech, long before electronic signal processing was invented. Over the years, an ever-increasing understanding of the mechanism of voice has led to significant developments in the field of speech synthesis.

One of the earliest successful attempts to reproduce human speech was in 1779, when the German-Danish scientist Christian Kratzenstein developed models of the human vocal tract that could produce the vowel sounds of the English language ('a, e, i, o, u') using variously shaped tubes. The 'acoustic-mechanical speech machine' developed by Wolfgang von Kempelen in 1791 was an improvement over this and managed to produce consonants as well as vowel sounds.

One of the earliest manually operated electrical apparatuses for speech synthesis was constructed by J.Q. Stewart in 1922. In 1924, a similar machine was also demonstrated by Dr. Harvey Fletcher of Bell Labs, who managed to produce a limited vocabulary of sounds, including vowels and words such as 'mamma' and 'papa'.

The Voder was the first major attempt to encode speech electronically. Work on the Voder started as early as October 1928 at the AT&T Bell Laboratories. The Voder was based on a source-filter model for speech that included a non-parametric spectral model of the vocal tract produced by the output of a fixed bandpass-filter bank. In 1948, Werner Meyer-Eppler recognized the capability of the Voder machine to generate electronic music, as described in Dudley's patent. In the 1950s, Munson and Montgomery attempted a parametric spectral model for speech synthesis.

Simultaneously, Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories developed the Pattern Playback device, which converted a spectrogram plot of acoustic patterns back into artificial speech.

In the 80s and 90s, there was a lot of research on speech synthesis at MIT and Bell Labs. The Bell Labs system was the first language-independent system that made use of natural language processing techniques for speech synthesis.

2.1 Electronic Devices

The earliest computer-based speech synthesis systems were created in the late 50s. Noriko Umeda et al. developed the first general-purpose English TTS system in 1968 at the Electrotechnical Laboratory, Japan. In 1961, physicist John Larry Kelly, Jr. and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, a prominent event in the history of speech-related research at Bell Labs. Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.

Handheld devices featuring speech synthesis emerged in the 70s. Early attempts at using speech synthesis in devices were guided by the need for inclusive design; the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind, developed in 1976, was probably the first device to have embedded speech synthesis. Other early uses for embedded speech synthesis included educational toys such as Speak & Spell, produced by Texas Instruments in 1978, and games such as the speaking version of Fidelity's electronic chess computer in 1979. The first video game to feature speech synthesis was the 1980 arcade game Stratovox from Sun Electronics, followed by the arcade version of Berzerk. The first multi-player electronic game using voice synthesis was Milton from the Milton Bradley Company, released the same year.

In the early stages, electronic speech synthesizers produced robotic sounds that were hardly intelligible. Speech synthesis systems have improved considerably since then, but a discerning listener can still distinguish the output of a speech synthesizer from actual human speech.

As hardware and storage costs fall and the computational power of devices increases, speech synthesizers will play a key role in inclusive design, making computing applications accessible to a larger population of people.


3 Synthesizer Technologies

Synthesized speech can be created by joining together speech elements that are stored in a database. These speech elements can be as granular as phones and diphones, or as large as entire sentences. The output range and clarity of a speech synthesis system are inversely proportional to each other and depend on the size of the stored speech elements. Typically, a speech synthesizer that makes use of stored phonetic elements can give a large output range, but with less clarity. Speech synthesizers used for specific domains, such as a medical call center or a legal firm, are more likely to use an entire vocabulary of specific words and sentences to improve output quality. However, the number of sentences that such a speech synthesizer can produce is limited to the extent of its stored vocabulary. On the other end of the spectrum, there are also true voice synthesizers that can create artificial speech from scratch using a model of the vocal tract, reproducing the characteristics of the human voice.

Synthesizer technologies used to generate artificial speech can be broadly classified into two groups – concatenative synthesis and formant synthesis. Both of these technologies have their advantages and disadvantages, and the choice of technology in any speech synthesis system is usually based on the intended end use.

3.1 Waveform/Spectral Coding

Waveform coding refers to the modification of a sound wave by analyzing its characteristics, compressing it, and then restoring it back to its original form. This type of synthesis works only on the acoustic waveform and its spectral characteristics, and is not concerned with the physical speech production system. It can be applied in areas such as telephonic transmission: human speech is analyzed and compressed into packets that are then transmitted and decoded at the receiving end to synthesize speech for the listener. Other applications of this form of speech synthesis include voice transformation (e.g. alteration of the pitch or duration of the voice) and the creation of speech databases for concatenative synthesis. Popular waveform coding techniques include LPC (linear predictive coding) and PSOLA (pitch-synchronous overlap-add).
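To make the LPC idea concrete, here is a minimal sketch of the classic autocorrelation method with the Levinson-Durbin recursion, which finds coefficients that predict each sample from the previous ones. This is a textbook toy on a pure sinusoid, not production speech-coding code; real LPC analysis adds windowing, pre-emphasis, and higher model orders.

```python
import math

def autocorrelation(x, order):
    """Autocorrelation lags r[0..order] of signal x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for LPC coefficients a[1..order]."""
    a = [0.0] * (order + 1)
    e = r[0]                                  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e                           # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]    # update earlier coefficients
        a = new_a
        e *= (1.0 - k * k)                    # shrink the residual error
    return a[1:], e

# A pure sinusoid is almost perfectly predictable by a 2nd-order model.
signal = [math.sin(2 * math.pi * 0.05 * t) for t in range(200)]
r = autocorrelation(signal, 2)
coeffs, err = levinson_durbin(r, 2)
```

The residual error `err` ends up tiny relative to the signal energy `r[0]`, which is exactly why LPC compresses voiced speech well: only the coefficients and a small residual need to be transmitted.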

3.2 Concatenative Synthesis

Concatenative synthesis, as the name suggests, is based on the concatenation (or stringing together) of pre-recorded speech segments. As a result, it can produce the most natural-sounding artificial speech. However, the automated technique used for segmenting the recorded waveforms can fail to reproduce the variations in speech and inflection that are part of normal speech, leading to audible glitches in the output. There are three main sub-types of concatenative synthesis.

3.2.1 Unit Selection Synthesis

Unit selection synthesis is done using speech databases, which are large storehouses of recorded speech. The database is created to store speech units – each recorded utterance is segmented into individual phones, diphones, half-phones, syllables, morphemes, words, phrases, or sentences with the help of a customized speech recognizer. Typically, a manual correction of the segments is then performed based on a visual representation of the recorded speech, such as the waveform and spectrogram. Each speech segment also includes other acoustic parameters such as the pitch, duration, position in the syllable, and the neighboring phones. During synthesis, the target utterance is created using a weighted decision tree algorithm that selects the best possible candidate units from the database and strings them together.

Creating artificial speech using unit selection provides the maximum amount of naturalness in the output, since it applies only a minimal amount of digital signal processing (DSP) to the recorded human speech. The more DSP is applied, the less natural the speech will be. However, most systems use at least a minimal amount of signal processing, especially to smooth out the waveform at the points where speech units are concatenated. High-end unit selection systems are so good that their output is often indistinguishable from real human voices, especially in text-to-speech systems that have been specifically created and fine-tuned to handle pre-defined scenarios. The flip side is that, in order to produce the best possible output, unit selection systems need huge speech databases, often running into gigabytes of recorded data. It is also time consuming to create these databases, as they often represent several hours of recorded speech. The other negative aspect of these systems is the selection algorithm, which may pick a less-than-ideal unit even when better choices are available in the database. Newer systems rely on automated methods to identify unnatural-sounding segments in the synthesized speech so that a correction can be applied.
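The selection step is usually framed as a cheapest-path search: each candidate unit carries a target cost (how well it matches the wanted unit) and a join cost (how badly it clashes acoustically with its neighbor). The following toy sketch uses a Viterbi-style dynamic program over invented candidates with a single made-up "pitch" feature for the join cost; real systems weigh many spectral and prosodic features.

```python
# Each column holds the database candidates for one target unit, as
# (unit_id, target_cost, pitch) tuples; pitch is a toy acoustic feature
# used to compute the join cost at each concatenation point.
candidates = [
    [("a1", 0.2, 100), ("a2", 0.6, 140)],
    [("b1", 0.1, 135), ("b2", 0.4, 105)],
    [("c1", 0.3, 110), ("c2", 0.2, 138)],
]

def join_cost(prev, cur):
    """Penalize acoustic discontinuity (here: pitch mismatch) at the seam."""
    return abs(prev[2] - cur[2]) / 100.0

def select_units(columns):
    """Viterbi-style search for the cheapest candidate sequence."""
    # Each state is (accumulated_cost, path_so_far, candidate_tuple).
    states = [(c[1], [c[0]], c) for c in columns[0]]
    for col in columns[1:]:
        states = [
            min((pc + cur[1] + join_cost(prev, cur), path + [cur[0]], cur)
                for (pc, path, prev) in states)
            for cur in col
        ]
    cost, path, _ = min(states)
    return cost, path

cost, path = select_units(candidates)
print(path)  # → ['a1', 'b1', 'c2']
```

Note how the winning path accepts a slightly worse target cost at the last unit ("c2") because its pitch joins more smoothly with "b1" – exactly the trade-off the weighted selection algorithm in the text is making.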

3.2.2 Diphone Synthesis

This method uses a comparatively smaller database than unit selection systems. The database consists of all the sound-to-sound transitions, or diphones, in the language being modeled. The number of diphones, and consequently the size of the database, varies with the phonotactics of the language – Spanish, for example, has only around 800 diphones, whereas German has around three times that number, about 2,500. One sample of each diphone is stored in the database, and run-time speech generation is done by superimposing the target prosody of the sentence on these diphone units with the help of DSP techniques such as PSOLA, MBROLA, or linear predictive coding.

The negatives of this technique include the sonic glitches caused by the concatenation of very small units, and voice output that generally sounds more robotic than natural. As storage prices drop, this technique's key advantage – its small database size – is no longer a differentiator for commercial applications, and hence it is used mostly in free speech synthesis software rather than for commercial purposes.
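The seam-smoothing problem at each diphone boundary can be illustrated with a simple linear crossfade (overlap-add). This is only a stand-in for the PSOLA-style processing named above – PSOLA additionally aligns the overlap to pitch periods – and the two sine-burst "diphones" are invented for the example.

```python
import math

def crossfade_concat(a, b, overlap):
    """Concatenate two sample lists, linearly crossfading over `overlap`
    samples to smooth the seam between units."""
    out = a[:-overlap]
    for i in range(overlap):
        w = i / overlap                      # fade weight rises 0 -> 1
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

# Two toy "diphone" waveforms: short sine bursts at different frequencies.
d1 = [math.sin(2 * math.pi * 0.02 * t) for t in range(100)]
d2 = [math.sin(2 * math.pi * 0.03 * t) for t in range(100)]
joined = crossfade_concat(d1, d2, overlap=20)
```

Without the crossfade, the abrupt jump at the junction is exactly the kind of "sonic glitch" the paragraph above describes.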

3.2.3 Domain-Specific Synthesis

Domain-specific synthesis is used for specialized domains where the output vocabulary is limited – such as a medical call center, a weather report, or a railway announcement. Here, pre-recorded words and phrases are combined to create specific sentences. This is one of the easiest technologies to implement in the area of speech synthesis and as a result has found several commercial uses, including in devices such as calculators and clocks. The biggest advantage of these systems is the high level of naturalness due to the large unit size, the limited vocabulary needed, and the small set of contexts, which make it easy to match prosody and intonation.

On the other hand, these systems cannot be used as general-purpose speech synthesizers because of the limited vocabulary stored in their databases. The other problem with this technique is the context-sensitive pronunciation of the same word, which is difficult to reproduce without additional logic being built in. For example, in French many final consonants that are usually silent are pronounced when followed by a word that begins with a vowel. Similarly, in non-rhotic dialects of English a final 'r' sound is pronounced only when the following word begins with a vowel.

3.3 Formant Synthesis

Formant synthesis is another technique of speech synthesis that is not based on a physical model of the vocal folds and vocal tract. The formant synthesizer takes voice source parameters, such as the fundamental frequency and amplitude of the voiced sound, and attempts to reproduce them. The effects of the vocal tract are specified as formant frequencies and their respective bandwidths. Thus, the formant synthesizer attempts to replicate voiced sound by combining sound sources and filters, shaping the result into vowels and consonants by applying the vocal tract resonances to the voiced sound.

Unlike concatenative synthesis, formant synthesis techniques do not use human speech units to create speech. Instead, the speech output is generated using additive synthesis and an acoustic model. Artificial speech waveforms are generated by varying speech parameters such as fundamental frequency, voicing, and noise levels over time, based on pre-defined rules. As a result, this technique is also referred to as rule-based synthesis. Most systems based on this technique generate robotic-sounding speech, so naturalness is very low for this kind of speech synthesis. Despite this, formant synthesis has several advantages over concatenative systems.
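The source-filter recipe behind formant synthesis can be sketched as an impulse-train glottal source passed through a cascade of resonant filters, one per formant. This is a deliberately simplified toy: the two-pole resonator and the approximate /a/-vowel formant values (about 700, 1220, and 2600 Hz) are textbook idealizations, not any particular synthesizer's implementation.

```python
import math

FS = 8000  # sample rate in Hz

def resonator(x, freq, bw):
    """Two-pole filter approximating one formant resonance."""
    r = math.exp(-math.pi * bw / FS)            # pole radius from bandwidth
    c1 = 2 * r * math.cos(2 * math.pi * freq / FS)
    c2 = -r * r
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = s + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def impulse_train(f0, n):
    """Crude glottal source: one impulse per fundamental period."""
    period = int(FS / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

# Cascade three formant filters over a 120 Hz source for an /a/-like vowel.
source = impulse_train(f0=120, n=4000)
speech = source
for f, bw in [(700, 130), (1220, 70), (2600, 160)]:
    speech = resonator(speech, f, bw)
```

Changing the three formant frequencies over time is what turns such a synthesizer from a buzzer into something that can articulate different vowels – which is why rule bases controlling these trajectories are the heart of a formant system.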

In a formant synthesizer, the user can control all the acoustic characteristics observed in natural speech, such as the pitch and formants, by plotting and editing them in a spectrographic display.

The key advantage is that formant synthesis is more intelligible than concatenative systems, especially at high speeds, making it useful for applications such as screen readers for visually impaired people, where the purpose is not to replicate human speech. The other big advantage is the limited storage space required, as no speech sample database is used. This makes it an ideal choice for speech synthesis in embedded systems, where limited memory and microprocessor power have to be optimized. It is also used in experiments to study the perceptual relevance of acoustic properties. These systems are also much more flexible, as all aspects of the output speech can be controlled through a rule base, allowing a wide variety of speech outputs. Such systems can easily model a wide variety of prosodies and intonations, making it easy to reproduce statements and questions and to introduce emotions and different voice tones.


In the late 1970s, formant synthesis was used in Speak & Spell, a Texas Instruments gadget in which a very high level of intonation control was achieved. It was also used in gaming and entertainment systems such as the Sega arcade machines and in many Atari, Inc. arcade games using the TMS5220 LPC chips. Formant synthesis is also used in TTS applications such as DECtalk.

3.4 Articulatory Synthesis

Articulatory synthesis systems use computational techniques to synthesize speech based on models of the human vocal tract and the articulation processes occurring there. Articulatory synthesis systems are mostly used for laboratory experiments. The first such synthesizer, known as ASY, was developed in the 70s at Haskins Laboratories by Philip Rubin, Tom Baer, and Paul Mermelstein. It was based on vocal tract models developed earlier at Bell Laboratories by Paul Mermelstein and Cecil Coker.

Although articulatory synthesis models are typically not used in commercial applications, a notable exception is the NeXT-based system. Marketed in the 90s, it provided articulatory-synthesis-based text-to-speech conversion using a transmission line model of the human oral and nasal tracts.

Articulatory synthesis systems are based on an attempt to recreate the physiology of the speech production system – the synthesizer gives the user control over the positions, movements, and other characteristics of speech-producing organs such as the tongue, jaw, lips, vocal cords, and even the respiratory system. There are several sub-systems under this umbrella, and they vary drastically in terms of the synthesizer model as well as the practical applications.

Research in this field is also diverse, with some researchers focusing on specific subsystems of the speech system – attempting to create complex mathematical models of specific organs such as the tongue or the jaw. In such research, the focus is not on the re-creation of speech, but on the mechanical movement patterns of the speech organs. Other researchers focus on creating a simple mathematical model of all the articulators – attempting to understand how articulation affects various acoustic properties. Yet another method of synthesis involves directly specifying the vocal tract shape, avoiding the need to describe the actual articulators.

Theoretically, the complex subsystem models could be combined to form a comprehensive model of the speech production system, one that can simulate articulator movement as well as the resulting acoustics. Such a 'virtual human speaker' would also be a great research tool for testing many theories about speech production and understanding.

Researchers believe that articulatory synthesis may someday be able to create the most natural-sounding synthetic speech, as it can incorporate all the physical aspects of the speech production system. The key roadblock to obtaining natural-sounding speech with this technique is the lack of understanding of the timing of the articulators and how they coordinate with each other during real-life speech production. As a result of this timing problem, speech produced by articulatory synthesizers is often unintelligible and unnatural.


3.5 HMM-Based Synthesis

Hidden Markov Model (HMM) based synthesis uses statistical parametric techniques to generate the speech output. The frequency spectrum, the fundamental frequency, and the duration of speech are modeled simultaneously by HMMs to capture the characteristics of the vocal tract, the vocal source, and the prosody of human speech. Using a maximum likelihood algorithm, speech waveforms are generated directly from the HMMs.
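As a rough intuition for "statistical parametric" generation, the toy below stores per-state Gaussian statistics for the fundamental frequency plus a mean duration, and emits an F0 trajectory by visiting each state in turn. The states and their numbers are invented, and real HMM synthesizers model the full spectrum as well and use maximum-likelihood parameter generation rather than random sampling.

```python
import random

# Toy states: Gaussian F0 statistics plus a mean duration (in frames).
STATES = [
    {"dur": 5, "f0_mean": 120.0, "f0_std": 8.0},   # e.g. syllable onset
    {"dur": 8, "f0_mean": 180.0, "f0_std": 10.0},  # e.g. accented vowel
]

def generate_f0_track(states, seed=0):
    """Emit an F0 value per frame, visiting each state for its mean duration."""
    rng = random.Random(seed)
    track = []
    for st in states:
        for _ in range(st["dur"]):
            track.append(rng.gauss(st["f0_mean"], st["f0_std"]))
    return track

track = generate_f0_track(STATES)
```

The generated parameter track would then drive a vocoder that renders the actual waveform, which is the step the maximum likelihood algorithm in the text performs statistically.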

3.6 Sine Wave Synthesis

In this model, speech synthesis is done by replacing the main energy bands, or formants, with pure tone whistles.
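That one-line description translates almost directly into code: one sinusoid ("whistle") per formant, each following its formant's frequency track, summed together. The constant frequency tracks and amplitudes below are invented for illustration; real sine wave speech uses time-varying tracks measured from an utterance.

```python
import math

FS = 8000  # sample rate in Hz

def sine_wave_speech(formant_tracks, n):
    """Sum one pure tone per formant. Each track is a (frequency_fn, amp)
    pair, where frequency_fn(i) gives that formant's frequency at sample i."""
    out = []
    phases = [0.0] * len(formant_tracks)
    for i in range(n):
        s = 0.0
        for k, (freq_fn, amp) in enumerate(formant_tracks):
            phases[k] += 2 * math.pi * freq_fn(i) / FS  # advance the phase
            s += amp * math.sin(phases[k])
        out.append(s)
    return out

# Three whistles at fixed, roughly /a/-like formant frequencies.
tracks = [(lambda i: 700, 1.0), (lambda i: 1220, 0.5), (lambda i: 2600, 0.25)]
audio = sine_wave_speech(tracks, 1600)
```

The result sounds nothing like a natural voice, yet listeners can often still understand sine wave speech, which is why it is a popular tool in speech perception research.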


4 Challenges

There are several challenges faced by speech synthesis techniques at various stages of the process. Some of these are enumerated below.

4.1 Text Normalization Challenges

4.1.1 Homographs

Texts are typically full of numbers, heteronyms, and abbreviations that all need to be expanded into a phonetic representation. At the text normalization stage, a key challenge is posed by words that have the same spelling but different pronunciations depending on the context. Generating semantic representations of input texts is computationally expensive, and the results are often unreliable. Hence, most text-to-speech systems rely on heuristic techniques to identify the right pronunciation for homographs based on the neighboring words, with the help of statistics such as frequency of occurrence, or Hidden Markov Models (HMMs). HMMs generate 'parts of speech' to help disambiguate homographs, and they have been quite successful in many cases, with an error rate of less than five percent. For example, they can help decipher whether 'read' appears in the present tense or past tense in a sentence, and thus vary the pronunciation between 'reed' and 'red'. Hidden Markov Models can be used effectively with most European languages, although many of these languages lack well-defined training corpora.
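A stripped-down version of the neighboring-word heuristic for the 'read' example might look like this. The cue lists and function name are invented for illustration; a real system would use statistical part-of-speech tagging rather than hand-written word lists.

```python
# Toy heuristic: pick a pronunciation of "read" from nearby words.
PAST_CUES = {"had", "has", "have", "was", "already", "yesterday"}
PRESENT_CUES = {"will", "to", "can", "must", "now"}

def pronounce_read(sentence):
    """Return 'red' (past tense) or 'reed' (present tense) for 'read'."""
    words = sentence.lower().split()
    i = words.index("read")
    context = set(words[max(0, i - 2):i])   # look at the two preceding words
    if context & PAST_CUES:
        return "red"
    if context & PRESENT_CUES:
        return "reed"
    return "reed"   # default when the context is uninformative

print(pronounce_read("She had read the book"))   # → red
print(pronounce_read("I will read the book"))    # → reed
```

An HMM tagger generalizes this: instead of fixed cue lists, it learns from a corpus how likely each part-of-speech sequence is, which is what pushes error rates below five percent.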

4.1.2 Numbers and Abbreviations

Converting numbers to speech is another challenge that TTS systems face. Although converting a number to words is a straightforward programming problem, the challenge lies in identifying the context-sensitive reading of a number. For example, 1234 can be read as "one two three four", "twelve thirty four", or "one thousand two hundred and thirty four" depending on the context. Roman numerals face the same challenge – Edward VII is read as 'Edward the Seventh', while Chapter VII is read as 'Chapter Seven'.

TTS systems often identify how to expand a number based on the neighboring words and numbers, as well as the punctuation used in the sentence.

Abbreviations can also be challenging. The abbreviation 'in' for inches has to be distinguished from the word 'in'. Similarly, 'St' can be expanded as 'Street' or 'Saint' depending on the context. Some TTS systems with built-in artificial intelligence make educated guesses about ambiguous abbreviations at the front end, while others produce nonsensical, out-of-context output.
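The number and abbreviation cases above can be sketched together. The context tags, the capitalization rule for 'St', and the helper names are all invented for this example; real normalizers use much richer context models.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digit(n):
    """Spell out 0-99."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def expand_number(n, context):
    """Pick a reading style for a four-digit number based on context."""
    if context == "year":        # 1234 -> "twelve thirty four"
        return two_digit(n // 100) + " " + two_digit(n % 100)
    if context == "digits":      # 1234 -> "one two three four"
        return " ".join(ONES[int(d)] for d in str(n))
    # cardinal reading: 1234 -> "one thousand two hundred and thirty four"
    return (ONES[n // 1000] + " thousand " + ONES[(n // 100) % 10] +
            " hundred and " + two_digit(n % 100))

def expand_st(next_word):
    """'St' reads as 'Saint' before a capitalized name, otherwise 'Street'."""
    if next_word and next_word[0].isupper():
        return "Saint"
    return "Street"

print(expand_number(1234, "year"))  # → twelve thirty four
```

The hard part in practice is not the spelling-out but choosing the `context` tag, which is what the neighboring-word and punctuation heuristics in the text are for.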

4.2 Text-to-Phoneme Challenges

Text-to-phoneme conversion is the process of arriving at the pronunciation of a word based on its spelling. There are two basic techniques for text-to-phoneme conversion in speech synthesis systems – the dictionary-based approach and the rule-based approach. In the dictionary-based approach, a large database containing words and their pronunciations is stored in the program, and text-to-phoneme conversion is performed by a simple program that looks up each word in the database and replaces the text with its pronunciation. The rule-based approach makes use of phonetic rules, which are applied to words to arrive at their pronunciations. This is similar to the phonetic 'sounding out' approach used to teach children to read.

Both of these approaches have pros and cons. The dictionary-based approach is fast and accurate for words in its vocabulary; however, it fails for any word not present in the database. The other negative is the large memory required for the dictionary database. Rule-based systems are generally smaller and have the advantage of working with any input. However, the complexity of the rules can make the system programmatically very complex, especially when irregular pronunciations and spellings have to be handled.

Most commercial speech synthesis systems use a combination of the dictionary-based and rule-based approaches to incorporate the best of both worlds. The choice of technique also depends on the type of language being modeled. Languages with a phonemic orthography (where pronunciation closely matches spelling) lend themselves well to rule-based methods; speech synthesis systems for such languages use rule-based methods extensively, with just a small supplementary dictionary for foreign names and other borrowed words whose pronunciations differ significantly from their spellings. English, however, has a very irregular spelling system, so English speech synthesis systems rely more on dictionaries and use rule-based engines to handle exceptions and words that are not present in the dictionary database.
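The combined strategy can be sketched as a dictionary lookup with a letter-to-sound fallback. The tiny dictionary, the handful of rules, and the phone symbols below are invented toys (loosely ARPAbet-like); real systems use pronunciation dictionaries with hundreds of thousands of entries and statistically learned rules.

```python
# Dictionary of known (including irregular) pronunciations.
DICT = {"cat": "K AE T", "though": "DH OW"}

# Longest-match-first letter-to-sound rules; wildly incomplete on purpose.
RULES = [("ch", "CH"), ("sh", "SH"), ("th", "TH"), ("a", "AE"), ("c", "K"),
         ("e", "EH"), ("h", "HH"), ("i", "IH"), ("o", "AA"), ("s", "S"),
         ("t", "T")]

def letter_to_sound(word):
    """Apply the rules left to right, preferring longer grapheme matches."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1          # skip letters the toy rules do not cover
    return " ".join(phones)

def g2p(word):
    """Dictionary lookup first; fall back to rules for unknown words."""
    return DICT.get(word, letter_to_sound(word))

print(g2p("though"))  # dictionary handles the irregular spelling → DH OW
print(g2p("chat"))    # unknown word falls through to the rules → CH AE T
```

Note that the rules alone would mangle "though" – exactly the kind of irregular English spelling that pushes English systems toward the dictionary-first design described above.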

4.3 Evaluation Challenges

Challenges also exist in evaluating speech synthesizers, and it is not always easy to say that one system is better than another. This is primarily because there are no universally agreed evaluation criteria for speech synthesizer performance. In addition, different applications use largely different vocabularies, and the perceived quality of the speech synthesis output depends on the listener. The equipment used to record and reproduce speech also affects the quality of the generated output. Today, however, most speech synthesis systems are evaluated using common datasets.


5 Speech Synthesis in Operating Systems

5.1 Atari

The first integration of speech synthesis with an operating system was done in the 1400XL/1450XL personal computers designed by Atari. These machines used a Finite State Machine to enable text-to-speech synthesis of the English language. However, they were not a commercial success.

5.2 Apple

Apple came out with the first commercially successful operating system with an integrated speech synthesis system – the MacInTalk software. The speech synthesis software was licensed from Joseph Katz and Mark Barton and featured in Mac computers from the 80s onward. By the early 90s, Apple provided system-wide TTS support. As the PC revolution progressed, Apple started including high-quality voice sampling and speech recognition in its systems, providing full speech capabilities. Although it started as a curiosity, the speech system of the Apple Macintosh has evolved into a complete program in its own right – PlainTalk. Users even have the ability to select from a wide range of voices.

The Apple iOS operating system used on Apple devices such as the iPhone, iPad, and iPod provides accessibility enhancement using VoiceOver speech synthesis. Today, many third-party apps on the iPad and iPhone also provide speech synthesis to enable web page browsing or translating text from a foreign language.

5.3 AmigaOS

AmigaOS, introduced in 1985, also had advanced speech synthesis capabilities, made available with the help of the Amiga hardware audio chipset. The speech synthesis software for this OS was licensed from SoftVoice, Inc., the developers of the original MacInTalk text-to-speech system. It had advanced features such as male and female voices and 'stress' indicator markers. The speech synthesis system was divided into a narrator device and a translator library. AmigaOS designed speech synthesis as a virtual hardware device, so the user could even redirect console output to it.

5.4 Microsoft Windows

In Windows, speech synthesis and speech recognition are supported using the SAPI 4 and SAPI 5 components. Windows 2000 included Narrator, a text-to-speech utility for the visually impaired. Windows does not provide system-wide text-to-speech capabilities. Some programs can use speech synthesis directly, while others rely on plug-ins, extensions, or add-ons to read out text.

Microsoft Speech Server is a server-based voice package with speech synthesis and speech recognition capabilities. It is aimed at network use with web applications and for use in call centers.


Today, there are a number of applications, browser plug-ins, and gadgets such as e-book readers (Amazon Kindle, Samsung E6), GPS navigation systems (Garmin, TomTom), and even mobile phones (iPhone, Samsung Galaxy) that incorporate speech synthesis. Online RSS narrators can narrate RSS feeds, enabling users to listen to their RSS feeds while on the move.

Several non-profit projects revolving around accessibility have speech synthesis incorporated within their offerings. The Pediaphon project, created in 2006, provides a web-based TTS interface to Wikipedia. Other web-based assistive technology products include Browsealoud and ReadSpeaker, both of which deliver TTS functionality over the web.


6 Speech Synthesis Markup Languages

Several markup languages can be used for speech synthesis. These XML-compliant formats include Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004, as well as older markup languages such as JSML (Java Speech Markup Language) and SABLE. Although each of these markup languages was proposed as a standard when it was initially introduced, none of them has found widespread acceptance. Speech synthesis markup languages are different from dialogue markup languages such as VoiceXML, which include tags for speech recognition, dialogue management, and touch-tone dialing.
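As an illustration, a minimal SSML document might look like the following sketch. The element names (`speak`, `prosody`, `break`, `say-as`) follow the W3C SSML 1.0 recommendation; the text content is invented for this example, and a real synthesizer may support only a subset of these attributes.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- prosody adjusts rate and pitch of the enclosed text -->
    <prosody rate="slow" pitch="+5%">Welcome to our service.</prosody>
    <!-- an explicit 500 ms pause between sentences -->
    <break time="500ms"/>
    <!-- say-as forces character-by-character reading -->
    Your code is <say-as interpret-as="characters">4Z7</say-as>.
  </p>
</speak>
```

A dialogue language such as VoiceXML would wrap fragments like this inside larger constructs for prompts, grammars, and call flow, which is precisely the distinction drawn above.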


7 Applications

Speech synthesis finds applications in three broad areas: contact centers, assistive technologies for people with disabilities, and gaming.

7.1 Contact Centers

Speech synthesis combined with speech recognition has also improved the interaction between the user and mobile devices with the help of natural language processing interfaces.

7.2 Assistive Technologies

Speech synthesis solutions have played a vital role in assistive technologies by removing environmental barriers for people with different disabilities. One of the most common applications is the use of speech synthesis technology in screen readers for people with visual disabilities. Speech synthesis based tools also find applications in helping people with learning disabilities such as dyslexia. They are also used to engage pre-literate children in multimedia applications. Solutions for speech-impaired people also include a speech synthesis module combined with dedicated voice output hardware. Text-to-speech solutions for disabled people also find applications in public transport systems and in other areas, integrating them into daily life.

7.3 Gaming and Entertainment

Speech synthesis techniques are also used extensively in games and animated movies. Key software developed specifically for the entertainment industry includes the application package created in 2007 by Animo Limited, based on its speech synthesis software, FineSpeech. This application package is specifically geared toward use in the entertainment industry, with capabilities to generate narration and dialogue based on specific user inputs.

