VoiceXML: SSML (Speech Synthesis Markup Language) Recorded speech and audio

Jan 03, 2016

Prudence French

VoiceXML:

SSML (Speech Synthesis Markup Language)

Recorded speech and audio

Acknowledgements

Prof. McTear, Natural Language Processing, http://www.infj.ulst.ac.uk/nlp/index.html, University of Ulster.

Overview

Speech Synthesis Markup Language (SSML)

Phases of text-to-speech synthesis:
    Structure analysis
    Text normalisation
    Text to phoneme conversion
    Prosody analysis
    Waveform production

Recorded speech

SSML

Speech Synthesis Markup Language enables developers to override the defaults that a text-to-speech (TTS) engine applies at each stage of synthesis.

Stages:
    Structure analysis
    Text normalisation
    Text to phoneme conversion
    Prosody analysis
    Waveform production
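
As a rough orientation, the sketch below shows one SSML element for each of the five stages in a single standalone document. It assumes an SSML 1.0 processor; the sentences are invented for illustration, the IPA transcription is a conventional dictionary form rather than anything taken from these slides, and individual platforms may ignore some of the elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Structure analysis: explicit paragraph and sentence boundaries -->
  <p>
    <s>
      <!-- Text normalisation: say "doctor" rather than guessing at "Dr." -->
      <sub alias="doctor">Dr.</sub> Smith plays
      <!-- Text to phoneme conversion: force one pronunciation of "bass" -->
      <phoneme alphabet="ipa" ph="be&#x26A;s">bass</phoneme> guitar.
    </s>
    <s>
      <!-- Prosody analysis: a short pause, then an emphasised word -->
      He is <break time="250ms"/> <emphasis level="strong">very</emphasis> good.
    </s>
  </p>
  <!-- Waveform production: switch to a different voice -->
  <voice gender="female">Thank you for listening.</voice>
</speak>
```

Inside a VoiceXML document the same elements would normally appear within a <prompt> rather than a <speak> root.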

Structure Analysis

Division of text into basic elements, e.g. sentence and paragraph, to support more natural phrasing:
    <s> - sentence
    <p> - paragraph

Structure is inferred from punctuation and formatting, but abbreviations such as "Dr." and "lb." make full stops ambiguous:

Dr. Lewis works at the clinic on Sunset Dr. in western Portland.

Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.

Marking the sentences explicitly removes the ambiguity:

<p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
</p>

Text Normalisation

Annotation of text so that it is spoken correctly

Ambiguous examples:
    1/2 - may be spoken as “half,” “January second,” “February first,” or “one of two.”
    Dr. - may be ‘doctor’ or ‘drive’, e.g. “Dr. John Dr.” is rewritten as “Doctor John Drive.”
    St. - may be ‘saint’ or ‘street’, e.g. “St. John St.” is rewritten as “Saint John Street.”

Acronyms: some, e.g. ACM or IEEE, should be spelled out; others are pronounced as words, e.g. RAM, ROM.

Email addresses: e.g. [email protected]
    First part: “Cat Azman,” “C.A.Tazman,” or “C. Atazman”?
    Last part: “Bee dot com” or “B.E.E. dot com”?

<sub>

New in VoiceXML 2.0 (Speech Synthesis Markup).

Syntax

<sub alias="substituteText"> OriginalText </sub>

Description

Language element whose alias attribute provides substitute text to be spoken instead of the contained text. This allows the document to contain both a written and a spoken form for a string.

<sub>

Abbreviations only:

<sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub>

He weighs 214 <sub alias="pounds">lb.</sub>

He plays bass guitar.

He also likes to fish; last week he caught a 20 <sub alias="pound">lb.</sub> bass.

Abbreviations and numbers:

<sub alias="doctor">Dr.</sub> Smith lives at <sub alias="two fourteen">214</sub> Elm <sub alias="drive">Dr.</sub>

He weighs <sub alias="two hundred and fourteen">214</sub> <sub alias="pounds">lb.</sub>

He plays bass guitar.

He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> bass.

<say-as>

Speaks the enclosed text in the given style. Implemented (with limitations) on some platforms.

Example: numbers. The contained text can be interpreted as a number. The allowed number formats are ordinal, cardinal, and digits.

<say-as type="number:ordinal">12</say-as> is spoken as "twelfth".

<say-as type="number:digits">12</say-as> is spoken as "one two".

Other types: acronyms, currency, time, date, duration, measures, telephone, spell-out, names, and net.

BeVocal provides a set of extended tags for items such as: airline, equity, street, city, state, citystate, address.
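
The type attribute shown above comes from earlier drafts of the markup. In the SSML 1.0 Recommendation, <say-as> instead takes a required interpret-as attribute (plus optional format and detail attributes), and the Recommendation itself does not fix the set of accepted values. The sketch below therefore assumes a platform that understands the values "ordinal" and "digits"; check the platform documentation before relying on either.

```xml
<prompt>
  <!-- Assumed value "ordinal": intended to be spoken as "twelfth" -->
  You are caller number <say-as interpret-as="ordinal">12</say-as>.
  <!-- Assumed value "digits": intended to be spoken as "one two" -->
  Your code is <say-as interpret-as="digits">12</say-as>.
</prompt>
```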

Text to phoneme conversion

Specify the pronunciation of words that are difficult to pronounce or that have more than one pronunciation, e.g.
    read = ‘reed’ / ‘red’
    wind: Wind the watch when you face into the wind

<phoneme> - uses the standard phonetic alphabet, the International Phonetic Alphabet (IPA).

He plays <phoneme alphabet="ipa" ph="U0062 U0258 U0073"> bass </phoneme> guitar.

He also likes to fish; last week he caught a <sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="U0062 U00E6 U0073"> bass </phoneme>.

The ph values in these examples are Unicode numbers.

Attributes of <phoneme>

alphabet - The phonetic alphabet used to specify the pronunciation of the word contained in the <phoneme> element. The only valid values for this attribute are alphabet="ipa" and vendor-defined strings of the form alphabet="x-organization" or alphabet="x-organization-alphabet".

ph - The phonetic spelling of the word, expressed using that alphabet.

Using the IPA requires some linguistic training.  For an excellent tutorial on the IPA symbols and sounds, see http://www.unil.ch/ling/english/phonetique/table-eng.html. 

For an overview of the IPA and a full chart of symbols, see http://www.arts.gla.ac.uk/IPA/ipa.html. 

The sounds used in English and their IPA symbols are illustrated in http://www.antimoon.com/how/pronunc-soundsipa.htm. You can hear each sound by clicking the word that contains the sound. 

To identify the corresponding Unicode number, go to http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm, move the cursor above the IPA symbol, and the Unicode value will appear.  
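
Once the Unicode values are known, an SSML 1.0 processor expects the ph attribute to hold the phonetic characters themselves, either typed directly or written as XML character references. The sketch below uses character references and assumes the conventional dictionary transcriptions /beɪs/ for the instrument and /bæs/ for the fish; these transcriptions are my own choice for illustration and are not the code points listed on the earlier slide.

```xml
<prompt>
  <!-- &#x26A; is U+026A (small capital I), giving the vowel of "base" -->
  He plays <phoneme alphabet="ipa" ph="be&#x26A;s">bass</phoneme> guitar.
  <!-- &#xE6; is U+00E6 (ash), giving the vowel of "pass" -->
  Last week he caught a <phoneme alphabet="ipa" ph="b&#xE6;s">bass</phoneme>.
</prompt>
```

Character references keep the file readable in editors that cannot display IPA symbols, at the cost of being harder to proofread.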

Prosody analysis

Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.

Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on the parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed.

Synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory.

Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different spoken languages may be quite different, and they also vary within a language. For example, the prosody of American, British, Indian, and Jamaican English differs.

<prosody> : pitch

Pitch refers to the “highness or lowness” of speech, measured by the frequency of the sound (Hz, vibrations per second). (Currently not implemented in BeVocal Cafe.)

It can be specified with:
    A number followed by “Hz”
    A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
    A relative change as a relative number, for example "+10" or "-8.7"
    One of the following words: "x-high", "high", "medium", "low", "x-low", or "default"

<prosody> : range

Range specifies the variability of the pitch. It is specified using the same options as pitch (currently not implemented in BeVocal Cafe), e.g.

<prosody pitch="medium" range="x-low"> text to be spoken </prosody>

<prosody>: contour

Contour describes the actual pitch contour for the text (currently not implemented in BeVocal Cafe).

It is a set of time segments with a target pitch specified for each time segment. Each time segment is defined as a percentage of the total time for speaking the contained text, e.g. (25%, 25%, 25%, 25%) would speak the contained text in four equal segments.

An interpolation algorithm smooths the transitions between the time segments. For example, a contour can be used to describe the rise in pitch at the end of a question as follows:

<prosody contour="(90%, medium) (10%, high)"> You said what? </prosody>
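
For comparison, the SSML 1.0 Recommendation, as I read it, treats the first value of each pair as a time position within the contained text (0% is the start, 100% the end) rather than as a segment length, and the second value as a pitch target in any form the pitch attribute accepts. Under that reading, the same rising question could be sketched as follows. Whether a given platform honours contour at all is an assumption to verify.

```xml
<!-- Hold a medium pitch until 90% of the way through, then rise to high -->
<prosody contour="(0%,medium) (90%,medium) (100%,high)">
  You said what?
</prosody>
```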

<prosody> : rate, duration

Rate. The speaking rate, expressed in words per minute (currently not implemented in BeVocal Cafe), specified using any of the following:
    A number
    A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
    A relative change as a relative number, for example "+10" or "-8.7"
    One of the following words: "x-fast", "fast", "medium", "slow", "x-slow", or "default"

The student’s name is <prosody rate="-10%"> John Scott </prosody>

Duration. A value in seconds or milliseconds for the desired time to read the element contents, e.g.

<prosody duration="10s"> text to be spoken </prosody>

<prosody> : volume

Volume. Specifies how loudly or quietly the words are spoken, specified by:
    A number in the range from 0.0 to 100.0
    A relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
    A relative change as a relative number, for example "+10" or "-8.7"
    One of the following words: "x-loud", "loud", "medium", "soft", "x-soft", or "silent"

<prosody volume="loud"> text to be spoken </prosody>
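
Several prosody attributes can be set on the same element. The sketch below combines rate and volume to make the key information slower and louder than the sentence that follows; the wording of the prompt is invented for illustration, and a platform that does not implement an attribute would be expected to ignore it rather than fail.

```xml
<prompt>
  Your appointment is on
  <!-- Slow down and raise the volume for the part the caller must remember -->
  <prosody rate="slow" volume="loud">Tuesday at ten thirty</prosody>.
  <!-- Slightly slower and quieter for the secondary reminder -->
  <prosody rate="-10%" volume="soft">Please arrive ten minutes early.</prosody>
</prompt>
```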

<emphasis>

Formerly <emph>.

level: values are "strong", "moderate", "none", and "reduced". "none" is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize.

<emphasis level="strong">help</emphasis>

<break>

Specifies when to insert silence (a pause) in the text.

strength - the strength of the prosodic break. Values are "none", "x-small", "small", "medium" (the default value), "large", or "x-large".

time - e.g. "250ms", "3s".

Welcome to the Student System
<break time="250ms"/>
Please say one of the following: …

Waveform Production

Process of converting a textual representation to acoustical sounds which humans hear and interpret as human-like speech.

<voice> - uses a different voice from the default specified for TTS

<voice age="3" gender="female"> text to speak </voice>

<audio> - specifies what audio to present to user

<desc> - specifies text-only output describing the audio output (e.g. dog barking)
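
A minimal sketch of how these three elements can nest, assuming an SSML 1.0 processor: <desc> sits inside <audio>, and any other text inside <audio> is spoken only if the file cannot be played. The file name dog.wav and the wording are hypothetical.

```xml
<prompt>
  <voice gender="female">
    The next sound is
    <audio src="dog.wav">
      <!-- Text-only description of the audio, for non-audio output -->
      <desc>dog barking</desc>
      <!-- Fallback speech if dog.wav is unavailable -->
      a dog barking
    </audio>
  </voice>
</prompt>
```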

<audio>: playing prerecorded audio files

Output can consist of a combination of prerecorded files, audio streams, or synthesised speech, e.g.

<prompt>
    Welcome to the Student System
    <audio src="AudioSample.wav"/>
    How can I help you?
</prompt>

<audio> can have alternative content in case the audio sample is not available, e.g.

<audio src="welcome.wav"> Welcome to the Student System </audio>

Recording speech input using <record>

<record> is a form element similar to <field>.

It is used to collect a recording from the user that can be played back or submitted to a server.

It has a <prompt> element and can have a <filled> element.

It can have a grammar for a spoken command to terminate the recording.

Attributes of <record>

name - The name of a variable that holds the value of the recorded item.

expr - The initial value of the recorded item variable.

beep - There are two possible values: beep="true" and beep="false". If true, a beep tone is presented to the user just before the recording begins. The default is false.

maxtime - The maximum duration of the recording, beginning when the recording starts. For example, maxtime="10s", where "10s" means 10 seconds.

finalsilence - The interval of silence indicating the end of speech. For example, finalsilence="3s" (not implemented in IBM Voice Server SDK).

dtmfterm - There are two possible values: dtmfterm="true" and dtmfterm="false". If true, then any DTMF key press not matched by an active grammar will terminate the input. The default is true.

type - Media format of the resulting recording. A media type is a file format written in the form type/subtype. For audio files, the type is always audio, e.g. audio/x-wav.

Example using <record>

<form>
    <record name="msg" beep="true" maxtime="5s"
            finalsilence="5000ms" dtmfterm="true" type="audio/x-wav">
        <prompt timeout="5s">Record your message after the beep.</prompt>
    </record>
    <filled>
        <!-- when recording is completed, replay recorded message -->
        <prompt> You said <audio expr="msg"/> </prompt>
    </filled>
</form>

Submitting recording to the server

In this example, a recording has been stored in the variable ‘msg’ and the system asks whether the user wishes to keep it:

<field name="confirm" type="boolean">
    <prompt> Your message is <audio expr="msg"/>. </prompt>
    <prompt> To keep it, say yes. To discard it, say no. </prompt>
    <filled>
        <if cond="confirm">
            <submit next="save_message.jsp" enctype="multipart/form-data"
                    method="post" namelist="msg"/>
        </if>
        <clear/>
    </filled>
</field>

Dealing with user hang up during recording

When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown. Audio recorded up until the hangup is available through the <record> variable e.g.

<catch event="connection.disconnect.hangup">

… action such as submit recording to server…

</catch>
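
One way the handler might be filled in is sketched below, with the <catch> placed inside the <record> so that it applies only while the recording is in progress. The form id, the 30-second limit, and the reuse of save_message.jsp as the receiving script are assumptions for illustration, not part of the original slide.

```xml
<form id="leave_message">
  <record name="msg" beep="true" maxtime="30s" type="audio/x-wav">
    <prompt>Record your message after the beep.</prompt>
    <!-- Caller hung up mid-recording: msg holds the audio captured so far -->
    <catch event="connection.disconnect.hangup">
      <!-- No prompt here: the caller is gone, so only submit the partial audio -->
      <submit next="save_message.jsp" enctype="multipart/form-data"
              method="post" namelist="msg"/>
    </catch>
  </record>
</form>
```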

Exercise: SSML markup

Create a file using some SSML markup for TTS.

Examples:

He drove his new car, <prosody pitch="-10%" range="-20%" volume="-20%">not his ugly old car</prosody>, because he wanted to seem more <emphasis level="strong"> impressive </emphasis>.

My user number is <say-as interpret-as="digits"> 145678 </say-as>.

Sample file: tts.vxml

Exercise: recording and using audio files

Create a simple application that includes a field in which you ask the user to speak some information, such as name and address, that is recorded by the system for later playback.

Play back a pre-recorded file (music to be played as introduction)
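
A possible starting skeleton for this exercise is sketched below. It assumes a VoiceXML 2.0 platform; the file name intro.wav, the form id, the variable name userinfo, and the 20-second limit are all placeholders to be replaced, and the recording is only played back within the same call rather than stored.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="details">
    <block>
      <!-- Introductory music; the enclosed text is spoken if the file is missing -->
      <prompt>
        <audio src="intro.wav">Welcome to the Student System.</audio>
      </prompt>
    </block>
    <record name="userinfo" beep="true" maxtime="20s" dtmfterm="true">
      <prompt>After the beep, please say your name and address.</prompt>
      <filled>
        <!-- Replay what was just recorded -->
        <prompt>You said <audio expr="userinfo"/></prompt>
      </filled>
    </record>
  </form>
</vxml>
```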