IBM Labs in Haifa © 2004 IBM Corporation Embedded Concatenative Text-to-Speech Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman Media Services and Technologies Group October 14, 2004
IBM Labs in Haifa © 2004 IBM Corporation

Embedded Concatenative Text-to-Speech

Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman
Media Services and Technologies Group
October 14, 2004

Why Text-to-Speech? Why Concatenative?

• Text-to-speech eliminates the need to prerecord all possible messages
• The alternative, recorded prompts, has much less flexibility:
  - Cannot synthesize words/phrases outside the inventory
  - Adding new prompts is expensive
  - No expression: prosody cannot be controlled, especially when combining prompts
• Text-to-speech (TTS) can synthesize arbitrary text
• In concatenative text-to-speech (CTTS), small segments of speech are selected from a large speech database and concatenated together

The role of TTS in a conversational system

• Critical component of the conversational interface
• Only way to present information in eyes-busy/unavailable situations (car, phone)
• Quality of a conversational system is often equated with TTS quality

[Diagram: conversational loop. voice → Speech Recognition → text → Natural Language Understanding → 'meaning' → Dialog Manager → 'meaning' → Natural Language Generation → text → Speech Synthesis → voice]

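How does a CTTS system work?

[Diagram: block diagram of the CTTS pipeline from text to speech. Blocks: Normalization, Text to Unit Conversion, Text to Prosody Targets (fed by prosody models), Segment Selection (fed by the database), Post-Search Modification. The blocks are grouped into a Front-End and a Back-End. Example shown: "We visited Rodeo Dr." → "We visited Rodayo Drive."]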

Language dependency and other considerations

• The front-end is mostly language dependent, relying on language rules, pronunciation dictionaries, etc.
• The back-end is mostly language independent, except for the speech database (a.k.a. "voice")
• The voice needs to be recorded:
  - In the desired language and accent, e.g., "Canadian French"
  - With the desired speaker ("voice talent"):
    - male/female
    - low/high pitch
    - slow/fast speaking rate
  - In a professional recording studio with professional equipment
  - With a sampling rate above the target sampling rate (usually 22 kHz)

Text Normalization

• Language-independent text cleaning (HTML tags, etc.)
• Language-dependent normalization for dates, time, numbers, currency, phone numbers, addresses, abbreviations
• Examples:
  - "St. Martin St." becomes "Saint Martin Street"
  - "Dr. King Dr." becomes "Doctor King Drive"
  - "1 oz." becomes "one ounce"
  - "2 oz." becomes "two ounces"
  - "$5 million" becomes "five million dollars"
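The examples above can be reproduced with a few ordered rewrite rules. The following Python sketch is a toy illustration only, not the IBM front end: the function name `normalize`, the small number table, and the heuristic that "St." or "Dr." before a capitalized word is a title (and a street suffix otherwise) are all assumptions made for this example.

```python
import re

# Tiny number table, enough for the slide examples (assumption, not a full number speller).
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
                6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}


def number_to_words(n: int) -> str:
    """Spell out small integers; fall back to digits for anything else."""
    return NUMBER_WORDS.get(n, str(n))


def _ounces(match: re.Match) -> str:
    """Expand '<n> oz.' with singular/plural agreement."""
    n = int(match.group(1))
    unit = "ounce" if n == 1 else "ounces"
    return f"{number_to_words(n)} {unit}"


def normalize(text: str) -> str:
    """Apply language-dependent rewrite rules in a fixed order."""
    # Currency with a magnitude word: "$5 million" -> "five million dollars"
    text = re.sub(
        r"\$(\d+) (million|billion)",
        lambda m: f"{number_to_words(int(m.group(1)))} {m.group(2)} dollars",
        text,
    )
    # Units: "1 oz." -> "one ounce", "2 oz." -> "two ounces"
    text = re.sub(r"(\d+) oz\.", _ounces, text)
    # Heuristic: abbreviation followed by a capitalized word is a title.
    text = re.sub(r"\bSt\. (?=[A-Z])", "Saint ", text)
    text = re.sub(r"\bDr\. (?=[A-Z])", "Doctor ", text)
    # Any remaining occurrence is treated as a street suffix.
    text = re.sub(r"\bSt\.", "Street", text)
    text = re.sub(r"\bDr\.", "Drive", text)
    return text


print(normalize("Dr. King Dr."))
```

The rule order matters: titles are resolved before street suffixes, so the second "Dr." in "Dr. King Dr." falls through to the suffix rule. The heuristic fails when a street suffix is followed by a capitalized word, which is one reason real front ends rely on richer, language-specific context.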

Possible Concatenation Units

• Words
• Syllables
• Demi-syllables
• Diphones
• Augmented diphones
• Phone units
• Subphone units

Concatenation Units in the IBM CTTS system

• HMM state-sized segments (3 states per phone)
• Segments are classified according to their phonetic context:
  - Phonetic context is determined by a binary decision tree with questions on neighboring phones
  - Segments are labeled according to the leaves of the context-dependent decision tree
  - Typically 10-20 database occurrences per leaf label
• Text is first converted to phones using a pronunciation dictionary and then to leaves

[Diagram: a phone divided into three HMM states S1, S2, S3, with leaf labels L1, L2, L3]
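To make the leaf-labeling idea concrete, here is a minimal Python sketch of a binary decision tree that asks yes/no questions about the neighboring phones and returns a leaf label. Everything in it is hypothetical: the `Context` and `Node` types, the phone classes `VOWELS` and `NASALS`, and the questions themselves are invented for illustration and are not taken from the IBM system.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Context:
    left: str   # phone preceding the current one
    right: str  # phone following the current one


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes/no question about the context
    yes: Union["Node", str]              # subtree or leaf label if the answer is yes
    no: Union["Node", str]               # subtree or leaf label if the answer is no


def find_leaf(tree: Union[Node, str], ctx: Context) -> str:
    """Walk the tree from the root until a leaf label (a string) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(ctx) else node.no
    return node


# Hypothetical phone classes and a two-question tree for one HMM state.
VOWELS = {"a", "e", "i", "o", "u"}
NASALS = {"m", "n"}

tree = Node(
    question=lambda c: c.left in VOWELS,
    yes=Node(question=lambda c: c.right in NASALS, yes="L1", no="L2"),
    no="L3",
)

print(find_leaf(tree, Context(left="a", right="n")))
```

In a setup like this, every database segment and every target position in the text receives a label the same way, so selection only has to compare candidates that share a leaf.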

Prosody modeling

• Prosody is critical for obtaining the right pronunciation and intonation
• Wrong prosody can cause speech to sound unnatural or even unintelligible
• Prosody targets typically include:
  - Pitch
  - Phone durations
  - Energy
• Prosody parameters can be trained to match the target speaker's prosody

How can prosody affect naturalness?

• Expressiveness is a very important factor in speech naturalness. Controlling prosody can generate expression.
• Neutral prosody: [example omitted in transcript]
• Expressive prosody: [example omitted in transcript]

Segment Selection and Post-Search Modification

• Each segment is selected from all the database candidates labeled with the target leaf label
• Dynamic programming is used to optimize the series of selected segments by minimizing a cost function
• The cost function weighs:
  - Proximity to prosody targets (pitch, duration)
  - Continuity between consecutive segments chosen:
    - Spectrum continuity
    - Pitch continuity
• Post-search modifications are carried out to modify the pitch, duration and energy to match the target prosody
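The dynamic-programming search described above can be sketched as a Viterbi-style pass over the candidate lists. The Python code below is a simplified illustration under stated assumptions, not the IBM cost function: the `Segment` fields, the absolute-difference costs, and the unit weights are invented for the example, and spectrum continuity is left out so that only pitch continuity contributes to the join cost.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    pitch_start: float  # pitch at the segment's left edge (Hz)
    pitch_end: float    # pitch at the segment's right edge (Hz)
    duration: float     # segment duration (s)


def target_cost(seg: Segment, pitch: float, duration: float,
                w_pitch: float = 1.0, w_dur: float = 1.0) -> float:
    """Distance of a candidate from the prosody targets (assumed form)."""
    mean_pitch = 0.5 * (seg.pitch_start + seg.pitch_end)
    return w_pitch * abs(mean_pitch - pitch) + w_dur * abs(seg.duration - duration)


def join_cost(prev: Segment, nxt: Segment, w_cont: float = 1.0) -> float:
    """Pitch discontinuity at the concatenation point (spectrum term omitted)."""
    return w_cont * abs(prev.pitch_end - nxt.pitch_start)


def select_segments(candidates: List[List[Segment]],
                    targets: List[Tuple[float, float]]) -> List[int]:
    """Return, per position, the index of the candidate on the cheapest path."""
    # best[i][j]: minimal total cost of any path ending at candidate j of position i
    best = [[target_cost(c, *targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for cand in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, cand)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(cand, *targets[i]))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path


candidates = [
    [Segment(100.0, 100.0, 0.1), Segment(106.0, 110.0, 0.1)],
    [Segment(110.0, 110.0, 0.1), Segment(100.0, 100.0, 0.1)],
]
targets = [(100.0, 0.1), (110.0, 0.1)]
print(select_segments(candidates, targets))
```

In this small example the first candidate at position 0 matches its target best on its own, but the search picks the second one because it joins the next segment without a pitch jump. That trade-off between target proximity and continuity is what the combined cost is meant to capture.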

IBM high quality CTTS with super-voices

• Building CTTS voices includes:
  - Voice recording using a predefined script
  - Limited manual work for "cleaning"
  - Intensive automatic processing
• IBM super-voices:
  - Very large recording script:
    - Usually 10,000 sentences are read by the speaker
    - 15 hours of audio, 11 hours of speech excluding silence
  - Script reflects typical scenarios
  - Professional recording studio and professional speakers
  - Three-stage audition process for the final voices

Footprint and environments

• Size of the voice dataset is a crucial parameter for quality and naturalness
• The CTTS system can operate in various environments:
  - Server: typical footprint of 500-1000 MB
  - Desktop: typical footprint of 50-100 MB
  - Embedded: typical footprint of 5-10 MB
• The embedded concatenative text-to-speech (eCTTS) challenge: can we reduce the size of the voice by two orders of magnitude without severely degrading the quality?

Why eCTTS?

• Server-based CTTS requires a connection (wired or wireless) to the server, which is not always available
• Device manufacturers and car manufacturers usually prefer embedded applications running locally
• Even with the growing amount of resources available on embedded devices and in-car systems, small-footprint eCTTS is required:
  - Memory and processing power are important factors in the price of embedded devices
  - Typically, the system includes many other components
  - Sometimes several languages should be supported

eCTTS in the automotive market

• IBM's eCTTS is part of the Honda speech interface in 2005 high-end cars
• Embedded ViaVoice includes embedded TTS and speech recognition, providing a full conversational system
• Main usage is for navigation applications

How does eCTTS work?

[Diagram: TTS Front End → (leaves) → Segment Selection → Segment Adjustment & Concatenation → Feature Reconstruction → speech. Other labels in the diagram: Speech Dataset, Feature vectors, Features, Pitch, Energy, Duration.]

How is the x100 size reduction achieved?

• Reducing the number of segments by segment preselection
• Reusing the same data for several purposes
• Using a more efficient speech model
• Data compression

[Diagram: voice dataset size before and after reduction]

Segment Preselection

• The process:
  1. A voice dataset is built with all the segments in place (typically 1M segments)
  2. A large number (100K) of sentences are synthesized, and statistics on the selected segments are collected
  3. The fraction of segments that were selected most frequently is kept
• Typically 7-10% of the segments are kept, resulting in ~100K segments and a coverage of 70-80% of the selections made during the statistics collection
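The preselection steps above amount to counting how often each segment is chosen and keeping the most frequent ones. The Python sketch below illustrates that idea on toy data. The function name `preselect`, its parameters, and the integer segment IDs are assumptions for the example, and in practice the selection log would come from synthesizing a large number of sentences with the full voice.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def preselect(selection_log: List[int],
              all_segment_ids: Iterable[int],
              keep_fraction: float) -> Tuple[Set[int], float]:
    """Keep the most frequently selected segments.

    selection_log   -- one entry per segment selection made during synthesis
    all_segment_ids -- every segment in the full voice dataset
    keep_fraction   -- share of the full dataset to keep (e.g. 0.07 to 0.10)

    Returns the kept segment IDs and the share of logged selections they cover.
    """
    counts = Counter(selection_log)
    n_total = len(list(all_segment_ids))
    n_keep = max(1, round(keep_fraction * n_total))
    kept = [seg for seg, _ in counts.most_common(n_keep)]
    coverage = sum(counts[seg] for seg in kept) / len(selection_log)
    return set(kept), coverage


# Toy data: 10 segments in the dataset, 10 logged selections.
log = [0] * 5 + [1] * 3 + [2] + [3]
kept, coverage = preselect(log, range(10), keep_fraction=0.2)
print(sorted(kept), coverage)
```

In the toy run, keeping 20% of the segments covers 80% of the logged selections. The slide reports the same kind of skew at scale, where 7-10% of the segments cover 70-80% of the selections.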

Speech Model

• Frequency-domain sinusoidal model, with amplitude and phase computed for every pitch harmonic (voiced frames)
• Accurate representation of the spectral envelope, which is used in segment selection, pitch modification and reconstruction

[Figure: spectral envelope over frequency, sampled at harmonics spaced by the pitch; each harmonic i has complex amplitude $A_i e^{j\varphi_i}$]
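As an illustration of what a harmonic sinusoidal model represents, the Python sketch below rebuilds a voiced frame as a sum of cosines at multiples of the pitch, one amplitude and one phase per harmonic. It is a generic textbook-style reconstruction, not the IBM implementation. The function name and parameters are assumptions, and framing, windowing, overlap-add, and unvoiced frames are ignored.

```python
import math
from typing import List, Sequence


def synthesize_voiced_frame(f0: float,
                            amplitudes: Sequence[float],
                            phases: Sequence[float],
                            sample_rate: float,
                            n_samples: int) -> List[float]:
    """Sum of harmonics: s[n] = sum_k A_k * cos(2*pi*k*f0*n/fs + phi_k).

    amplitudes[k-1] and phases[k-1] describe the k-th harmonic of f0.
    The caller is expected to pass only harmonics below sample_rate / 2.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = 0.0
        for k, (amp, phi) in enumerate(zip(amplitudes, phases), start=1):
            sample += amp * math.cos(2.0 * math.pi * k * f0 * t + phi)
        frame.append(sample)
    return frame


# One period of a 100 Hz tone with a single harmonic, sampled at 8 kHz.
frame = synthesize_voiced_frame(100.0, [1.0], [0.0], 8000.0, 80)
print(frame[0], frame[40])
```

Because such a frame is described by the pitch plus per-harmonic parameters, changing the pitch comes down to resampling the spectral envelope at new harmonic positions. That is one way to read the slide's remark that the envelope is used for pitch modification as well as reconstruction.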

Demonstration

Languages demonstrated (audio examples*):

• German
• Italian
• UK English
• US English

* All voices are 22 kHz/10 MB except the German male, which is 11 kHz/8 MB