EE2F1 Speech & Audio Technology Sept. 26, 2002 SLIDE 1 THE UNIVERSITY OF BIRMINGHAM ELECTRONIC, ELECTRICAL & COMPUTER ENGINEERING Digital Systems & Vision Processing EE2F1 Multimedia (1): Speech & Audio Technology Lecture 7: Speech Synthesis (1) Martin Russell Electronic, Electrical & Computer Engineering School of Engineering The University of Birmingham

Dec 19, 2015


SLIDE 1

EE2F1 Multimedia (1): Speech & Audio Technology

Lecture 7: Speech Synthesis (1)

Martin Russell

Electronic, Electrical & Computer Engineering

School of Engineering

The University of Birmingham

SLIDE 2

Stages in “text-to-speech” synthesis

Text normalisation

Text-to-phone conversion

Linguistic analysis

Semantic analysis

Conversion of phone-sequence to sequence of synthesiser control parameters

Synthesis of acoustic speech signal

SLIDE 3

Approaches to synthesis

Final stage is to convert ‘phone’ or word sequence into a sequence of synthesiser control parameters

Two main approaches:

– Waveform concatenation

– Model-based speech synthesis (includes articulatory synthesis)

SLIDE 4

Waveform Concatenation

Join together, or concatenate, stored sections of real speech

Sections may correspond to whole words or sub-word units

Early systems based on whole words

– e.g. Speaking clock - UK telephone system, 1936

Storage and access major issues

Speech quality requires data-rates of 16,000 to 32,000 bits per second (bps)
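
To make the storage issue concrete, the short calculation below converts the quoted data-rates into bytes for a stored word inventory. The inventory size (100 words of about 0.5 s each) is an assumption chosen only for illustration; it does not come from the slides.

```python
def storage_bytes(duration_s: float, bit_rate_bps: int) -> float:
    """Bytes needed to store duration_s seconds of speech at bit_rate_bps."""
    return duration_s * bit_rate_bps / 8  # 8 bits per byte


# Hypothetical inventory: 100 words, each about 0.5 s long.
total_seconds = 100 * 0.5

low = storage_bytes(total_seconds, 16_000)   # lower data-rate from the slide
high = storage_bytes(total_seconds, 32_000)  # upper data-rate from the slide

print(f"{low / 1000:.0f} kB to {high / 1000:.0f} kB")
```

Under these assumptions the inventory needs between 100 kB and 200 kB, and the requirement scales linearly with both vocabulary size and data-rate.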

SLIDE 5

1936 “Speaking Clock”

From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc

SLIDE 6

Whole word concatenation (1)

Whole word concatenation can give good quality speech (as in speaking clock), but has many disadvantages:

– pronunciation of a word influenced by neighbouring words (co-articulation)

– prosodic effects like intonation, rate-of-speaking and amplitude also influenced by context

– interpretation of a sentence will be strongly influenced by details of individual words used (“Mary didn’t buy Sam a pizza”)

SLIDE 7

Whole word concatenation (2)

Disadvantages (continued):

– words must be extracted from the right sort of sentence

– most suitable for applications where structure of the sentence is constrained, e.g., announcements, lists…

– may need to record more than one example of each word, e.g., raised pitch at end of a list, pre-pause lengthening…

SLIDE 8

Example – original recording

The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link

SLIDE 9

Example – trivial concatenative synthesis

The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove
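
The reordering in this example can be sketched as plain list concatenation of stored word waveforms. The function name, the `gap_samples` parameter and the toy sample values below are illustrative assumptions, not part of the lecture material. Because each stored waveform keeps the pitch and duration it had in its original position, a naive join like this cannot fix the prosody of the reordered sentence.

```python
from typing import Dict, List


def concatenate_words(words: List[str],
                      inventory: Dict[str, List[float]],
                      gap_samples: int = 0) -> List[float]:
    """Join stored word waveforms end to end, with optional silence between them."""
    output: List[float] = []
    for i, word in enumerate(words):
        if word not in inventory:
            raise KeyError(f"no recording for {word!r}")
        if i > 0:
            output.extend([0.0] * gap_samples)  # silence between words
        output.extend(inventory[word])
    return output


# Toy 'recordings': placeholder sample lists standing in for real audio.
inventory = {
    "Bromsgrove": [0.1, 0.2],
    "Malvern Link": [0.3, 0.4, 0.5],
}

# Station names in the reverse of the order in which they were recorded.
reordered = concatenate_words(["Malvern Link", "Bromsgrove"], inventory, gap_samples=1)
```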

SLIDE 10

Example repeated

Original recording

‘Concatenative synthesis’

SLIDE 11

Whole word concatenation (3)

Disadvantages (continued):

– to add new words the original speaker must be found, or all words must be re-recorded

– even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming

SLIDE 12

Sub-word concatenation (1)

Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units

Need well-annotated, phonetically-balanced corpus of speech recordings

Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech, and can be concatenated and used for speech synthesis

SLIDE 13

Sub-word concatenation (2)

Difficulties include:

– identification of a set of suitable units

– careful annotation of large amounts of data

– derivation of a good method for concatenation

SLIDE 14

Sub-word concatenation (3)

Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary, but other problems are exacerbated.

In particular, coarticulation and pitch continuity problems occur within, as well as between, words.

Necessary to use several examples of each phone (corresponding roughly to different allophones).

SLIDE 15

Sub-word concatenation (4)

Natural to select fragments that characterise the phone target values, but modelling transitions between these targets is a significant problem

SLIDE 16

Example: sub-word concatenation

“stack” (original)

“task” sub-word concatenative synthesis

SLIDE 17

Transitional units (1)

Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects.

Hence select fragments which characterise transitions between phones, rather than phone targets.

e.g., diphone - transition between two phones.
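
As a minimal sketch of what a diphone inventory implies, the function below turns a phone sequence into the sequence of diphones needed to synthesise it. The phone labels and the `sil` silence symbol are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def phones_to_diphones(phones: List[str],
                       silence: str = "sil") -> List[Tuple[str, str]]:
    """Return the diphones (adjacent phone pairs) covering a phone sequence.

    The sequence is padded with silence so that the transitions into the
    first phone and out of the last phone are included.
    """
    padded = [silence] + phones + [silence]
    return list(zip(padded[:-1], padded[1:]))


# Hypothetical phone labels for the word "task".
diphones = phones_to_diphones(["t", "a", "s", "k"])
print(diphones)
```

Each pair is cut from mid-phone to mid-phone, so the joins fall in the approximately stationary central regions rather than in the transitions.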

SLIDE 18

Transitional units (2)

There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to.

Possible solutions are:

– use several different examples of each diphone

– store short transition regions, and

– interpolate between end values
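
Interpolating between end values can be sketched as linear interpolation between the last parameter frame of one stored transition and the first frame of the next. This assumes the units are stored as frames of synthesiser parameters; the two-value frames below are hypothetical formant frequencies in Hz, used only for illustration.

```python
from typing import List


def interpolate_frames(left: List[float],
                       right: List[float],
                       n_frames: int) -> List[List[float]]:
    """Linearly interpolate n_frames intermediate frames between two end frames."""
    if len(left) != len(right):
        raise ValueError("frames must have the same number of parameters")
    frames = []
    for k in range(1, n_frames + 1):
        w = k / (n_frames + 1)  # weight moves from near 0 to near 1
        frames.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return frames


# End frame of one transition region and start frame of the next (hypothetical values).
bridge = interpolate_frames([500.0, 1500.0], [700.0, 1100.0], 3)
print(bridge)
```

The three bridging frames step evenly from the first end value to the second, which removes the jump that a direct join would produce in the phone centre.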

SLIDE 19

Transitional units (3)

Coping with coarticulation effects by modelling transitions and

– (a) using multiple examples to cope with variation in the instantiation of the phone centres, and

– (b) by interpolation between short transition regions

SLIDE 20

More on prosody

Discontinuity in the fundamental frequency exacerbated for sub-word methods.

Can use source-filter model to separate excitation signal from vocal-tract shape.

Vocal-tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
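
One simple way to specify a smooth fundamental frequency pattern is as a contour that changes linearly over the utterance, from which the positions of successive excitation pulses follow. The sketch below makes that assumption; the function name, the linear contour and the example values are illustrative and are not taken from the lecture.

```python
from typing import List


def pitch_marks(f0_start_hz: float, f0_end_hz: float,
                duration_s: float, sample_rate: int) -> List[int]:
    """Sample indices of excitation pulses for a linearly changing F0 contour."""
    marks = []
    t = 0.0
    while t < duration_s:
        marks.append(int(round(t * sample_rate)))
        # F0 at the current time, linearly interpolated along the contour.
        f0 = f0_start_hz + (f0_end_hz - f0_start_hz) * (t / duration_s)
        t += 1.0 / f0  # the next pulse is one pitch period later
    return marks


# A falling contour from 120 Hz to 80 Hz over half a second (hypothetical values).
falling = pitch_marks(120.0, 80.0, 0.5, 16000)
```

Because the pulse positions come from one continuous contour, the excitation has no pitch jumps at the points where vocal-tract descriptions are joined.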

SLIDE 21

PSOLA: Pitch Synchronous Overlap and Add

PSOLA (Charpentier, 1986)

Most successful current approach to concatenative synthesis

In PSOLA, the end regions of windowed waveform samples are overlapped pitch-synchronously and added

BT’s Laureate is an example
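
The overlap-and-add step at a join can be sketched as follows. This is a simplified illustration, not the Laureate implementation: it assumes the caller has already aligned the two segments pitch-synchronously, so that `overlap` spans one pitch period in each segment. The raised-cosine taper and all names are assumptions.

```python
import math
from typing import List


def overlap_add_join(left: List[float], right: List[float],
                     overlap: int) -> List[float]:
    """Join two segments by overlapping their tapered end regions and adding."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both segments")
    keep = len(left) - overlap
    out = list(left[:keep])
    for i in range(overlap):
        # Rising half of a raised-cosine (Hann) window; the falling half is 1 - w.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
        out.append((1 - w) * left[keep + i] + w * right[i])
    out.extend(right[overlap:])
    return out


# Two toy segments of constant amplitude, joined over a 4-sample overlap.
joined = overlap_add_join([1.0] * 10, [1.0] * 10, 4)
```

The two window halves sum to one across the overlap, so a signal that is identical on both sides of the join passes through unchanged.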

SLIDE 22

PSOLA

From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

SLIDE 23

Speech modification using PSOLA

In addition to speech synthesis from segments, there are two other common applications of PSOLA:

– Pitch modification

– Duration modification
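
Pitch modification can be sketched in the time domain as follows: a two-period, Hann-windowed segment is taken around each pitch mark, and the segments are re-spaced at the new pitch period and added. This is a deliberately simplified sketch that assumes the pitch marks are already known and copies the segment from the nearest analysis mark; the function name and parameters are illustrative. Duration modification, which repeats or omits segments, is not shown.

```python
import math
from typing import List


def psola_pitch_shift(signal: List[float], marks: List[int],
                      factor: float) -> List[float]:
    """Scale pitch by `factor` (above 1 raises it), keeping duration unchanged."""
    if len(marks) < 2 or factor <= 0:
        raise ValueError("need at least two pitch marks and a positive factor")
    out = [0.0] * len(signal)
    t = float(marks[0])  # position of the current synthesis pitch mark
    while t <= marks[-1]:
        # Nearest analysis pitch mark and the local pitch period there.
        j = min(range(len(marks)), key=lambda k: abs(marks[k] - t))
        if j + 1 < len(marks):
            period = marks[j + 1] - marks[j]
        else:
            period = marks[j] - marks[j - 1]
        centre = int(round(t))
        # Copy a two-period Hann-windowed segment to the synthesis position.
        for n in range(-period, period + 1):
            src = marks[j] + n
            dst = centre + n
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                weight = 0.5 + 0.5 * math.cos(math.pi * n / period)
                out[dst] += weight * signal[src]
        t += period / factor  # closer marks raise pitch, wider marks lower it
    return out


# Toy pulse train with a 10-sample period; halving the spacing doubles the pitch.
period = 10
marks = list(range(10, 60, period))
pulses = [1.0 if n % period == 0 else 0.0 for n in range(60)]
raised = psola_pitch_shift(pulses, marks, 2.0)
```

With `factor` equal to 1 the windows overlap by half and sum to one, so the signal between the first and last pitch marks is reproduced unchanged.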

SLIDE 24

Increasing pitch using PSOLA

From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

SLIDE 25

Decreasing pitch using PSOLA

From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

SLIDE 26

The ‘Laureate’ System

The BT “Laureate” system is a modern, PSOLA-based synthesiser

See Edington et al. (1996a), also look at the web site

Demonstration

SLIDE 27

PSOLA strengths and weaknesses

Strengths

– Produces good quality speech

Weaknesses

– Large, annotated corpus needed for each ‘voice’

– Requires accurate pitch peak detection

– Inflexible – new voices can only be produced by recording and labelling significant speech corpora from new speakers

Automatic annotation of corpora using techniques from speech recognition
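
To illustrate what pitch peak detection involves, the sketch below picks one waveform peak per pitch period by searching a window bounded by the shortest and longest plausible periods. It is a naive illustration under strong assumptions (voiced speech throughout, one dominant peak per period, known period bounds) and is not the method used by any particular system.

```python
from typing import List


def pick_pitch_peaks(signal: List[float],
                     min_period: int, max_period: int) -> List[int]:
    """Return indices of successive pitch peaks found by windowed peak picking."""
    if not signal:
        return []
    # First peak: the largest sample within the first max_period samples.
    first_end = min(max_period, len(signal))
    peak = max(range(first_end), key=lambda n: signal[n])
    peaks = [peak]
    # Each later peak lies between min_period and max_period after the last one.
    while peak + max_period < len(signal):
        lo = peak + min_period
        hi = peak + max_period + 1
        peak = max(range(lo, hi), key=lambda n: signal[n])
        peaks.append(peak)
    return peaks


# Toy pulse train with one peak every 10 samples, starting at sample 10.
train = [1.0 if n % 10 == 0 and n > 0 else 0.0 for n in range(60)]
found = pick_pitch_peaks(train, min_period=8, max_period=12)
```

Real speech has unvoiced stretches and peaks of varying shape, which is why accurate detection is listed as a weakness of the approach.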

SLIDE 28

Summary

Concatenative speech synthesis

Whole word concatenation

Importance of prosody

Sub-word concatenation

Choice of sub-word units

PSOLA