Analysis and Modelling of Speech Prosody and Speaking Style

Nicolas Obin

Sound Analysis and Synthesis Department

IRCAM - CNRS - UMR 9912 - STMS

23 June 2011

MeLos


Introduction

Discrete Modelling
• Use of Rich Syntactic Description to Assign Prosodic Events
• Combination of Syntactic and Metric Constraints to Assign Pauses

Continuous Modelling
• Stylization and Trajectory Modelling of F0 contours

Speaking Style
• Ability of Human Listeners to Identify a Speaking Style
• Discrete/Continuous Modelling of Speaking Style

Conclusion & Further Directions


Text-To-Speech Synthesis

Current Methods
• Unit selection [Hunt and Black, 1996]
• HMM-based [Yoshimura et al., 1999]: models speech characteristics with parametric statistical methods

Current State
• intelligible

Limitations
• naturalness
• variety

Improvement
• modelling speech prosody


Speech Prosody

Definition
• Suprasegmental variations of speech: "the music of speech" (e.g., intonation, accent)
• Conveys meaning, emotions, intentions, and much information about the background of a speaker
• Vocal signature of a speaker, which contributes to his identity


Speech Prosody

Continuous Characteristics
• fundamental frequency
• duration
• intensity
• voice quality
• articulation degree

Discrete Characteristics
• Description of relevant prosodic events
  1. accent
  2. boundaries (e.g., pause)

[Figure: F0 contour (f0 [Hz] against time [s]) of the utterance "Longtemps, je me suis couché de bonne heure", segmented into the syllables lo, ta, Z@, m@, sHi, ku, Se, d2, bO, n9R and annotated with prosodic labels.]

Table 3.4: Description of the IVTS transcription system with the tone inventory proposed for French.

description level            symbol    description
prominence                   P         prosodic prominence
local phonetic variations    l/L       low pitch
                             m/M       middle pitch
                             h/H       high pitch
global phonetic variations   R         reset
                             D         downstep
phonological: tone           L*/L      low pitch accent
                             H*/H      high pitch accent
                             +
phonological: modifier       ∧         upstep
                             !         downstep
                             >         propagation
phonological: frontier       %L/L%     low pitch boundary tone
                             %H/H%     high pitch boundary tone
                             %/%

Table 3.5: Illustration of the text-to-prosodic-structure conversion for the sentence "Longtemps, je me suis couché de bonne heure." (levels: phonological, global phonetic, local phonetic, prominence, syllable, sentence).


Levels of Speech Communication

HEY PATRICK!

phonetics

phonology

morphology

syntax

semantics

pragmatics

acoustic signal

SPEAKER

LISTENER

SPEECH: PROSODY

TEXT: ABSTRACT LINGUISTIC LEVELS


Major Contributions of this Work

Statistical Modelling of Speech Prosody

1. Statistical modelling of discrete (position of prosodic events) and continuous (F0 variations, duration) characteristics of speech prosody
2. Combination of syntactic and metric constraints to assign pauses
3. Stylization and trajectory modelling of F0 contours

Integration of a Rich Linguistic Description

4. Use of deep syntactic parsing to model speech prosody characteristics

Application to Speaking Style Modelling

5. Study on the ability of listeners to identify a speaking style
6. Reference for the evaluation of speaking style modelling
7. Ability of discrete/continuous HMMs to model the characteristics of a speaking style


Statistical Framework Used in this Work

Hidden Markov Models
• commonly used in speech recognition/synthesis

Used in Speech Synthesis
• discrete/continuous HMMs [Black and Taylor, 1994, Yoshimura et al., 1999]

Paradigms
• modelling speech characteristics
• modelling variability due to the context (e.g., phonemic, lexical, syntactic, or even semantic)
• context clustering: modelling contexts that are acoustically relevant


Introduction

Objective

Determine the position of prosodic events (accent, pause) from a text

Linguistic Studies

have pointed out that speech prosody is partially constrained by:
• syntactic constraint [Selkirk, 1984]
• metric constraint [Liberman and Prince, 1977]

Issues
• Modelling syntactic constraint
• Modelling metric constraint
• Combining syntactic & metric constraints

Contributions of my Work
• Integration of a deep syntactic description
• Use of Segmental HMMs + Dempster-Shafer Fusion


Transcription of Speech Prosody Used in this Work¹

Specificities of French prosody
• French: syllable-based language (vs. English: stress-based language)
• prosodic prominences are the primary cues used to segment speech into syntactico-semantic units
• prosodic boundaries have a central role in French prosody

My contribution to the transcription of French prosody
Proposed alternative to the ToBI standard [Silverman et al., 1992]:
• major prosodic boundary (FM)
• minor prosodic boundary (Fm)
• accent (P)

[Figure: illustration of the proposed prosodic transcription, with the frontier labels FM and Fm.]
3.3.4 Rhapsodie

Rhapsodie is a transcription system that has been developed in the Rhapsodie project (Rhapsodie: Reference Prosody Corpus of Spoken French) [Lacheret et al., 2010].

The Rhapsodie transcription system aims to provide a simple and unified transcription ground that can be shared among the existing phonological theories and description systems. The description of the prosodic variations is based on the perception of prosodic objects that are implicitly shared among the phonological theories, such as prosodic prominence and prosodic packaging. Prosodic prominence is defined as an acoustic saliency, and covers prosodic events that are marked by intonation or by any other acoustic cue. The perceptual description of prosodic variations presents several advantages over more sophisticated systems. First, a perceptual description does not require expert knowledge, and can be carried out by moderately trained individuals. Second, the transcription can be easily integrated into most of the existing models for further phonetic and phonological descriptions. In particular, the perceptual level provides a minimal description of the prosodic events that can be used to specify and describe the acoustic dimensions that may be phonetically and phonologically relevant.

The minimal prosodic unit used for the description is the syllable, and the maximal prosodic unit is the prosodic group. The transcription of prosodic events is based on the perception of prosodic prominences (P) and prosodic packages (minor and major prosodic frontiers). The transcription is processed recursively to account for the hierarchical organization of prosodic events. For this purpose, a variable temporal resolution is used to manage the relativity in the perception of prosodic prominences and to gradually refine the prosodic description. First, a segmentation into prosodic groups (PGs) is achieved within a large integration domain (typically 5-10 s, depending on the speaking style). The segmentation is based on the perception of a major prosodic prominence that is associated with the end of a prosodic package. Second, a segmentation into internal prosodic groups (IPGs) is achieved within each prosodic group. The segmentation is based on the perception of a minor prosodic prominence that is associated with the end of a prosodic package. Finally, residual prosodic prominences (P) are identified as the remaining perceived prosodic prominences that occur within the internal prosodic group. Additional symbols are used to manage uncertainty and underspecification on the presence and nature of a prosodic frontier, and on the presence of a prosodic prominence. Speech disfluencies are transcribed in parallel to the prosodic transcription


¹ Rhapsodie: reference prosody corpus of spoken French


Rich Syntactic Description




Rich Syntactic Description [Obin et al., 2011a]

Objective

Modelling the linguistic constraint to assign prosodic events

State of the Art
• Surface description of syntactic characteristics [Black and Taylor, 1994]
• Some attempts to integrate a rich description of syntactic characteristics [Ingulfsen et al., 2005]

Issue
• Integration of a rich description of syntactic characteristics

Contribution
• Use of deep syntactic parsing to model speech prosody



Linguistic Processing Used in this Work: ALPAGE

ALPAGE [Villemonte de La Clergerie, 2005]
• Linguistic processing chain developed for French
• Three modules: text pre-processing, surface parsing, and deep parsing

Text Pre-Processing
• segmentation of a text into words and sentences

Surface Parsing
• morpho-syntactic parsing (POS)




Deep Parsing

Deep parsing is used to describe the deep syntactic structure of a sentence
• Formalism: Tree Adjoining Grammar (TAG) [Joshi et al., 1975]

The syntactic structure can be described in terms of constituency or dependency.

Constituency
• describes the hierarchical structure of a sentence

Dependency
• describes the local dependencies that relate the words of a sentence




Constituency [Villemonte de La Clergerie, 2005]

[Figure: TAG analysis of the sentence "Longtemps, je me suis couché de bonne heure." (tokens E1F1 to E1F10). Elementary trees shown: tree 4 (prepositional group), tree 23 (final punctuation), tree 59 (noun group), tree 67 (auxiliary), tree 113 (preposed adjective), tree 166 (prepositional attachment on a verb), tree 171 (adverbs), tree 198 (verb in canonical construction). Node anchors, given as lemma:category:tree: longtemps:adv:171, ",":_:lexical, cln:cln:lexical, clr:clr:lexical, être:aux:67, coucher:v:198, de:prep:4, bon:adj:113, heure:nc:59, _:incise:incise, _:VMod:166, _:S:23. Edge labels: void, S, vmod, subject, N2, N, clr, incise, PP, Infl.]




Conversion into Dependency [Villemonte de La Clergerie, 2010]

[Figure: dependency graph of the sentence "Longtemps, je me suis couché de bonne heure." over the tokens E1F1 (Longtemps), E1F2 (,), E1F3 (je), E1F4 (me), E1F5 (suis), E1F6 (couché), E1F7 (de), E1F8 (bonne), E1F9 (heure), E1F10 (.). Nodes: longtemps (adverb), je (nominative clitic), me (reflexive clitic), être (auxiliary verb), coucher (verb), de (preposition), bon (adjective), heure (common noun), plus the non-lexical nodes incise, verb modifier, and sentence. Edge labels: incise, void, reflexive clitic, subject, Infl., S, V-mod, PP, N2.]



Summary

Syntactic characteristics used to model speech prosody

conventional characteristics

Surface Parsing

(1) morpho-syntactic (M)

proposed characteristics

Deep Parsing

(2) dependency (D)

(3) constituency (C)

(4) adjunction (A): specific TAG operation
• describes a large variety of syntactic constructions (e.g., clause, incise (parenthetical))
• shown to be relevant to model speech prosody



Evaluation of the Automatic Assignment of Prosodic Events

Scheme
• comparison of different sets of syntactic characteristics to model speech prosody
• speech database: neutral read speech (9 hrs)
• procedure: 10-fold cross-validation
• metric: F-measure

Results
• dramatic improvement for prosodic boundaries
  - FM = 20%
  - Fm = 10%
• no improvement for prosodic prominence: probably related to semantic constraint, not syntax

[Figure: Performance of the automatic assignment of prosodic events. F-measure for FM, Fm, and P with four sets of syntactic characteristics (MORPHO, DEPENDENCY, CHUNK, ADJUNCTION). Bar values across the four sets: FM 0.74, 0.76, 0.85, 0.95; Fm 0.44, 0.44, 0.50, 0.54; P 0.33, 0.33, 0.34, 0.34.]
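The evaluation scheme above (10-fold cross-validation scored with the F-measure) can be sketched as follows. This is a minimal illustration, not the evaluation code used in the work: the function names `f_measure` and `k_fold_indices`, the per-syllable label sequences, and the "-" symbol for "no event" are all assumptions made for the example.

```python
def f_measure(reference, predicted, label):
    """Precision, recall and F1 for one prosodic label over aligned syllable sequences."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)  # correctly assigned events
    fp = sum(1 for r, p in pairs if r != label and p == label)  # spurious events
    fn = sum(1 for r, p in pairs if r == label and p != label)  # missed events
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous test folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Hypothetical per-syllable annotations: FM/Fm = major/minor boundary, P = prominence, "-" = none.
reference = ["FM", "-", "P", "Fm", "-", "FM", "P", "-"]
predicted = ["FM", "-", "-", "Fm", "Fm", "FM", "P", "P"]

for label in ("FM", "Fm", "P"):
    print(label, f_measure(reference, predicted, label))
```

In a cross-validation run, each fold returned by `k_fold_indices` would serve once as the test set while the model is trained on the remaining items, and the F-measures would then be averaged over folds.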


Combination of Syntactic and Metric Constraints to Assign Pauses




Syntactic and Metric Constraints [Obin et al., 2011d]

Objective
Modelling syntactic and metric constraints to assign pauses

State of the Art
• Explicit integration of metric constraint in statistical modelling [Schmid and Atterer, 2004]

Issue
• Determine the adequate combination of syntactic & metric constraints

Contributions
• Use of Segmental HMMs + Dempster-Shafer Fusion



Syntactic and Metric Constraints [Obin et al., 2011d]

Segmental HMM
The segmental HMM [Ostendorf et al., 1996] addresses several limitations of the conventional HMM:
• in particular, segment duration is explicitly modelled (metric constraint)

d = sequence that describes the distance between consecutive pauses
o = sequence that describes the observed syntactic characteristics

\[
p(d \mid o) \;\propto\; \underbrace{p(o \mid d)}_{\text{linguistic contribution}} \times \underbrace{p(d)}_{\text{metric contribution}} \tag{1}
\]

Dempster-Shafer Fusion
• the reliability that can be conferred to different sources of information is explicitly formulated
• used to balance the syntactic and metric contributions to assign pauses
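To make the fusion step concrete, here is a minimal sketch of Dempster's rule of combination with reliability discounting, applied to a single candidate pause position. The slides do not give the exact formulation used in the work, so everything specific here is an assumption: the two-element frame {pause, no_pause}, the helper names `discount` and `combine`, and the example probabilities and reliabilities.

```python
from itertools import product

PAUSE = frozenset({"pause"})
NO_PAUSE = frozenset({"no_pause"})
THETA = PAUSE | NO_PAUSE  # the whole frame: mass placed here expresses ignorance


def discount(prob_pause, reliability):
    """Turn a source's pause probability into a mass function.

    A fraction (1 - reliability) of the belief is moved to THETA, so an
    unreliable source commits less to either hypothesis.
    """
    return {
        PAUSE: reliability * prob_pause,
        NO_PAUSE: reliability * (1.0 - prob_pause),
        THETA: 1.0 - reliability,
    }


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalise."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical scores: the syntactic source is trusted more than the metric one.
syntactic = discount(prob_pause=0.9, reliability=0.8)
metric = discount(prob_pause=0.4, reliability=0.3)
fused = combine(syntactic, metric)
print(fused[PAUSE], fused[NO_PAUSE], fused[THETA])
```

With a reliability of 0, a source contributes only ignorance and the other source is returned unchanged, which is one way to read the idea of balancing the syntactic and metric contributions.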



Evaluation of the Automatic Assignment of Pauses

Scheme
• same database, procedure, metric
• only for FM (can also be used for Fm)

Results
• the optimal configuration of syntactic/metric constraints significantly outperforms conventional methods
  - optimal: 96.3%
  - linguistic: 95.0%
  - linguistic/metric: 92.1%
• the syntactic constraint seems to have a greater influence than the metric constraint
• example of speech synthesis with automatic assignment of prosodic events

[Figure: Performance of the automatic assignment of pauses. F1 (0.65 to 1) as a function of the fusion weights α, β (−1 to 1) and of the linguistic context (combinations of the M, D, C, and A characteristics), with the metric, metric/linguistic, linguistic, and optimal configurations marked.]


Stylization and Trajectory Modelling of F0 contours




Objective
Modelling and synthesizing F0 variations from a text

State of the Art
• short-term modelling + local trajectory constraint (conventional HMM) [Tokuda et al., 2003]
• stylization + no trajectory constraint [Mishra et al., 2006]
• short-term modelling + long-term trajectory constraint [Latorre and Akamine, 2008, Qian et al., 2009]

Issues
• combining stylization with trajectory modelling of F0 variations

Contributions
• trajectory model fully based on the stylization of F0 contours over various temporal domains



Stylization of Speech Prosody [Obin et al., 2011b]

Stylization of F0 contours
• Modelling relevant melodic variations

[Figure: example F0 contour; axes: time [s] (0 to 0.4), f0 [Hz] (70 to 115).]

• Modelling contours over Various Temporal Domains

[Figure: F0 contour of an utterance (f0 [Hz] against time [s], 1 to 4.5 s) over the syllable and phrase domains.]
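As an illustration of what stylizing an F0 contour can mean in practice, the sketch below represents the contour of one unit (for example a syllable) by a few low-order coefficients of a discrete cosine transform and rebuilds a smoothed contour from them. The slides do not state which stylization is used, so the choice of the DCT, the function names, and the sample values are assumptions made for this example.

```python
import math


def dct_coefficients(contour, order):
    """First `order` orthonormal DCT-II coefficients of a sampled F0 contour."""
    n = len(contour)
    coeffs = []
    for k in range(order):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(contour)))
    return coeffs


def reconstruct(coeffs, n):
    """Inverse transform: rebuild an n-sample contour from (possibly truncated) coefficients."""
    contour = []
    for t in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        contour.append(value)
    return contour


# Hypothetical F0 samples (Hz) over one syllable.
f0 = [95.0, 98.0, 104.0, 110.0, 112.0, 109.0, 103.0, 97.0]

stylized = reconstruct(dct_coefficients(f0, 3), len(f0))  # keep level, slope, curvature
exact = reconstruct(dct_coefficients(f0, len(f0)), len(f0))  # all coefficients: lossless
print([round(v, 1) for v in stylized])
```

The same decomposition can be applied over longer domains (for example a phrase), in which case the low-order coefficients describe the slow melodic movement rather than the shape within a syllable.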



Trajectory Modelling

Stylization + Conventional Modelling
• During synthesis, the sequence of contours is determined as the sequence of mean contours (assumption of conditional independence)
• Example:

[Figure: synthesized F0 contours (f0 [Hz] against time [s]) for the syllables lo, ta, Z@, m@, sHi, ku, Se, d2, bO, n9R.]

• This may result in inadequate phrasing



Trajectory Modelling

Stylization + Trajectory Modelling
• During synthesis, the sequence of contours is determined under the constraint of long-term trajectories
• Example:

[Figure: synthesized F0 contours (f0 [Hz] against time [s]) for the syllables lo, ta, Z@, m@, sHi, ku, Se, d2, bO, n9R.]

• The long-term trajectory constraint leads to adequate phrasing
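The difference between taking the sequence of mean contours and generating under a trajectory constraint can be illustrated with a deliberately simplified example. The slides do not give the equations of the proposed model; the sketch below instead shows the general idea of a first-order (adjacent-unit) constraint, in the spirit of maximum-likelihood parameter generation with static and delta statistics. It works on one scalar F0 value per syllable, and the function name, the Gaussian assumptions, and all numbers are hypothetical.

```python
def generate_trajectory(means, variances, delta_means, delta_variances):
    """Most likely sequence x under independent Gaussians on x[t] (static)
    and on x[t+1] - x[t] (first-order trajectory constraint).

    Setting the gradient of the log-likelihood to zero yields a symmetric
    tridiagonal linear system, solved here with the Thomas algorithm.
    """
    n = len(means)
    diag = [1.0 / v for v in variances]
    rhs = [m / v for m, v in zip(means, variances)]
    off = [0.0] * (n - 1)  # off[t] couples x[t] and x[t+1]
    for t in range(n - 1):
        w = 1.0 / delta_variances[t]
        diag[t] += w
        diag[t + 1] += w
        off[t] = -w
        rhs[t] -= w * delta_means[t]
        rhs[t + 1] += w * delta_means[t]
    # Forward elimination.
    for t in range(1, n):
        factor = off[t - 1] / diag[t - 1]
        diag[t] -= factor * off[t - 1]
        rhs[t] -= factor * rhs[t - 1]
    # Back substitution.
    x = [0.0] * n
    x[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        x[t] = (rhs[t] - off[t] * x[t + 1]) / diag[t]
    return x


means = [100.0, 120.0, 110.0]  # hypothetical per-syllable mean F0 (Hz)
unit_var = [1.0, 1.0, 1.0]

# Very loose constraint: the result is (almost) the sequence of means.
independent = generate_trajectory(means, unit_var, [0.0, 0.0], [1e12, 1e12])
# Very tight "no jump" constraint: the sequence is pulled towards a flat line.
constrained = generate_trajectory(means, unit_var, [0.0, 0.0], [1e-4, 1e-4])
print(independent, constrained)
```

With a loose constraint each unit keeps its own mean, which corresponds to the conditional-independence case; tightening the constraint trades fidelity to the individual means for coherence across adjacent units.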



Subjective Evaluation of Stylization/Trajectory Model

Comparison of F0 models in Speech Synthesis
• conventional HMM (HTS) [Yoshimura et al., 1999]
• stylization + trajectory
  1. syllable + 1-order adjacent syllables (1-ORDER)
  2. syllable + minor prosodic phrase (AG)
  3. syllable + major prosodic phrase (PG)

Procedure
• CMOS preference experiment
• 8x4 synthesized speech utterances
• 20 native French speakers

Results
• the 1-order trajectory model is significantly preferred
• long-term trajectories (AG/PG) are only partially successful (due to the increase in complexity)

[Figure: CMOS scores for the HTS, 1ORDER, AG, and PG systems. Bar values, in no particular order: −0.38, −0.34, −0.18, 0.53.]


Style: a matter of Situation

Individual and Shared Identities

Each communication act instantiates a style which is composed of:
• an individual speaking style that depends on the speaker identity
• a conventional speaking style conditioned by a specific situation

[Diagram: SPEAKER and LISTENER.]


Objective

Modelling speaking style in speech synthesis

State of the Art
• modelling continuous characteristics of a speaking style (emotions) [Yamagishi et al., 2004]
• modelling discrete characteristics of a speaking style [Bell et al., 2006]

Issues
• modelling the discrete and continuous characteristics of a speaking style

Contributions
• study of the ability of human listeners to identify a speaking style
• reference for the evaluation of speaking style modelling
• study of the capacity of discrete/continuous HMMs to model characteristics of a speaking style


Ability of Human Listeners to Identify a Speaking Style



Ability of Human Listeners to Identify a Speaking Style

Identification Ability of Speaking Style [Obin et al., 2010]

Objective
• Study of the ability of human listeners to identify the speaking style of natural speech

Experiment
• 40 speech utterances with 4 speaking styles (church service (mass), political speech, journalistic speech, sport commentary)
• delexicalized to remove lexical access
• 72 participants (various language backgrounds)
• multiple-choice identification of speaking style by human listeners

[Figure: Confusion of natural speaking styles by human listeners, shown as a three-dimensional MDS map (MDS #1, MDS #2, MDS #3) of the utterances labelled M, P, J, and S.]


Discrete/Continuous Modelling of Speaking Style




Modelling Speaking Style [Obin et al., 2011c]

Objectives
• application of the discrete/continuous model to speaking style
• capacity of HMM-based speech synthesis to model speaking style

Principle
• average modelling of discrete and continuous characteristics of a speaking style (multiple speakers)

[Figure: Description of speakers' characteristics: log f0 (4.5 to 5.5) against log syllable duration (−2.2 to −1.2) for speakers M1 to M7, P1 to P5, J1 to J5, and S1 to S6.]
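A per-speaker description of this kind (one point per speaker in the log f0 / log syllable duration plane) could be computed along the following lines. This is a minimal sketch under stated assumptions: natural logarithms of Hz and seconds (which is consistent with the axis ranges of the figure but not confirmed by the slides), unvoiced frames coded as 0 Hz, and a hypothetical function name and input format.

```python
import math
from statistics import mean, variance


def describe_speaker(f0_hz, syllable_durations_s):
    """Mean and variance of log f0 and of log syllable duration for one speaker."""
    log_f0 = [math.log(f) for f in f0_hz if f > 0]  # skip unvoiced frames (coded as 0 Hz)
    log_dur = [math.log(d) for d in syllable_durations_s]
    return {
        "log_f0_mean": mean(log_f0),
        "log_f0_var": variance(log_f0),
        "log_dur_mean": mean(log_dur),
        "log_dur_var": variance(log_dur),
    }


# Hypothetical measurements for two speakers of different discourse genres.
speakers = {
    "S1": describe_speaker([180.0, 200.0, 0.0, 220.0, 210.0], [0.14, 0.16, 0.15]),
    "M1": describe_speaker([110.0, 0.0, 105.0, 100.0, 108.0], [0.26, 0.30, 0.28]),
}
for name, stats in speakers.items():
    print(name, round(stats["log_f0_mean"], 2), round(stats["log_dur_mean"], 2))
```

An average speaking style model would then pool such statistics over the speakers of one genre rather than modelling each speaker separately.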

DISCRETE/CONTINUOUS MODELLING OF SPEAKING STYLE IN HMM-BASED SPEECH SYNTHESIS: DESIGN AND EVALUATION

Nicolas Obin¹,², Pierre Lanchantin¹, Anne Lacheret², Xavier Rodet¹

¹ Analysis-Synthesis Team, IRCAM, Paris, France
² Modyco Lab., University of Paris Ouest - La Defense, Nanterre, France

ABSTRACT

This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles¹. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, and is compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.

Index Terms: speaking style, speech synthesis, speech prosody, average modelling.

1. INTRODUCTION

Each speaker has his own speaking style, which constitutes his vocal signature and a part of his identity. Nevertheless, a speaker continuously adapts his speaking style according to specific communication situations, and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a priori knowledge about these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity, and a conventional speaking style that is conditioned by a specific situation.

In speech synthesis, methods have been proposed to model and adapt the symbolic [3, 4] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [2]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and speaking style acoustic modelling is generally limited to the modelling of emotion, with rare extensions to other sources of speaking style variations [5].

This paper presents an average discrete/continuous HMM which is applied to the speaking style modelling of various discourse genres in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of the HMM-based speech synthesis is evaluated in the conditions of real-world applications. The paper is organized as follows: the speaking style corpus design is described in section 2; the average discrete/continuous HMM model is presented in section 3; the evaluation is presented and discussed in sections 4 and 5.

¹ This study was supported by ANR Rhapsodie 07 Corp-030-01; reference prosody corpus of spoken French; French National Agency of research; 2008-2012.

2. SPEECH & TEXT MATERIAL

2.1. Corpus Design

For the purpose of speaking style speech synthesis, a 4-hour multi-speaker speech database was designed. The speech database consists of four different DGs: catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intra-variability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only.

[Figure: log f0 against log syllable duration for speakers M1 to M7, P1 to P5, J1 to J5, and S1 to S6.]

Fig. 1. Prosodic description of the speaking styles depending on the speaker. Mean and variance of f0 and speech rate. Axes: log f0 and log(1/speech rate).

The following is a description of the four selected DGs:



Evaluation of Ability of Human Listeners to Identify a Synthetic Speaking Style

Experiment
• 40 synthesized speech utterances, delexicalized (same as previously)
• 50 participants (various language backgrounds)
• multiple-choice identification of speaking style by human listeners

Results
• discrete/continuous HMMs consistently model the characteristics of a speaking style (with the exception of church service (mass))

Question
• what is the contribution of discrete/continuous characteristics?

[Figure: Comparison of identification obtained for natural and synthetic speech. Cohen's kappa for the MASS, POLITICAL, JOURNAL, and SPORT styles, with one series for natural speech and one for synthesized speech. The two series of bar values are 0.12, 0.28, 0.51, 0.68 and 0.38, 0.34, 0.54, 0.70.]
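Since the identification results are reported as Cohen's kappa, a short sketch of how such a score can be computed from listener responses may help. The exact protocol (for instance how responses are pooled over listeners) is not given in the slides; the per-style, one-against-the-rest variant and the toy responses below are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(true_labels, chosen_labels):
    """Agreement between true styles and listener choices, corrected for chance."""
    n = len(true_labels)
    observed = sum(1 for t, c in zip(true_labels, chosen_labels) if t == c) / n
    true_counts = Counter(true_labels)
    chosen_counts = Counter(chosen_labels)
    # Chance agreement: probability that both label the same style independently.
    expected = sum(true_counts[label] * chosen_counts[label] for label in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


def kappa_for_style(style, true_labels, chosen_labels):
    """One-against-the-rest kappa for a single speaking style."""
    return cohens_kappa([t == style for t in true_labels],
                        [c == style for c in chosen_labels])


# Hypothetical responses: M = mass, P = political, J = journalistic, S = sport.
true_styles = ["M", "M", "P", "P", "J", "J", "S", "S"]
chosen_styles = ["M", "P", "P", "P", "J", "S", "S", "S"]

print(round(cohens_kappa(true_styles, chosen_styles), 3))
for style in "MPJS":
    print(style, round(kappa_for_style(style, true_styles, chosen_styles), 3))
```

A kappa of 0 corresponds to chance-level identification and 1 to perfect identification, which is what makes the scores for natural and synthesized speech directly comparable.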



Synthesis of Speaking Styles

TEXT²

SPEAKING STYLE

² Neutral, Fairy Tale, "Little Tom Thumb"


Conclusion

Contributions to the Statistical Modelling of Speech Prosody

1. Statistical modelling of discrete and continuous characteristics of speech prosody
2. Combination of linguistic and metric constraints to assign pauses
3. Stylization and trajectory modelling of F0 contours

Contribution to the Integration of a Rich Linguistic Description

4. Use of deep syntactic parsing to model speech prosody characteristics

Contributions to the Modelling of Speaking Style

5. Study of the ability of listeners to identify a speaking style
6. Reference identification performance for the evaluation of speaking style modelling
7. Ability of discrete/continuous HMMs to model the characteristics of a speaking style


Further Directions

Modelling the Variety of Speech Prosody
• a speaker has various alternatives to pronounce the same utterance (intra-speaker variability)
• varying speech prosody is expected to improve the naturalness of synthetic speech
• examples of various speech prosody determined for the same sentence

Unified Modelling
• joint modelling of discrete/continuous characteristics
• that also accounts for alternatives
• objective: improving the coherence and variety of speech prosody

[Diagram: a SPEAKER producing alternative renderings of the same utterance: "HEY PATRICK!", "HEY PA-TRICK!", "HEY ## PA-TRICK!"]