Speaking Style Conversion Dr. Elizabeth Godoy Speech Processing Guest Lecture December 11, 2012


Speaking Style Conversion

Dr. Elizabeth Godoy

Speech Processing Guest Lecture

December 11, 2012


Apply voice conversion (VC) principles to a different problem…


Speech Intelligibility Context

Speech is often heard in adverse conditions
  Noisy environments
  Listener has difficulty hearing/understanding

How to transform speech to make it more intelligible…?
  To make speech synthesis systems more effective

Example of speech with environmental barriers: the speech is not very intelligible!

[Audio samples: noise, no noise]


Intelligible Speaking Styles

I. Lombard speech
  Speaker is immersed in noise
  Human reflex to increase the speech loudness

II. Clear speech
  Listener faces barrier (noise, hearing, language, …)
  Speaker adapts strategy to increase speech clarity

[Audio samples: normal vs. Lombard; casual vs. clear]


VC to improve speech intelligibility?

Voice Conversion
  Modify speech to change the speaker identity
  Learn transformation from source to target speaker

Speaking Style Conversion
  Modify speech to improve intelligibility
  Determine transformation from normal to intelligible style

Spectral Envelope: still very important!


Overview: Analyses-to-Modifications

I. Acoustic analyses to identify (mainly spectral) characteristics of Lombard & Clear styles
  i. Average Spectra
  ii. Vowel Spaces

II. Results of analyses inspire spectral modifications to improve intelligibility
  i. Spectral energy band boosting (corrective filters)
  ii. Formant shifting (frequency warping)


Corpora

Lombard-normal: Grid
  8 speakers (4 male, 4 female)
  50 sentences each
  LombardNinf96: most extreme (Lu & Cooke)

Clear-casual: LUCID read sentences
  8 speakers (4 male, 4 female)
  50 sentences each
  Read speech: most exaggerated (Baker & Hazan)


Average Relative Spectra

Recall Amplitude Scaling in DFWA:

$$\log A_q(f) = \log S_q^{y}(f) - \log S_q^{x}\big(W_q^{-1}(f)\big)$$

The Average Relative Spectrum is similar: the difference between normal (X) and intelligible (Y) style, averaged across all frames:

$$\log S^{R}(f) = \operatorname{mean}_{q}\big[\log S_q^{Y}(f) - \log S_q^{X}(f)\big]$$
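The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `average_relative_spectrum`, the array layout (frames by frequency bins), and the assumption that X and Y frames are already paired are all hypothetical, and the data is synthetic.

```python
import numpy as np

def average_relative_spectrum(env_x, env_y, eps=1e-12):
    """Mean over frames of the log-envelope difference Y minus X, in dB.

    env_x, env_y: arrays of shape (n_frames, n_bins) holding linear-magnitude
    spectral envelopes of the normal (X) and intelligible (Y) styles.
    Assumption: frames are already paired one-to-one between the two styles.
    """
    log_x = 20.0 * np.log10(np.maximum(env_x, eps))
    log_y = 20.0 * np.log10(np.maximum(env_y, eps))
    return np.mean(log_y - log_x, axis=0)

# Synthetic check: Y is X boosted by exactly 6 dB in bins 2..4.
rng = np.random.default_rng(0)
env_x = rng.uniform(0.1, 1.0, size=(50, 8))
boost_db = np.zeros(8)
boost_db[2:5] = 6.0
env_y = env_x * 10.0 ** (boost_db / 20.0)

rel_db = average_relative_spectrum(env_x, env_y)
```

On this synthetic input, `rel_db` recovers the boost that was applied, which is the sense in which the average relative spectrum summarizes a style difference.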


Average Relative Spectra (by Speaker)

[Figure: average relative spectra for each speaker, two panels: "GRID Average Relative Spectra for each speaker" (Lombard-normal) and "LUCID Average Relative Spectra for each speaker" (Clear-casual). Axes: frequency (Hz), speaker index, level (dB).]


Average Relative Spectra (Overall)

Lombard speech: Spectral energy boosting “where formants are” (~500-4500 Hz)

Clear speech: Varies with speaker strategy; differences are mild overall

[Figure: "Average Relative Spectra: All frames, All speakers". Curves: Lombard-normal, Clear-casual. Axes: frequency (Hz) vs. level (dB).]


Vowel Spaces (average for all speakers)

Lombard speech: Vowel Space Translation

Clear speech: Vowel Space Expansion

[Figure: two vowel-space plots, F1 (Hz) vs. F2 (Hz): "Clear-casual: Vowel Space, ALL Speakers" (casual, clear) and "Lombard-normal: Vowel Space, ALL Speakers" (normal, lombard).]
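The difference between translation and expansion can be checked numerically with the vowel-space area: a pure translation leaves the polygon area unchanged, while an expansion increases it. The Python sketch below assumes hypothetical (F1, F2) vowel means, not values from the Grid or LUCID corpora; `polygon_area` and `centroid` are illustrative helper names.

```python
import numpy as np

def polygon_area(points):
    """Shoelace area of a polygon given as (n, 2) vertices in perimeter order."""
    x, y = np.asarray(points, dtype=float).T
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def centroid(points):
    """Mean of the vertices, used here as the centre of the vowel space."""
    return np.asarray(points, dtype=float).mean(axis=0)

# Hypothetical (F1, F2) means in Hz for three corner vowels.
base = np.array([[300.0, 2300.0], [700.0, 1300.0], [350.0, 900.0]])

# Translation: every vowel moves by the same offset.
translated = base + np.array([50.0, 100.0])

# Expansion: every vowel moves away from the centroid by a factor of 1.2.
expanded = centroid(base) + 1.2 * (base - centroid(base))

area_base = polygon_area(base)
area_translated = polygon_area(translated)
area_expanded = polygon_area(expanded)
```

Scaling about the centroid by a factor s multiplies the area by s squared, so the expanded space here is 1.44 times the original, while the translated space keeps the original area.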


Inspiration for Speech Modifications

1. Spectral energy band boosting (Lombard)
2. Vowel space expansion (Clear)

Features associated with increased speech intelligibility

Though not observed together in human speech production…

Signal processing algorithms can accomplish both!


Spectral Energy Band Boosting

Corrective Filters

[Figure: two panels, frequency (Hz) vs. gain (dB): "Average Correction Filter for All Speakers" (legend: all frames; Enhanced (Lombard: high SII)) and "Spectral Energy Band Boosting, Varying Gain 0:0.5:3".]

Lombard-inspired & Enhanced (high SII) Corrective Filter: Varying Gain
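One way to read "varying gain" is as a scalar that multiplies the corrective filter's dB curve before it is applied to each frame. That reading is an assumption, as are the +5 dB band shape and the function name `apply_corrective_filter` in this Python sketch; only the 0:0.5:3 gain range and the rough 500-4500 Hz band come from the slides.

```python
import numpy as np

def apply_corrective_filter(mag, filter_db, gain=1.0):
    """Scale a dB filter curve by `gain` and apply it to magnitude spectra.

    mag: (n_frames, n_bins) linear magnitudes.
    filter_db: (n_bins,) corrective filter in dB.
    """
    return mag * 10.0 ** (gain * filter_db / 20.0)

freqs = np.linspace(0.0, 8000.0, 257)

# Hypothetical boost curve: +5 dB inside 500-4500 Hz, 0 dB elsewhere.
filter_db = np.where((freqs >= 500.0) & (freqs <= 4500.0), 5.0, 0.0)

mag = np.ones((10, freqs.size))        # flat toy spectrum, 10 frames
gains = np.arange(0.0, 3.5, 0.5)       # 0, 0.5, ..., 3
boosted = [apply_corrective_filter(mag, filter_db, g) for g in gains]
```

A gain of 0 leaves the spectrum unchanged, and each step of 0.5 adds another 2.5 dB inside the boosted band. A practical system would likely also renormalize overall energy, which this sketch omits.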


Frequency Warping for VS Expansion

Curve fitting formant shifts inspires warping…

[Figure: two panels. "Clear-casual: Vowel Space, ALL Speakers" (casual, clear): F1 (Hz) vs. F2 (Hz). "LUCID: Frequency differences for F1, F2; ALL" (F1diff, F2diff): casual F1 and F2 (Hz) vs. frequency difference (Hz).]
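The path from fitted formant shifts to a warping function can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the method from the lecture: the shift values follow a made-up linear trend rather than LUCID measurements, the fit is first order, and the warping is pinned to the identity at 0 Hz and at an assumed 8000 Hz Nyquist frequency. The name `warp_envelope` is hypothetical.

```python
import numpy as np

nyquist = 8000.0
freqs = np.arange(0.0, nyquist + 1.0, 50.0)    # analysis grid in Hz

# Hypothetical formant shifts (Hz) on a linear trend that pushes low
# formants down and high formants up; not measured LUCID values.
formant_hz = np.array([300.0, 600.0, 1000.0, 1800.0, 2500.0])
shift_hz = 0.05 * formant_hz - 50.0

# Curve fit: shift as a first-order polynomial of formant frequency.
coeffs = np.polyfit(formant_hz, shift_hz, deg=1)

# Warping anchors: identity at 0 Hz and Nyquist, fitted shift in between.
src = np.concatenate(([0.0], np.linspace(300.0, 2500.0, 23), [nyquist]))
tgt = src.copy()
tgt[1:-1] += np.polyval(coeffs, src[1:-1])
assert np.all(np.diff(tgt) > 0), "warping function must be increasing"

def warp_envelope(env):
    """Move envelope content from frequency f to w(f) by resampling."""
    source_freq = np.interp(freqs, tgt, src)   # inverse warp on the grid
    return np.interp(source_freq, freqs, env)

# Toy envelope with a single peak at 2000 Hz.
env = np.exp(-0.5 * ((freqs - 2000.0) / 100.0) ** 2)
warped = warp_envelope(env)
peak_hz = freqs[np.argmax(warped)]
```

With this trend the fitted shift at 2000 Hz is +50 Hz, so the toy peak lands at 2050 Hz. Building the warp from anchor points and checking that they increase keeps the mapping invertible, which a raw polynomial extrapolated outside the formant range would not guarantee.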


Sound Samples

With Noise (SSN, 0 dB): Original, Warp, Boost, BW

No Noise: Original, WarpE, Boost, BW


Want more?

See Maria’s presentation for more details …


Voice & Speaking Style Conversion Parallels

Voice Conversion: Dynamic Frequency Warping + Amplitude Scaling

(based on acoustic-phonetic spaces of source & target speakers)

Speaking Style Conversion: Frequency Warping + Corrective Filter

1. Clear-speech inspired frequency warping for vowel space expansion

2. Lombard-speech inspired corrective filters to increase loudness


Thank you!

More Questions?


Extras…


Objective Metrics for Evaluation

I. Loudness
  Energy in frequency bands weighted based on human hearing

II. Speech Intelligibility Index (SII)
  Energy & modulations in frequency bands relative to a noise masker
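The "energy relative to a noise masker" idea can be illustrated with the band-audibility step at the core of the SII: the per-band speech-to-noise ratio is clipped to the range -15 to +15 dB, mapped linearly to 0..1, and summed with band-importance weights. The Python sketch below is a deliberately reduced version: it assumes equal band importance, ignores masking spread, hearing threshold and level distortion, and does not model the modulation (time-varying) part of the extended SII. The function name `band_audibility_index` and the band levels are hypothetical.

```python
import numpy as np

def band_audibility_index(speech_db, noise_db, importance=None):
    """Importance-weighted band audibility in [0, 1].

    speech_db, noise_db: per-band levels in dB, same length.
    Band SNR is clipped to [-15, +15] dB and mapped linearly to [0, 1].
    """
    speech_db = np.asarray(speech_db, dtype=float)
    noise_db = np.asarray(noise_db, dtype=float)
    if importance is None:
        # Assumption: equal importance per band; real weights are tabulated.
        importance = np.full(speech_db.shape, 1.0 / speech_db.size)
    snr = np.clip(speech_db - noise_db, -15.0, 15.0)
    audibility = (snr + 15.0) / 30.0
    return float(np.sum(importance * audibility))

noise = np.full(4, 60.0)
normal = np.full(4, 60.0)                              # 0 dB SNR in every band
boosted = normal + np.array([0.0, 6.0, 6.0, 0.0])      # hypothetical mid-band boost

index_normal = band_audibility_index(normal, noise)
index_boosted = band_audibility_index(boosted, noise)
```

Raising the level of the two middle bands by 6 dB raises the index from 0.5 to 0.6 in this toy case, which is consistent with a band-energy metric rewarding spectral boosting.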


Loudness Distributions

Lombard speech: “louder” for voiced (bi-modal)

Clear speech: not “louder” than casual speech

Transients: neither style distinguishes on average

[Figure: two loudness histograms (x-axis: loudness value): casual vs. clear, and normal vs. lombard.]


Extended SII Distributions

extSII highly correlated with average loudness

Lombard speech objectively more intelligible

Clear speech intelligibility gain not captured by extSII
  limitations of objective intelligibility metrics

[Figure: two extended SII histograms (x-axis: SII): casual vs. clear, and normal vs. lombard.]


Observations from Analyses

Lombard Speech
  Spectral boosting in inclusive formant region
    Increase in Loudness (also extSII)
  Vowel space translation, but no expansion

Clear Speech
  Small changes in average spectra (slight spectral “flattening”)
  Consistent vowel space expansion
    Greater vowel discrimination

Comparison between styles
  Acoustic differences translate into perceptual distinctions linked to intelligibility gains
  Spectral boosting & Vowel space expansion: mutually exclusive
