Towards optimal TTS corpora CADIC Didier BOIDIN Cedric D'ALESSANDRO Christophe

Jan 04, 2016

Phyllis Preston

Towards optimal TTS corpora

CADIC Didier, BOIDIN Cedric, D'ALESSANDRO Christophe

2 Towards optimal TTS corpora France Telecom Group restricted

Unit-selection TTS

This is an example.

Linguistic modules

Unit selection

Unit concatenation

Speaker database

How to prepare the recording script?

Preparation of the recording script

Classic optimization approach:

Criterion = diphone and triphone coverage

Algorithm = greedy, corpus condensation
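The classic approach can be illustrated with a minimal sketch of greedy corpus condensation driven by diphone coverage. The function names (`diphones`, `greedy_condensation`), the representation of a sentence as a list of phoneme symbols, and the `max_sentences` budget are assumptions made for illustration. The slides do not specify the exact scoring used; practical variants often normalize the gain by sentence length, which this sketch does not do.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs found in one sentence."""
    return set(zip(phonemes, phonemes[1:]))


def greedy_condensation(corpus, max_sentences):
    """Greedily pick the sentences that add the most not-yet-covered diphones.

    corpus: list of sentences, each a list of phoneme symbols (assumed format).
    Returns the indices of the selected sentences and the covered diphones.
    """
    covered = set()
    selected = []
    remaining = list(range(len(corpus)))
    while remaining and len(selected) < max_sentences:
        # Sentence with the largest coverage increment; ties go to the earliest.
        best = max(remaining, key=lambda i: len(diphones(corpus[i]) - covered))
        gain = diphones(corpus[best]) - covered
        if not gain:
            break  # nothing left in the reference corpus adds coverage
        covered |= gain
        selected.append(best)
        remaining.remove(best)
    return selected, covered


# Toy corpus: letters stand in for phonemes.
corpus = [list("abc"), list("ab"), list("cda")]
selected, covered = greedy_condensation(corpus, max_sentences=10)
print(selected, len(covered))
```

The loop stops as soon as no remaining sentence contributes a new diphone, which is where the finite reference corpus caps the achievable coverage.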

The link between diphone or triphone coverage and the final TTS quality is not clear.

The process is constrained by the limited combinations encountered in the finite reference corpus.

Our optimization approach:

Criterion = vocalic sandwich coverage

Algorithm = greedy, sentence construction

Vocalic sandwiches (Cadic et al., Interspeech 2009)

Sentence construction

Towards optimality

Finite State Transducers compute "optimal" sequences of sandwiches, so that:

- the coverage increment is maximized (greedy approach)

- only sandwich transitions observed in a reference corpus are allowed

No syntactic or semantic considerations are taken into account, so generated sequences are likely to be nonsense.

Towards readability

Development of a semi-automatic tool that lets an operator iteratively correct generated sequences, in order to build an acceptable and almost optimal sentence.

(I don't the week of the six.)

(I don't…)

(I don't take it out…)

(I don't take it out the weeks…)

(I don't take it out the weeks like you.)

(I don't take it out the black weeks,)

The procedure is time-consuming (around 3 min – 50 steps – to build a plausible sentence)

Most built sentences lack semantic coherence (redundancy is minimized at the price of semantics)

Built scripts are much denser than with corpus condensation

Density increase of 30 to 40% compared to condensation

[Figure: sandwich coverage rate (%)]

Conclusion

For the creation of unit-selection TTS recording scripts:

• We suggested using the Vocalic Sandwiches Coverage Rate as the optimization criterion (since it is a convenient symbolic approximation of the selection cost).

• We presented a novel corpus building technique, based on sentence construction rather than sentence selection. The procedure is time-consuming and built sentences tend to lack semantic coherence, but a density increase of 30 to 40% can be obtained.

Recent work (SSW7 submission)

• Extensive evaluation of vocalic sandwiches as an optimization criterion.

• Construction of full recording scripts. Density estimations seem to be confirmed. However, semantic limitations had significant repercussions on the reading stage.

Database constitution: two ways

Rushes from DVD, websites…

- The only way to access otherwise inaccessible voices

- Expensive process, poor TTS quality

OR

Dedicated recordings (script read by a speaker)

- Control of the content, hence best TTS quality

Vocalic sandwiches (Cadic et al., Interspeech 2009)

Given an input sentence, the selection module searches the database for units presenting:

- The best match to the target sequence (target cost)

- Minimum distortion between consecutive units (concatenation cost)

Illustration
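The trade-off between target cost and concatenation cost is commonly resolved with a dynamic-programming (Viterbi) search over candidate units. The slides do not say which search the system uses, so the following is only a sketch under that assumption. The function `select_units`, the two cost callbacks, and the toy unit format `(label, pitch)` are all hypothetical.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search: pick one candidate per target, minimizing the total cost.

    targets: list of target descriptions (one per position).
    candidates: candidates[t] is the list of database units usable at position t.
    target_cost(target, unit) and concat_cost(prev_unit, unit) return numbers.
    Returns (total_cost, chosen_units).
    """
    # best[j] = (cost of the cheapest path ending in candidates[t][j], that path)
    best = [(target_cost(targets[0], c), [c]) for c in candidates[0]]
    for t in range(1, len(targets)):
        new_best = []
        for c in candidates[t]:
            # Cheapest way to reach c from any unit kept at the previous position.
            prev_cost, prev_path = min(
                ((cost + concat_cost(path[-1], c), path) for cost, path in best),
                key=lambda cp: cp[0],
            )
            new_best.append((prev_cost + target_cost(targets[t], c), prev_path + [c]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example: a unit is (label, pitch); joins are penalized by the pitch jump.
targets = ["a", "b"]
candidates = [[("a", 100), ("a", 120)], [("b", 118), ("b", 90)]]
cost, path = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: 0 if u[0] == t else 1,
    concat_cost=lambda u, v: abs(u[1] - v[1]),
)
print(cost, path)
```

In this toy case every candidate matches its target, so the search is decided by the concatenation cost alone and picks the pair with the smallest pitch jump.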

Correlations of coverage rates with the selection cost:

Vocalic sandwiches -0.78

Diphones -0.44

Triphones -0.64
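A negative correlation here means that a higher coverage rate tends to go with a lower selection cost, and the value closest to -1 indicates the criterion that tracks the cost most tightly. As a reminder of how such a figure is computed, here is a minimal sketch of the Pearson correlation coefficient. The function name `pearson` and the sample numbers are invented for illustration and are not the data behind the figures above; the slides also do not state which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented example: coverage rate (%) of four scripts and their mean selection cost.
coverage_rate = [20.0, 35.0, 50.0, 65.0]
selection_cost = [9.0, 7.5, 7.0, 4.0]
print(round(pearson(coverage_rate, selection_cost), 2))
```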

Illustration

Sentence construction

Finite State Transducers compute "optimal" sequences of sandwiches, so that:

- the coverage increment is maximized (greedy approach)

- only sandwich transitions observed in a reference corpus are allowed

Coverage increment is averaged over the sequence length.

15 FSTs give 15 optimal sandwich sequences, one for each length ≤ 15:

Optimal sequence of length 1

Optimal sequence of length 2

Optimal sequence of length 3

Optimal sequence of length 4

…

Optimal sequence of length 15

#_b_i_z_#

#__ _t_e_#

#_i_l_p_a_ _t_e_#

#_i_l_p_a_ _a_s_j__#

#_i_l_p_a_ _f__ _p_u_ _d_ _m_ _d_ _p_u__v_w_a_ _d_ _l_a_p_a_s_s__t_k_o_m_ _#