LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM-BASED UNIT SELECTION SPEECH SYNTHESIS

Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China

Abstract
• This paper presents a method that uses unit embeddings, called unit vectors, to improve HMM-based unit selection speech synthesis.
• The model is composed of three DNNs: Unit2Vec learns the unit embeddings, and the other two model the cost calculations.
• The concatenation cost model can capture long-term dependencies among adjacent units.
• The method improves on the traditional HMM-based unit selection method.
• Unit vectors display phone-dependent clustering properties.

Unit2Vec: learns the unit embedding

Experiments
Average preference scores (%) on speech quality using the Chinese corpus (N/P: no preference)

Baseline   Prop_TC   Prop_All   N/P     p
40.00      31.67     -          28.33   0.0882
15.33      -         55.67      29.00   <0.001
-          15.00     52.67      32.33   <0.001

• Prop_All > Prop_TC (p < 0.001).
• Prop_TC vs. Baseline: similar preference (p > 0.05).
• Prop_TC sounds more expressive than Baseline, but it introduces more glitches.

Calculate target and concatenation cost
• Because the linguistic features are high-dimensional, the model extracts BN features from them.
• The target cost (targ) measures the overall acoustic difference.
• The concatenation cost (con) measures the long-term dependencies.
• A pruning search strategy reduces the amount of calculation.

Conditions
Database: Chinese newspaper corpus with 12219 utterances from a female speaker. Training/validation/test set: 11608/611/100.
Acoustic Features: 12-order MCCs, 1-order power, 1-order F0, and a 1-order binary U/V flag. The MCC, power, and F0 features include their dynamic components, giving (12+1+1)*3+1 = 43 dimensions.
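The dimension count under Acoustic Features can be checked with a short sketch. The poster only gives the arithmetic; the reading of "dynamic properties" as delta and delta-delta components, the particular delta windows, and the names `acoustic_dim` and `add_dynamics` are assumptions made for illustration.

```python
# Static stream sizes per frame, as listed under "Acoustic Features".
STATIC_DIMS = {"mcc": 12, "power": 1, "f0": 1}
N_WINDOWS = 3  # assumed: static + delta + delta-delta
UV_DIMS = 1    # binary U/V flag, stored without dynamic components


def acoustic_dim(static_dims=STATIC_DIMS, n_windows=N_WINDOWS, uv_dims=UV_DIMS):
    """Return the per-frame acoustic feature dimension."""
    return sum(static_dims.values()) * n_windows + uv_dims


def add_dynamics(track):
    """Attach delta and delta-delta values to a 1-D static track.

    Assumed windows (not stated on the poster):
      delta       = 0.5 * (x[t+1] - x[t-1])
      delta-delta = x[t+1] - 2 * x[t] + x[t-1]
    Edge frames are repeated so every frame has two neighbours.
    """
    padded = [track[0]] + list(track) + [track[-1]]
    frames = []
    for t in range(1, len(padded) - 1):
        prev, cur, nxt = padded[t - 1], padded[t], padded[t + 1]
        frames.append((cur, 0.5 * (nxt - prev), nxt - 2.0 * cur + prev))
    return frames


print(acoustic_dim())  # 43
```

Each static value expands to three numbers, so the 14 static dimensions become 42, and the U/V flag adds the final one.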
Systems
• Baseline: the HMM-based unit selection system described in [1].
• Prop_TC: replaces part of the target cost in Baseline.
• Prop_All: further replaces part of the concatenation cost in Prop_TC.

Average reconstruction errors of DNNs

                 Unit2Vec
MCD (dB)         2.1338     3.4267     3.3449
F0-RMSE (Hz)     18.4240    38.2256    35.1925
CORR             0.9673     0.8524     0.8746
V/UV error (%)   0.7049     5.2502     4.9525

• MCD: distortion in mel-cepstrum.
• F0-RMSE and V/UV error: distortion in F0.
• CORR: Pearson correlation coefficient.
• ... > ... because of the use of history.

HMM-Based Unit Selection
Figure: structures of the (a) and (b) models. Input labels: target linguistic feature, candidate linguistic feature, preceding candidate unit vectors.

C_{targ}(t_n, u_n) = \| \hat{v}_n - v_n \|^2

C_{con}(u_{n-K}, \cdots, u_{n-1}, u_n) = \| v_n - \tilde{v}_n \|^2

Here t_n is the target specification at position n, u_n is a candidate unit with learnt unit vector v_n, \hat{v}_n is the unit vector predicted by the target cost model, and \tilde{v}_n is the unit vector predicted by the concatenation cost model from the K preceding candidate unit vectors v_{n-K}, \cdots, v_{n-1}.

Conclusion
• Unit2Vec learns a fixed-length vector as the embedding of each unit, and further DNNs model the cost calculations.
• Subjective evaluation demonstrates the effectiveness of these models.
• The models can handle long-term dependencies among candidate units.

Reference
[1] Z.-H. Ling and R.-H. Wang, "HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. IV–1245.

Figure: Visualization of the phone-dependent distributions of learnt unit vectors using t-SNE.

Proposed Method
Figure label: Save the matrix
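The cost calculation and the pruning search mentioned on the poster can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes both costs are squared Euclidean distances between a predicted unit vector and a candidate's unit vector, uses a plain beam search as the pruning strategy, and replaces the concatenation cost DNN with a hypothetical stand-in (`mean_history`). The names `beam_search`, `history_len`, and `beam_width` are also illustrative.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_history(history):
    """Hypothetical stand-in for the concatenation cost DNN: average the history."""
    dim = len(history[0])
    return [sum(v[d] for v in history) / len(history) for d in range(dim)]


def beam_search(predicted, candidates, predict_next=mean_history,
                history_len=2, beam_width=3):
    """Pick one candidate per target position with a pruned search.

    predicted:  target-side unit vectors, one per position
    candidates: for each position, a list of candidate unit vectors
    Returns (total_cost, chosen_candidate_indices).
    """
    beams = [(0.0, [])]  # (accumulated cost, chosen candidate indices)
    for n, (target_vec, cands) in enumerate(zip(predicted, candidates)):
        expanded = []
        for cost, path in beams:
            # Unit vectors of up to `history_len` previously chosen candidates.
            history = [candidates[m][path[m]]
                       for m in range(max(0, n - history_len), n)]
            for idx, vec in enumerate(cands):
                step = squared_l2(target_vec, vec)  # target cost
                if history:                         # concatenation cost
                    step += squared_l2(predict_next(history), vec)
                expanded.append((cost + step, path + [idx]))
        # Pruning: keep only the cheapest partial paths.
        expanded.sort(key=lambda item: item[0])
        beams = expanded[:beam_width]
    return beams[0]


predicted = [[0.0], [1.0], [2.0]]
candidates = [[[0.0], [5.0]], [[1.0], [9.0]], [[2.0], [-3.0]]]
print(beam_search(predicted, candidates))  # (3.25, [0, 0, 0])
```

Because the concatenation cost depends on several preceding units rather than only the previous one, an exact search would have to track every combination of recent choices; keeping a fixed number of partial paths is one simple way to bound that work.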
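The four measures in the reconstruction error table can be illustrated with commonly used definitions. The poster does not give the formulas, so the MCD constant, the convention that an F0 of 0 marks an unvoiced frame, and the function names `mcd_db` and `f0_metrics` are assumptions.

```python
import math


def mcd_db(ref, gen):
    """Mean mel-cepstral distortion in dB over frames.

    Assumed definition: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    averaged over frames. ref and gen are lists of per-frame coefficient lists.
    """
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [const * math.sqrt(sum((r - g) ** 2 for r, g in zip(rf, gf)))
                 for rf, gf in zip(ref, gen)]
    return sum(per_frame) / len(per_frame)


def f0_metrics(ref_f0, gen_f0):
    """Return (F0 RMSE in Hz, Pearson correlation, V/UV error in %).

    Assumed convention: an F0 value of 0 marks an unvoiced frame. RMSE and
    correlation use only frames voiced in both tracks; at least two such
    frames with non-constant F0 are needed for the correlation.
    """
    pairs = list(zip(ref_f0, gen_f0))
    mismatched = sum((r > 0) != (g > 0) for r, g in pairs)
    vuv_error = 100.0 * mismatched / len(pairs)

    voiced = [(r, g) for r, g in pairs if r > 0 and g > 0]
    rmse = math.sqrt(sum((r - g) ** 2 for r, g in voiced) / len(voiced))

    mean_r = sum(r for r, _ in voiced) / len(voiced)
    mean_g = sum(g for _, g in voiced) / len(voiced)
    cov = sum((r - mean_r) * (g - mean_g) for r, g in voiced)
    var_r = sum((r - mean_r) ** 2 for r, _ in voiced)
    var_g = sum((g - mean_g) ** 2 for _, g in voiced)
    corr = cov / math.sqrt(var_r * var_g)
    return rmse, corr, vuv_error


print(f0_metrics([100.0, 110.0, 0.0, 120.0], [102.0, 108.0, 0.0, 0.0]))
# (2.0, 1.0, 25.0)
```

Lower MCD, F0-RMSE, and V/UV error indicate a closer reconstruction, while a CORR value nearer to 1 indicates that the reconstructed F0 contour follows the reference more closely.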