[course site] Day 4 Lecture 1 Speaker ID II Javier Hernando
Feb 07, 2017
DL Modeling i-Vectors
O. Ghahabi, J. Hernando, Deep Learning Backend for Single and Multi-Session i-Vector Speaker Recognition, to appear in IEEE Trans. Audio, Speech and Language Processing
Decoder
Problem:
A large amount of impostor data (negative samples)
Very little target data (positive samples)
Solutions:
Global Impostor Selection
Clustering using K-means
Equally distributing positive and negative samples among minibatches
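A minimal sketch of how the last two solutions might fit together, assuming the K-means centroids of the impostor vectors stand in as the negative samples (an assumption; the slide does not say how the clusters are used). The global impostor selection step is not shown. The function names `kmeans_centroids` and `balanced_minibatches`, the vector dimension, and the random data are all hypothetical and used only for illustration.

```python
import numpy as np

def kmeans_centroids(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids of the rows of x."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def balanced_minibatches(target, negatives, n_batches, per_class, seed=0):
    """Yield (x, y) minibatches with per_class positives and per_class negatives."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        # Targets are scarce, so they are sampled with replacement.
        pos = target[rng.integers(0, len(target), size=per_class)]
        neg = negatives[rng.integers(0, len(negatives), size=per_class)]
        x = np.vstack([pos, neg])
        y = np.concatenate([np.ones(per_class), np.zeros(per_class)])
        order = rng.permutation(2 * per_class)  # shuffle within the batch
        yield x[order], y[order]

# Synthetic stand-ins: many impostor vectors, very few target vectors.
rng = np.random.default_rng(1)
impostors = rng.normal(size=(500, 8))
targets = rng.normal(loc=2.0, size=(3, 8))

centroids = kmeans_centroids(impostors, k=10)
batches = list(balanced_minibatches(targets, centroids, n_batches=4, per_class=5))
```

Each minibatch here holds the same number of positive and negative rows, so the few target vectors are reused across batches while the impostor side is reduced to a small set of representatives.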
DL Feature Classification
P. Safari, O. Ghahabi, J. Hernando, “Restricted Boltzmann Machines for speaker vector extraction and feature classification”, Proc. URSI 2016
RBM vectors
P. Safari, O. Ghahabi, and J. Hernando. From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation. Odyssey Speaker and Language Recognition Workshop, June 2016
CDBN vectors
Unsupervised feature learning for audio classification using convolutional deep belief networks, H. Lee et al., Advances in Neural Information Processing Systems, 22:1096–1104, 2009
Speaker Clustering: Speaker Comparison
Shallow Speaker Comparison
Harsha et al. “Artificial Neural Network Features for Speaker Diarization”. IEEE Spoken Language Technology Workshop. (2014) 402-406
Speaker errors obtained on the AMI and ICSI datasets for matched and mismatched training conditions. MFCC corresponds to baseline clustering using BIC. ANN+MFCC refers to the ANN shown in the right figure.
Speaker Embeddings
Mickael Rouvier et al. “Speaker Diarization through Speaker Embeddings”. 23rd European Signal Processing Conference. (2015)
Speaker Embeddings
Speaker embeddings of size 500, rearranged as 10x50 grids. Two utterances are shown for each speaker.
2D projection of four Speaker Embeddings using PCA.
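A minimal sketch of the two visualizations described in these captions: reshaping a 500-dimensional embedding into a 10x50 grid, and projecting several embeddings to 2D with PCA. The embeddings below are random stand-ins, not data from the paper, and `pca_project` is a hypothetical helper that computes PCA through a singular value decomposition.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project the rows of x onto their top principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-ins for four 500-dimensional speaker embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 500))

# One embedding rearranged as a 10x50 grid, e.g. for display as an image.
grid = embeddings[0].reshape(10, 50)

# All four embeddings projected onto the first two principal components.
points_2d = pca_project(embeddings)
```

Each row of `points_2d` gives the 2D coordinates of one embedding, which can then be drawn as a scatter plot.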
CNN BN Feature
Yanick Lukic et al. “Speaker Identification and Clustering using Convolutional Neural Networks”. In 2016 IEEE International Workshop on Machine Learning for Signal Processing. (2016)
Five speaker representations in 2 dimensions. The left figure shows the output vector of the softmax layer L8. The right figure shows the corresponding output vector of the L5 dense layer. Different colors are assigned to different speakers.