Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)

Feb 07, 2017

Xavier Giro
Page 1: Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)

[course site]

Day 4 Lecture 1

Speaker ID II

Javier Hernando

Page 2

DL Modeling i-Vectors

Page 3

DL Modeling i-Vectors

O. Ghahabi, J. Hernando, “Deep Learning Backend for Single and Multi-Session i-Vector Speaker Recognition”, to appear in IEEE Trans. Audio, Speech and Language Processing

Page 4

Decoder

Problem:

● A large amount of impostor data (negative samples)
● Very little target data (positive samples)

Solutions:

● Global impostor selection
● Clustering using K-means
● Equally distributing positive and negative samples among minibatches
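The last solution above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `balanced_minibatches`, the choice to repeat the scarce positives cyclically, and the 1/0 labels are all assumptions made for illustration. The negatives passed in could be, for example, the impostors kept after selection and K-means clustering.

```python
import random
from itertools import cycle, islice


def balanced_minibatches(positives, negatives, batch_size, seed=0):
    """Yield minibatches holding as many positive as negative samples.

    Assumption of this sketch: the few positives are repeated cyclically,
    while each negative is used at most once per pass.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even")
    if not positives or not negatives:
        raise ValueError("both classes need at least one sample")
    half = batch_size // 2
    rng = random.Random(seed)
    negs = list(negatives)
    rng.shuffle(negs)
    pos_stream = cycle(positives)  # endless stream of the scarce class
    # Leftover negatives that do not fill half a batch are dropped.
    for start in range(0, len(negs) - half + 1, half):
        neg_part = negs[start:start + half]
        pos_part = list(islice(pos_stream, half))
        batch = [(x, 1) for x in pos_part] + [(x, 0) for x in neg_part]
        rng.shuffle(batch)  # mix the two classes inside the batch
        yield batch
```

With 3 target samples, 20 impostor samples and `batch_size=8`, this yields 5 minibatches, each holding 4 positives and 4 negatives.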

Page 5

DL Modeling i-Vectors

Page 6

DL Modeling i-Vectors

Page 7

DL Modeling i-Vectors

Page 8

DL Modeling i-Vectors

Page 9

DL Modeling i-Vectors

Page 10

DL Feature Classification

Page 11

DL Feature Classification

Speaker Verification

Speaker Identification

Credit: S. H. Yella

Page 12

DL Feature Classification

P. Safari, O. Ghahabi, J. Hernando, “Restricted Boltzmann Machines for speaker vector extraction and feature classification”, Proc. URSI 2016

Page 13

DL i-vector Extraction

Page 14

DL i-vector Extraction

Page 15

DL ‘speaker-vectors’

Page 16

RBM vectors

P. Safari, O. Ghahabi, J. Hernando, “From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation”, Odyssey Speaker and Language Recognition Workshop, June 2016

Page 17

RBM vectors

Page 18

RBM vectors

Page 19

CDBN vectors

Page 20

CDBN vectors

H. Lee et al., “Unsupervised feature learning for audio classification using convolutional deep belief networks”, Advances in Neural Information Processing Systems, 22:1096–1104, 2009

Page 21

DL ‘supervector like’ estimation

Page 22

Tasks

Page 23

SoA Speaker Diarization

Page 24

DL in Speaker Diarization

Page 25

DL Feature Classification

Page 26

Speaker Clustering: Speaker Comparison

Shallow Speaker Comparison

S. H. Yella et al., “Artificial Neural Network Features for Speaker Diarization”, IEEE Spoken Language Technology Workshop, pp. 402-406, 2014

Speaker errors obtained on the AMI and ICSI datasets for matched and mismatched training conditions. MFCC corresponds to the baseline clustering using BIC. ANN+MFCC refers to the ANN shown in the figure on the right.

Page 27

DL ‘speaker-vectors’

Page 28

Speaker Embeddings

Mickael Rouvier et al., “Speaker Diarization through Speaker Embeddings”, 23rd European Signal Processing Conference, 2015

Page 29

Speaker Embeddings

Speaker embeddings of size 500, rearranged as 10x50 matrices. Two utterances are shown for each speaker.

2D projection of four Speaker Embeddings using PCA.
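A 2D PCA projection like the one in this figure can be sketched in a few lines. The slides do not say how the projection was computed. The code below is only an illustrative assumption: the function name `pca_2d` is hypothetical, and power iteration with deflation is used instead of a linear-algebra library to keep the sketch dependency-free.

```python
import math
import random


def pca_2d(vectors, iters=200, seed=0):
    """Project equal-length vectors onto their first two principal axes."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    rng = random.Random(seed)
    axes = []
    for _ in range(2):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            # Multiply w by the unnormalised covariance: w <- X^T (X w).
            scores = [sum(a * b for a, b in zip(row, w)) for row in centered]
            w = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(d)]
            # Deflation: remove the part of w along axes already found.
            for axis in axes:
                dot = sum(a * b for a, b in zip(w, axis))
                w = [a - dot * b for a, b in zip(w, axis)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:  # absolute tolerance, adequate for a sketch
                w = [0.0] * d  # no variance left in this direction
                break
            w = [a / norm for a in w]
        axes.append(w)
    # Coordinates of every centered vector along the two axes.
    return [[sum(a * b for a, b in zip(row, axis)) for axis in axes]
            for row in centered]
```

Calling `pca_2d(embeddings)` on a list of 500-dimensional embeddings returns one `[x, y]` pair per embedding, which can then be plotted with one color per speaker. The sign of each axis is arbitrary, as in any PCA.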

Page 30

DL ‘speaker-vectors’

Page 31

CNN BN Features

Yanick Lukic et al., “Speaker Identification and Clustering using Convolutional Neural Networks”, IEEE International Workshop on Machine Learning for Signal Processing, 2016

Five speaker representations in 2 dimensions. The left figure shows the output vector of the softmax layer L8. The right figure shows the corresponding output vector of the dense layer L5. Different colors are assigned to different speakers.

Page 32

CNN BN Features

● The sizes of L5 and L7 are proportional to the number of speakers.
● L5 and L7 outperform the softmax layer L8, and L7 is better than L5.
● Training data (speaker amount) must be above 10 × (# speakers) for good performance.